Analytics with pandas: What’s working?
~ Building a subscriber acquisition report using pandas
Backstory:
As you know from my previous posts, I’ve started a new blog. In the last post, we looked at user behavior on the blog and specifically answered the question of whether promoting recent blogs also brings traffic to the older ones; if you haven’t read that already, check it out here. Now I’m also interested in which social platform brings the most traffic, and whether there is a trend across different blogs. This will help me focus my efforts on the channels that work for promoting my blog.
Problem:
As always, I don’t have this data available directly, but we’ll work with what we have. What I do have is visitor logs from my blog, which include some info about the source, telling us how the user landed on my blog.
Data:
Here’s a simplified version of the visitor_log data:
visit_id is unique for a single user visit (key for this table)
visit_timestamp is the DateTime of the visit
url is the URL with the blog name that was clicked on
source_url is the URL the user clicked on to reach the blog (simplified version)
+----------+--------------------+---------+-----------+---------------+
| visit_id | visit_timestamp    | user_id | url       | source_url    |
+----------+--------------------+---------+-----------+---------------+
| tr739    | 1/12/2022 14:35:27 | f73r9   | ../sleep/ | instagram.com |
| gu3fd    | 1/12/2022 10:05:17 | ufvwq   | ../sleep/ | linkedin.com  |
| g83o     | 1/12/2022 9:30:00  | ofibq   | ../sleep/ | twitter.com   |
| fi2k2    | 1/12/2022 9:10:20  | ufvwq   | ../sleep/ | instagram.com |
| 0932h    | 1/1/2022 16:07:17  | g5hy4   | ../sweet/ | linkedin.com  |
| ih3fo    | 1/1/2022 13:15:27  | hh83r   | ../sweet/ | twitter.com   |
+----------+--------------------+---------+-----------+---------------+
What I want to get to:
STOP! Before reading ahead, take some time to think about the problem yourself before looking at my approach.
My approach:
So, like the other problems I’ve worked on in the past, you can take different approaches here. Since this problem involves some operations that aren’t super straightforward in SQL, I opted to use Python in this case, more specifically pandas, Python’s data-analysis library. If you haven’t heard of it (which would be surprising), read here. The other reason I opted for Python is that I also wanted the end result emailed to me on a weekly basis, which is possible with Python. One caveat: I won’t be including things like dependencies and all that jazz, because I try to focus on the core logic, but of course without all that nothing here would run.
So let’s dive in, starting with how I got the data into a pandas dataframe in the first place. Here I did use SQL to fetch the data; pandas has a way to connect to a database and run a SQL query whose result is then stored in a dataframe. Below is how I connected to BigQuery.
import pandas as pd

query = """
SELECT *
FROM visitor_log;
"""
# Run the query on BigQuery and pull the result into a dataframe
df = pd.read_gbq(query, project_id="kale-and-quinoa", progress_bar_type="tqdm_notebook", dialect="standard")
This gets me the visitor_log table in a dataframe, with which I can now do all sorts of things. A word of caution: because my data isn’t that big, I don’t use any filters in my query to limit the results. If you have a lot of data, pandas may choke, and that’s where you move to the more advanced big-data realm; more on that in the future.
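For example, if this table were large, one way to keep things manageable is to filter in the query itself before anything reaches pandas. A minimal sketch, assuming I only cared about the last 90 days (the cutoff is purely illustrative):
# Hypothetical filtered query: only pull recent visits so the dataframe stays small
query = """
SELECT *
FROM visitor_log
WHERE visit_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);
"""
df = pd.read_gbq(query, project_id="kale-and-quinoa", dialect="standard")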
The way I like to go about these is to first clean up the data and then derive the pieces I need from what’s already there. Here we need to make sure visit_timestamp is usable, since our solution depends on it. Thinking about what we can derive, we can get the source of the visit and the blog name clicked on from the respective URLs with basic string manipulation.
To do this I use lambda functions together with apply: the former is a shorthand way of writing functions, and the latter lets us apply them across the dataframe easily.
# Make the timestamp an actual datetime so we can do date math on it
df['visit_timestamp'] = pd.to_datetime(df['visit_timestamp'])
# "instagram.com" -> "instagram"
df['source'] = df['source_url'].apply(lambda x: x.split(".")[0])
# "../sleep/" -> "sleep"
df['blog'] = df['url'].apply(lambda x: x.split("/")[1])
Let’s now jump into things that aren’t super straightforward but are needed for the final result, starting with the past week’s analytics; half of the report depends on these. First, let’s think about what we’re trying to count: it’s the number of unique visits, which is the visit_id field. Next we need some way to tell whether a visit happened in the week before the results are generated. We can do this by using the visit_timestamp field as follows.
from datetime import datetime

# Keep the visit_id only if the visit happened within the last 7 days,
# otherwise leave it as None
df['visit_id_past_week'] = df[['visit_id','visit_timestamp']].apply(
    lambda x:
        x['visit_id'] if (datetime.now() - x['visit_timestamp']).days <= 7
        else None,
    axis=1
)
Now, running a count on visit_id_past_week only gets us visits in the past week, since the NULLs (None) are ignored.
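As a quick illustration, count() in pandas only counts non-null values, so the two counts below would differ:
# count() skips None/NaN, so the Nones from older visits are excluded
total_visits = df['visit_id'].count()
past_week_visits = df['visit_id_past_week'].count()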
Before jumping into those counts for the report, let’s think about whether the visits are actually unique. From the data, you may notice that user_id=ufvwq opened the same blog twice from two different platforms (must be a real fan).
+----------+--------------------+---------+-----------+---------------+
| visit_id | visit_timestamp    | user_id | url       | source_url    |
+----------+--------------------+---------+-----------+---------------+
| gu3fd    | 1/12/2022 10:05:17 | ufvwq   | ../sleep/ | linkedin.com  |
| fi2k2    | 1/12/2022 9:10:20  | ufvwq   | ../sleep/ | instagram.com |
+----------+--------------------+---------+-----------+---------------+
Now here’s where the interpretation of counts gets a little subjective and should be clarified when reporting on such data. For this case, I only want to count the visit where the user first encountered the blog. Stated otherwise, for every unique user-blog combo, I want the first visit (determined by visit_timestamp). Since the data here is sorted by visit_timestamp in descending order, this means the last row (index=-1 in Python) for every user_id + blog combo. Here’s what this looks like in code.
# Take the last row per user_id + blog group: with the descending sort, that is the earliest visit
df = df.groupby(['user_id','blog']).nth(-1).reset_index()
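If you’d rather not depend on the incoming sort order, an equivalent way (just a sketch, not what I ran) is to sort explicitly and keep the earliest row per user_id + blog:
# Order-independent alternative: sort by timestamp, then keep the earliest visit per user + blog
df = df.sort_values('visit_timestamp').drop_duplicates(subset=['user_id','blog'], keep='first')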
Here’s what the data looks like after all that work.
+----------+--------------------+-----------+-----------+
| visit_id | visit_id_past_week | blog      | source    |
+----------+--------------------+-----------+-----------+
| tr739    | tr739              | sleep     | instagram |
| g83o     | g83o               | sleep     | twitter   |
| fi2k2    | fi2k2              | sleep     | instagram |
| 0932h    | null               | sweet     | linkedin  |
| ih3fo    | null               | sweet     | twitter   |
+----------+--------------------+-----------+-----------+
We’re finally at the point where we can start building the desired report. You may notice that the logical next step is to somehow transform the source column into multiple columns, one per source. This transformation reshapes the data from a long to a wide format, a rather common one in data analysis; here’s an article that goes into it. The way you do this in pandas is with the pivot function.
The first step is to go from raw data to the aggregated counts for the report. Here we’re looking for counts of total and past-week visits at the blog + source grain.
# Step 1: unique visit counts (total and past week) per blog + source
acq_report = df.groupby(['blog','source'])[['visit_id','visit_id_past_week']]\
    .nunique()\
    .reset_index()
acq_report.rename(
    columns={
        'visit_id': 'totalVisits',
        'visit_id_past_week': 'pastWeekVisits'
    },
    inplace=True
)
Now that we have the counts, we can reshape the data as below. Here index is our final grain (the blog name in this case), columns is the column to be spread out, and values are the numbers being broken down.
# Step 2: reshape from long to wide -- one row per blog, one column per source
acq_report = acq_report.pivot(
    index='blog',
    columns='source',
    values=['totalVisits','pastWeekVisits']
).reset_index()
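The pivot leaves us with two-level column names like ('totalVisits', 'linkedin'). Here’s roughly how one might flatten them into single names such as totalVisits linkedin, which the later steps reference (the exact renaming is up to you):
# Collapse the MultiIndex columns into flat strings, e.g. ('totalVisits','linkedin') -> 'totalVisits linkedin'
acq_report.columns = [' '.join(col).strip() if isinstance(col, tuple) else col for col in acq_report.columns]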
The results from here, after that bit of renaming, look as follows for the records above.
Now we’re missing the overall totals that we started with, since we broke them down by source while reshaping, as well as the percentage calculations. To get these, I use a little regex, as it scales better than writing out individual column names.
# Sum every per-source totalVisits column back into an overall total
acq_report['totalVisits'] = acq_report.filter(
    regex='totalVisits.*', axis=1
).sum(axis=1)
# Share of total visits coming from LinkedIn
acq_report['% LinkedIn'] = round(
    (acq_report['totalVisits linkedin'] / acq_report['totalVisits']) * 100, 2
)
# Similarly for others
This gets us everything we want from our report; we can make it look pretty with a little reordering and renaming, sketched below. Having this report gives me insight into how the different social media platforms are performing for marketing my blog, and lets me make data-driven decisions about which platform to focus on for promotion.
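As an example of that last bit of polish, something along these lines (the column names here are purely illustrative):
# Hypothetical final touch-up: pick a column order and friendlier names for the emailed report
final_cols = ['blog', 'totalVisits', '% LinkedIn', '% Instagram', '% Twitter']
acq_report = acq_report[final_cols].rename(columns={'blog': 'Blog', 'totalVisits': 'Total Visits'})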
Food for thought:
Earlier in the blog I hinted at automating this report as an email to myself on a weekly basis; how can one do that with Python?
When doing data analysis we mostly read data from files, but this may not always be the case. You may need to read data from other sources like an API, a webpage, a graph database, and so on. Think about how that would work.
Earlier in the blog, I also hinted at pandas failing to read the data due to size constraints; how would you approach the problem if that were the case?
Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on Twitter @abhishek27297, where I talk data.