Analytics with SQL: What keeps them coming back?
~Using Arrays in SQL to analyze user behavior
Backstory:
As you know from my previous posts, I’ve started a mailing list to promote my new blog. In the last post, we designed a report to analyze how my mailing list is performing; if you haven’t read that already, check it out here. Now I’m also interested in getting insights into user behavior, to see how users interact with my blog, which would help me create better content. To start with, I want to know whether users re-read my older blogs during their visits, since I only promote my latest ones. If this happens often enough, it would make sense to promote the older blogs too.
Problem:
As always, I don’t have this available directly. What I do have is clickstream data from my blog, which records what users clicked on during their visits. From there, I want to get to a point where I can draw conclusions about user behavior around re-reads on my blog.
Data:
Here’s a simplified version of the clickstream data:
session_id is unique for every page visit (the key for this table)
visit_id is unique for a single user visit
url is the URL containing the name of the blog that was clicked on
+------------+----------------+----------+---------+-----------------+
| session_id | date_time      | visit_id | user_id | url             |
+------------+----------------+----------+---------+-----------------+
| gy8r4      | 10/10 14:35:27 | 4fbje    | 123     | blog/seeds/     |
| 97r4g      | 10/10 14:31:10 | 4fbje    | 123     | blog/sleep/     |
| h08g       | 10/10 13:09:08 | gt329    | 101     | blog/yogurt/    |
| h186       | 10/10 13:05:18 | gt329    | 101     | blog/sleep/     |
| ...        |                |          |         |                 |
| g49fi      | 10/05 09:30:18 | 78g4f    | 123     | blog/oatmeal/   |
| 0g824      | 10/05 09:28:34 | 78g4f    | 123     | blog/ice_cream/ |
| 08h4f      | 10/05 09:25:07 | 78g4f    | 123     | blog/seeds/     |
+------------+----------------+----------+---------+-----------------+
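If you want to run the queries in this post yourself, here’s one way to mock up the sample table above as a CTE. This is purely hypothetical scaffolding: the year in the timestamps is made up, since the sample only shows month and day.
-- Hypothetical mock of the clickstream sample above, so the queries below can be run as-is.
-- The year (2023) is assumed; the sample data only shows month/day.
WITH clickstream AS (
  SELECT 'gy8r4' AS session_id, TIMESTAMP '2023-10-10 14:35:27' AS date_time,
         '4fbje' AS visit_id, 123 AS user_id, 'blog/seeds/' AS url UNION ALL
  SELECT '97r4g', TIMESTAMP '2023-10-10 14:31:10', '4fbje', 123, 'blog/sleep/' UNION ALL
  SELECT 'h08g',  TIMESTAMP '2023-10-10 13:09:08', 'gt329', 101, 'blog/yogurt/' UNION ALL
  SELECT 'h186',  TIMESTAMP '2023-10-10 13:05:18', 'gt329', 101, 'blog/sleep/' UNION ALL
  SELECT 'g49fi', TIMESTAMP '2023-10-05 09:30:18', '78g4f', 123, 'blog/oatmeal/' UNION ALL
  SELECT '0g824', TIMESTAMP '2023-10-05 09:28:34', '78g4f', 123, 'blog/ice_cream/' UNION ALL
  SELECT '08h4f', TIMESTAMP '2023-10-05 09:25:07', '78g4f', 123, 'blog/seeds/'
)
SELECT * FROM clickstream ORDER BY date_time DESC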
What I want to get to:
+--------------+-----------------------+-------+
| first_opened | reopened              | count |
+--------------+-----------------------+-------+
| sleep        | ['seeds','ice_cream'] | 150   |
| sleep        | ['yogurt']            | 136   |
| ...          |                       |       |
| oatmeal      | null                  | 10    |
+--------------+-----------------------+-------+
STOP! Before reading ahead, take some time to think about how you’d solve this yourself, then come back for my approach.
My approach:
To solve this problem I used arrays. Like most problems, this one can be solved in many ways without arrays, but I prefer the outcome with them. If you’ve never heard of arrays in SQL, here’s where you can learn more -> Arrays in BigQuery.
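If you just want a quick feel for what an array looks like in BigQuery before diving in, here’s a tiny standalone example (the values are made up for illustration):
-- An ARRAY is an ordered list of values that lives in a single column
SELECT
  ['sleep', 'seeds', 'yogurt'] AS opened,                 -- an array literal
  ARRAY_LENGTH(['sleep', 'seeds', 'yogurt']) AS num_blogs -- returns 3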
Let’s come back to the first_opened part later on. For now, looking at the clickstream data, it’s clear that we have a record for every user action; let’s assume the clickstream table has been distilled down to blog clicks only. So we need a way to aggregate this info into a single row per visit for each user. When we think about aggregating data, the go-to is numerical aggregation: summing, averaging, and all that jazz. Here, though, we need to combine multiple rows into a single row without losing the granular details, unlike the usual case. This is why we go with array aggregation, and the query for it looks something like this:
SELECT DATE(date_time) AS date, visit_id, user_id, ARRAY_AGG(opened) AS opened
FROM (
  SELECT date_time, visit_id, user_id, SPLIT(url, '/')[OFFSET(1)] AS opened
  FROM clickstream
)
GROUP BY DATE(date_time), visit_id, user_id
This gets us up to here:
+-------+----------+---------+---------------------------------+
| date  | visit_id | user_id | opened                          |
+-------+----------+---------+---------------------------------+
| 10/10 | 4fbje    | 123     | ['sleep','seeds']               |
| 10/10 | gt329    | 101     | ['sleep','yogurt']              |
| ...   |          |         |                                 |
| 10/05 | 78g4f    | 123     | ['seeds','ice_cream','oatmeal'] |
+-------+----------+---------+---------------------------------+
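One small note: ARRAY_AGG doesn’t guarantee any particular order of the elements inside the array. That doesn’t matter for the intersection we do next, but if the click sequence ever matters to you, you can make the order explicit, something like:
SELECT DATE(date_time) AS date, visit_id, user_id,
  ARRAY_AGG(opened ORDER BY date_time) AS opened  -- elements now follow click order
FROM (
  SELECT date_time, visit_id, user_id, SPLIT(url, '/')[OFFSET(1)] AS opened
  FROM clickstream
)
GROUP BY DATE(date_time), visit_id, user_id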
Now that we have an array column with all the blogs the user opened in a visit, the next step is to determine what the same user opened in their previous visit to the site. Here’s where good old window functions come into play. To get opened for the user’s previous visit, we use the LAG function as follows:
LAG(opened) OVER(PARTITION BY user_id ORDER BY date) AS lag_opened
This gets us blogs that the user opened in their previous visit, which for user_id=123 looks like this:
+-------+----------+---------+-------------------+---------------+
| date  | visit_id | user_id | opened            | lag_opened    |
+-------+----------+---------+-------------------+---------------+
| 10/10 | 4fbje    | 123     | ['sleep','seeds'] | ['seeds',...] |
+-------+----------+---------+-------------------+---------------+
But we’re looking for what was reopened, that is, what’s common between the two arrays opened and lag_opened. There’s no direct way to get this, so let’s break it down. To access a particular element of an array you can use OFFSET, as we did before to get the blog name from the URL. In this case, though, we need to compare all the elements of the two arrays. That requires us to first open the arrays up with UNNEST, take the intersection of the two with INTERSECT DISTINCT, and finally package the result back up as an array to maintain the grain. Doing all this in SQL looks like this:
ARRAY(
  SELECT * FROM UNNEST(opened)
  INTERSECT DISTINCT
  SELECT * FROM UNNEST(lag_opened)
) AS reopened
This would finally get us the following:
+-------+----------+---------+---------------------------------+------------+
| date  | visit_id | user_id | opened                          | reopened   |
+-------+----------+---------+---------------------------------+------------+
| 10/10 | 4fbje    | 123     | ['sleep','seeds']               | ['seeds']  |
| 10/10 | gt329    | 101     | ['sleep','yogurt']              | ['yogurt'] |
| ...   |          |         |                                 |            |
| 10/05 | 78g4f    | 123     | ['seeds','ice_cream','oatmeal'] | null       |
+-------+----------+---------+---------------------------------+------------+
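If you want to convince yourself that the UNNEST + INTERSECT DISTINCT pattern does what we think, here’s a minimal standalone query using made-up literal arrays:
-- Quick sanity check of the pattern on literal arrays; returns ['seeds']
SELECT ARRAY(
  SELECT * FROM UNNEST(['sleep', 'seeds'])
  INTERSECT DISTINCT
  SELECT * FROM UNNEST(['seeds', 'ice_cream', 'oatmeal'])
) AS reopened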
Now we’re at a point where we have an array of the blogs that were reopened, at a visit_id level, for every user. This brings us closer to the end goal. The next step is to get the first_opened blog at a visit_id level for every user. To do this we can use the date_time field and pick out the blog name of the first click in each visit using the RANK function in SQL.
SELECT visit_id, user_id, opened AS first_opened
FROM (
  SELECT visit_id, user_id, SPLIT(url, '/')[OFFSET(1)] AS opened,
    RANK() OVER(PARTITION BY visit_id, user_id ORDER BY date_time) AS open_rank
  FROM clickstream
)
WHERE open_rank = 1
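A quick aside on RANK: if two clicks in the same visit happen to share an identical date_time, both get rank 1 and you’d end up with two first_opened rows for that visit. If that’s a concern, ROW_NUMBER is a drop-in alternative that always keeps exactly one row per visit:
ROW_NUMBER() OVER(PARTITION BY visit_id, user_id ORDER BY date_time) AS open_rank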
Combining the two steps would easily get us the counts we’re looking for. So let’s put it all together in a single query that takes us from the clickstream data to the desired analytics.
WITH clickstream_opened AS (
  -- one row per click: blog name plus the click's rank within the visit
  SELECT
    date_time, visit_id, user_id,
    SPLIT(url, '/')[OFFSET(1)] AS opened,
    RANK() OVER(PARTITION BY visit_id, user_id ORDER BY date_time) AS open_rank
  FROM clickstream
),
clickstream_opened_arrays AS (
  -- one row per visit: all blogs opened in that visit, packed into an array
  SELECT
    DATE(date_time) AS date, visit_id, user_id, ARRAY_AGG(opened) AS opened
  FROM clickstream_opened
  GROUP BY DATE(date_time), visit_id, user_id
),
clickstream_lag_opened AS (
  -- for each visit, also pull in what the same user opened in their previous visit
  SELECT *,
    LAG(opened) OVER(PARTITION BY user_id ORDER BY date) AS lag_opened
  FROM clickstream_opened_arrays
),
clickstream_reopened AS (
  -- blogs common to the current and previous visit = what was reopened
  SELECT * EXCEPT(lag_opened),
    ARRAY(
      SELECT * FROM UNNEST(opened)
      INTERSECT DISTINCT
      SELECT * FROM UNNEST(lag_opened)
    ) AS reopened
  FROM clickstream_lag_opened
),
clickstream_first_opened AS (
  -- the first blog opened in each visit
  SELECT DATE(date_time) AS date, visit_id, user_id, opened AS first_opened
  FROM clickstream_opened
  WHERE open_rank = 1
)
SELECT first_opened, reopened, COUNT(1) AS count
FROM clickstream_first_opened
JOIN clickstream_reopened
USING(date, visit_id, user_id)
GROUP BY first_opened, reopened
ORDER BY COUNT(1) DESC
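One caveat: depending on your BigQuery version, grouping directly by an ARRAY column (reopened here) may be rejected. If the final SELECT complains about that, a simple workaround is to group on a string rendering of the array instead, for example:
-- Workaround if GROUP BY on an ARRAY column isn't supported in your environment
SELECT first_opened, TO_JSON_STRING(reopened) AS reopened, COUNT(1) AS count
FROM clickstream_first_opened
JOIN clickstream_reopened
USING(date, visit_id, user_id)
GROUP BY 1, 2
ORDER BY COUNT(1) DESC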
Food for thought:
How do I use these results? What decisions can be made?
Here, there’s an underlying assumption that users only have a single visit in a day. This may not be true, in which case, how can I account for multiple visits by a single user on the same day?
How can I add a timeframe component to this analysis, for example breaking it down by quarters?
Here we start from blog clicks only, but given the entire clickstream data, how can I filter out unwanted clicks like the homepage, socials, and so on? (One possible starting point is sketched below.)
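For that last question, one possible starting point, assuming every blog URL begins with 'blog/' as in the sample data, is to filter before aggregating:
SELECT *
FROM clickstream
WHERE STARTS_WITH(url, 'blog/')  -- keep blog clicks only, drop homepage/socials/etc.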
Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data.