Analytics with SQL: Have we met before?
~ Identifying new & returning users with SQL
Backstory:
As you know from my previous posts, I’ve started a mailing list to promote my blog. In the last post in this series, we designed an approach to fix the data inconsistencies in our mailing list’s subscriber history. If you haven’t checked it out already, you can find it here. This time we’ll be doing something a little different, moving beyond subscriber analytics into the broader customer analytics realm. The aim is to better understand how we define a user and how we differentiate between a new and a returning user. What we’ll be doing today will feel like trying to remember someone you think you’ve met before but can’t remember when. So get ready to jog your memory, and let’s dive in!
Problem:
You must’ve seen social media platforms report their yearly total user count broken down by new and returning users. I’m trying to do something similar for my blog: I want an accurate count of the users who have visited it, how many of them were new, and how many were returning. For a social media platform, a user is anyone who creates an account, but my blog doesn’t require you to sign up to read, so I can’t use the same approach. I’ll have to get creative about how I differentiate new from returning users to get the counts I need.
Data:
Every platform has a way it acquires customers; for an eCommerce website, that’s captured in its transactions. For my blog, it’s our old friend clickstream data, which we’ve worked with in the past. This same data can also help us determine whether a user is new or returning. Here’s what it looks like:
+-----+-------------+--------+-----------+-----------+------+-----+
| # | ip | mac_id | device_id | region_id | date | ... |
+-----+-------------+--------+-----------+-----------+------+-----+
| 1 | 101.102.1.5 | 97tg2r | xyz | 4 | 1/15 | ... |
| 2 | 101.102.3.7 | gu98o2 | eub | 5 | 1/15 | ... |
| 3 | 101.102.6.6 | oufg2q | eiu | 3 | 1/15 | ... |
| 4 | 101.102.9.9 | ghiof3 | ppp | 3 | 1/15 | ... |
| 5 | 101.102.0.0 | hgoi82 | efs | 2 | 1/15 | ... |
| 6 | 101.102.3.7 | gu98o2 | eub | 5 | 1/10 | ... |
| 7 | 101.102.9.9 | ghiof3 | ppp | 2 | 1/10 | ... |
| ... | ... | ... | ... | ... | ... | ... |
+-----+-------------+--------+-----------+-----------+------+-----+
Here the combination of the fields (ip + mac_id + device_id + region_id) can help us identify a unique user, which makes it our key. One thing to note, however, is that region_id may differ between visits, since a user may access the blog from a different region the next time.
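To get a feel for how well this composite key separates users, here’s a minimal sketch (assuming the table is simply named clickstream, as in the samples above) that counts distinct users under the four-part and the three-part key; the gap between the two numbers hints at how often region_id changes for the same visitor:
-- Distinct users under each candidate key
SELECT
COUNT(DISTINCT CONCAT(ip, '|', mac_id, '|', device_id, '|', CAST(region_id AS STRING))) AS users_four_part_key,
COUNT(DISTINCT CONCAT(ip, '|', mac_id, '|', device_id)) AS users_three_part_key
FROM clickstream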
STOP! Before reading ahead, take some time to think about how you’d solve this yourself, and then compare it with my approach.
My approach:
The first thing I need to identify new vs. returning users is to know whether a specific combination of the key fields has appeared before. For this, I’m looking to extend the clickstream table with an array of fields that tell me about the last activity from this user. Something like this (here ‘la’ stands for last activity):
+------+----+-------------+-----------+--------------+-------------+
| # | .. | la.ip | la.mac_id | la.device_id |la.region_id |
+------+----+-------------+-----------+--------------+-------------+
| 1 | .. | - | - | - | - |
| 2 | .. | 101.102.3.7 | gu98o2 | eub | 5 |
| 3 | .. | - | - | - | - |
| 4 | .. | 101.102.9.9 | ghiof3 | ppp | 2 |
| 5 | .. | - | - | - | - |
| 6 | .. | - | - | - | - |
| 7 | .. | - | - | - | - |
| .. | .. | .. | .. | .. | .. |
+------+----+-------------+-----------+--------------+-------------+
From this I can see that #2 is a returning user if I consider the four-part key (ip + mac_id + device_id + region_id). However, since region_id may change between visits, I should also consider a three-part key (ip + mac_id + device_id), which makes #4 a returning user too, while the rest are new users. And if both keys find a match, we need a way to give preference to the four-part key over the three-part one, since it defines a user more precisely. So let’s try extending our clickstream data with this last_activity array. Here’s how you’d do that in SQL:
-- Unique user visits by day
WITH user_visits AS (
SELECT ip, mac_id, device_id, region_id, date
FROM clickstream
GROUP BY ip, mac_id, device_id, region_id, date
),
-- Last activity STRUCT based on four-part key
user_last_activity_four_key AS (
SELECT *,
STRUCT(
LAG(ip) OVER(PARTITION BY ip, mac_id, device_id, region_id ORDER BY date) AS ip,
LAG(mac_id) OVER(PARTITION BY ip, mac_id, device_id, region_id ORDER BY date) AS mac_id,
LAG(device_id) OVER(PARTITION BY ip, mac_id, device_id, region_id ORDER BY date) AS device_id,
LAG(region_id) OVER(PARTITION BY ip, mac_id, device_id, region_id ORDER BY date) AS region_id
) AS last_activity,
1 AS solution_rank
FROM user_visits
),
-- Last activity STRUCT based on three-part key
user_last_activity_three_key AS (
SELECT *,
STRUCT(
LAG(ip) OVER(PARTITION BY ip, mac_id, device_id ORDER BY date) AS ip,
LAG(mac_id) OVER(PARTITION BY ip, mac_id, device_id ORDER BY date) AS mac_id,
LAG(device_id) OVER(PARTITION BY ip, mac_id, device_id ORDER BY date) AS device_id,
LAG(region_id) OVER(PARTITION BY ip, mac_id, device_id ORDER BY date) AS region_id
) AS last_activity,
2 AS solution_rank
FROM user_visits
),
-- Combining the two last_activity structs, preferring an actual match and then the four-part key
user_last_activity AS (
SELECT * EXCEPT(solution_rank),
RANK() OVER(PARTITION BY ip, mac_id, device_id, region_id, date ORDER BY IF(last_activity.ip IS NULL, 1, 0), solution_rank) AS solution_rank
FROM (
SELECT * FROM user_last_activity_three_key
UNION ALL
SELECT * FROM user_last_activity_four_key
)
)
-- Selecting the best unique last_activity match for each visit
SELECT ip, mac_id, device_id, region_id, date, ARRAY_AGG(last_activity) AS last_activity
FROM user_last_activity
WHERE solution_rank = 1
GROUP BY ip, mac_id, device_id, region_id, date
Let’s break down the query a little. We start with the unique IDs we’ve captured by day. Then, considering the four-part key, we find the value of each field for the previous occurrence of that key using the LAG function. We repeat the same with the three-part key (excluding region_id from the PARTITION BY clause), packaging these fields into a STRUCT called last_activity each time. When combining the results from the two keys, we RANK the candidates so that only the best match survives: an actual (non-NULL) match is preferred over no match, and the four-part key wins over the three-part key when both find one. Finally, we collect the winning last_activity STRUCT into an ARRAY, which ensures all the last activities for a visit are grouped together on the same row. Arrays come in handy whenever you want to capture multiple values for a column on a single row; for example, in eCommerce, if multiple cards were used to pay for a transaction, an array field can store the details of all those cards on that transaction’s row.
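As a quick illustration of that last point, here’s a minimal sketch of how such an array column could be built; the payments table and its card columns are hypothetical, purely for illustration:
-- One row per transaction, with every card used stored in an array
SELECT
transaction_id,
ARRAY_AGG(STRUCT(card_number, card_type, amount_charged)) AS cards_used
FROM payments
GROUP BY transaction_id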
From this I can now derive new users by looking at last_activity: if it’s NULL, it’s a new user. The next step would be to feed this info into a database that generates not just a user_id but a full user profile for this user (more on this later). The clickstream data is one source for generating a profile; if tomorrow I start selling merch on my blog, I can add the eCommerce data as another source for this profile-creation process. The aim is to have a user_profile generation pipeline fed by these tables (I’ve sketched what that step could look like right after the table below). Here’s a glimpse of what our new user_profile table will look like:
+---------+--------------+-------------+------------+-----+
| user_id | profile_type | web.ip | web.mac_id | ... |
+---------+--------------+-------------+------------+-----+
| 341 | WEB | 101.102.1.5 | 97tg2r | ... |
| 128 | WEB | 101.102.6.6 | oufg2q | ... |
| 221 | WEB | 101.102.0.0 | hgoi82 | ... |
| 877 | WEB | 101.102.3.7 | gu98o2 | ... |
| 234 | WEB | 101.102.9.9 | ghiof3 | ... |
| ... | ... | ... | ... | ... |
+---------+--------------+-------------+------------+-----+
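Here’s a rough sketch of what that profile-creation step could look like. I’m assuming the output of the earlier query (the visits extended with the last_activity array) has been saved back onto the clickstream table, that web is an array of structs holding the identifiers, and I’m using a FARM_FINGERPRINT hash of the three-part key purely as a placeholder for whatever ID generator the real pipeline would use:
-- Create a profile for every visit with no last activity, i.e. a new user
-- (a real pipeline would also check that the profile doesn't already exist)
INSERT INTO user_profile (user_id, profile_type, web)
SELECT
FARM_FINGERPRINT(CONCAT(ip, '|', mac_id, '|', device_id)) AS user_id,
'WEB' AS profile_type,
[STRUCT(ip, mac_id, device_id, region_id)] AS web
FROM clickstream, UNNEST(last_activity) AS la
WHERE la.ip IS NULL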
Now that I’ve got a way to identify users, I can extend the clickstream data with the user info as below. The reason I choose a view here is that the user_profile table is dynamic; it keeps growing with new profiles, which lets us identify both new and returning users from the same table.
-- A view with user_id + clickstream info
-- (clickstream is assumed to already carry the last_activity array built above)
CREATE VIEW user_clickstream AS (
SELECT up.user_id, clk.*
FROM clickstream AS clk
LEFT JOIN (
-- Flatten the web array so each profile row exposes its identifiers
SELECT user_id, wb.ip, wb.mac_id, wb.device_id, wb.region_id
FROM user_profile, UNNEST(web) AS wb
) AS up
ON clk.ip = up.ip
AND clk.mac_id = up.mac_id
AND clk.device_id = up.device_id
AND clk.region_id = up.region_id
)
Finally, I can leverage this view as below to get the counts that I’m looking for.
-- Total, new & returning user counts for past year
SELECT
COUNT(DISTINCT user_id) AS total_users,
COUNT(DISTINCT CASE WHEN la.ip IS NULL THEN user_id ELSE NULL END) AS new_users,
COUNT(DISTINCT CASE WHEN la.ip IS NOT NULL THEN user_id ELSE NULL END) AS returning_users
FROM user_clickstream, UNNEST(last_activity) AS la
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
Food for thought:
What if multiple user_ids can be linked to a single clickstream event? How do you resolve this?
What if there are multiple visits from the same user on a single day? The ORDER BY date in the last_activity calculation may cause issues; how do we account for this?
The new-user assignment process may still treat the same individual as two different users if they visit from different devices; how do we resolve this?
Why maintain a user_profile table like the one mentioned? What use cases does it solve?
Let me know how you’d approach it. If you found this helpful, share it. And if you’re into this kind of thing, you can find me on Twitter @abhishek27297, where I talk data.