Making Data Work for You: The Power of Metrics
Backstory:
As you know from my previous posts, I’ve been developing data solutions for my blog to provide my readers with a better experience. In the last post in this series, we took an old beat-up Subaru and turned it into a Ferrari by transforming it for performance and integrity, see here. This time we’ll take the previously designed Ferrari and give it Hyundai’s reliability & usability. We’ll do this by adding metrics to our user_clickstream table to make it more useful while maintaining its performance and integrity. Enough of car talk, let’s get ready to turn our data into a goldmine with the power of metrics.
The Problem:
While our newly designed user_clickstream, which combines clickstream and user info, is faster and more accurate, it’s not that useful. The reason is that the only insight derived from this is which user made a particular click, while this is helpful, not much can be done with it. Here’s where to make this more helpful I’m looking to add metrics to this data which will help me derive more useful insights from it like what brings back users or is there a pattern in user visits on my blog. However, I don’t want to compromise on the performance or integrity achieved from our last effort.
The Data:
Here’s the user_clickstream table that ties clickstream events to users:
+------+---------+------------+-------------+---------+-----+
| date | user_id | profile_id | ip | device | ... |
+------+---------+------------+-------------+---------+-----+
| 1/1 | 123 | xyz | 101.102.1.5 | android | ... |
| 1/9 | 123 | pqr | 101.102.7.7 | windows | ... |
| 1/17 | 123 | xyz | 101.102.1.5 | android | ... |
| 2/10 | 456 | abc | 101.102.9.9 | ios | ... |
| 3/27 | 123 | xyz | 101.102.1.5 | android | ... |
| ... | ... | ... | ... | ... | ... |
+------+---------+------------+-------------+---------+-----+
STOP! Before reading ahead here’s where I want you to take some time and think about the problem first.
My approach:
First, let’s think about what metrics can we add, the aim here is to add metrics that would help us learn more about our users’ behavior. To do so I like to think about potential questions that we’ll be looking to answer with the data for example — Is this a new user or a returning user? Or more specifically which visit is this for the user? Is the user a new fan or a veteran supporter? This exercise gets me the following metrics:
+------------------------+--------------------------------------------+
| Metric | Definition |
+------------------------+--------------------------------------------+
| visit_count | Visit number for a particular user |
| new_user_indicator | Indicator identifying a new user |
| renewed_user_indicator | Indicator identifying a "re-newed" user |
| fan_indicator | Indicator identifying a user as a "fan" |
+------------------------+--------------------------------------------+
Now that we have our metric definitions out of the way, let’s expand on these to actually build them. As each record in our table is a ‘click’ event, the visit_count is merely the record number for a particular user which is the rank of the click event as compared to other events by the user, the first-ever visit gets visit_count = 1. The new_user_indicator as you may guess is an indicator signifying if the user is a first-time visitor so this would be “Y” when the visit_count=1. Next up the renewed_user_indicator indicates if the user is someone that had forgotten about my blogs but is back after a hiatus, using this I can understand what brought them back. To determine this we’ll need a time window after which a user can be termed as “renewed”, let’s consider this to be 2 months for my blog. Finally, a “fan_indicator” will identify users that are fans, if you’re wondering how I make this classification, see here. The next step is to build these with SQL.
To get the visit_count I need to rank all visits for every user, this can be done using the RANK function. The new_user_indicator boils down to an IF condition using visit_count. To calculate the renewed_user_indicator, I need to compare each visit’s date to the previous visit to determine if the difference falls outside the 2-month window for a user to be “renewed”. Finally, fan_indicator requires a lookup on the user_rfm table we built here to classify the user as a fan. Here’s what all this looks like in SQL:
WITH uc AS (
SELECT
date,
user_id,
RANK() OVER(PARTITION BY user_id ORDER BY date) AS visit_count,
LAG(date) OVER(PARTITION BY user_id ORDER BY date)
AS previous_visit_date,
user_rfm.*
FROM user_clickstream
LEFT JOIN user_rfm
USING(user_id)
)
-- Metrics:
SELECT
date,
user_id,
visit_count,
CASE WHEN visit_count = 1 THEN "Y" ELSE "N" END AS new_user_indicator,
CASE WHEN DATE_DFF(date, previous_visit_date, MONTH) > 2
THEN "Y" ELSE "N" END AS renewed_user_indicator,
CASE WHEN (frequency > 6 AND recency > 6 AND depth > 6)
THEN "Y" ELSE "N" END AS fan_indicator
FROM uc
Here’s what the result from this looks like:
+-----+--------+------------+-------------+-----------------+--------+
|date |user_id |visit_count |new_user_ind |renewed_user_ind |fan_ind |
+-----+--------+------------+-------------+-----------------+--------+
| 1/1 | 123 | 1 | Y | N | N |
| 1/9 | 123 | 2 | N | N | N |
| 1/17| 123 | 3 | N | N | N |
| 2/10| 456 | 1 | Y | N | Y |
| 3/27| 123 | 4 | N | Y | N |
| ... | ... | ... | ... | ... | ... |
+-----+--------+------------+-------------+-----------------+--------+
This is great, we’ve designed a solution to calculate user metrics for our clickstream data, however, sorry to burst your bubble but this solution would NEVER work. If you’re wondering why, I have two words for you, BIG DATA, the same reason we had to convert user_clickstream into a table from a view last time. With a large volume of data, the calculation for the visit_count that uses the window function with the OVER clause would take forever to execute.
So let’s think about how we can calculate these metrics in finite time. If you remember from last time, we designed the user_clickstream table to be loaded in daily batches to make it manageable. Similarly, for our metrics, we don’t need to calculate metrics for the entire table together rather we just need to do the computation for the latest batch of clickstream data. However calculating metrics for the latest batch is not as straightforward as identifying the user.
To calculate the visit_count for a particular visit I need to know some history about the user’s prior visits which would act as a reference for my calculation, similarly, for the renewed_user_indicator I need to know the user’s last visit date irrespective of the batch. To do this we need to do a small calculation before calculating metrics where we calculate the reference visit_count for a user and their last visit date if one exists. Our metrics calculation for the batch would then use this info. But enough talk, let’s extend our user_clickstream data flow from last time to account for metrics:
And here’s what this looks like in SQL:
-- Step 1:
CREATE OR REPLACE TABLE metrics_benchmark AS
SELECT user_id,
MAX(visit_count) AS max_visit_count,
MAX(date) AS last_visit_date
FROM user_clickstream
GROUP BY user_id
-- Step 2:
INSERT INTO user_clickstream
WITH clickstream_metrics AS (
SELECT *,
-- If visit_count exists add onto it else start from scratch
CASE WHEN max_visit_count IS NULL
THEN RANK() OVER(PARTITION BY user_id ORDER BY date)
ELSE max_visit_count + RANK() OVER(PARTITION BY user_id
ORDER BY date)
END AS visit_count,
-- Backfill last_visit_date with current data
CASE WHEN LAG(date) OVER(PARTITION BY user_id ORDER BY date) IS NULL
THEN last_visit_date
ELSE LAG(date) OVER(PARTITION BY user_id ORDER BY date)
END AS previous_visit_date,
user_rfm.*
FROM user_clickstream_stage
LEFT JOIN metrics_benchmark
USING(user_id)
LEFT JOIN user_rfm
USING(user_id)
WHERE user_clickstream.date = CURRENT_DATE()
)
SELECT *,
visit_count,
CASE WHEN visit_count = 1 THEN "Y" ELSE "N" END AS new_user_indicator,
CASE WHEN DATE_DFF(date, previous_visit_date, MONTH) > 2
THEN "Y" ELSE "N" END AS renewed_user_indicator,
CASE WHEN (frequency > 6 AND recency > 6 AND depth > 6)
THEN "Y" ELSE "N" END AS fan_indicator
FROM clickstream_metrics
Let’s break this down, we start by grabbing the latest visit_count & date for each user and store it in metrics_benchmark. Then assuming results from the user match are in user_clickstream_stage, we extend the previous metric calculation using metrics_benchmark by adding to the last visit_count and backfilling the previous_visit_date if one exists. As a result, the window function is run over only batch data from the present day. This finally gets us the metrics we need on a scalable basis.
However, we are still forgetting something, remember that our knowledge of users evolves as they interact more with my blog. A user profile belonging to one user may move to another user as we learn more about the user. This is why we designed a process to merge users, see here. Now to make our metrics fireproof we need to account for these user changes.
Let’s take an example from above if our process determines that user profiles abc and xyz should both fall under user_id = 123, to update user_clickstream with this change would not just be changing the user_id but we would need to update all metrics associated with these users. In this case, user_id=123 is gaining a profile so their total visits would increase, and vice-versa for user_id=456, thus we have to recalculate metrics for both these users. Also, since the user change will affect all the trips we can use the first approach to calculate metrics and merge our changes back into user_clickstream.
TL;DR
Here’s what we did today — we started by brainstorming metrics that we can add to the clickstream data from my blog to gain insights about user behavior. Then we looked at metrics definitions and how to build them using SQL. Next, we extend our approach to implement it in a scalable manner with big data thinking about performance. We do this by designing a new data flow and updating our previous query to calculate metrics in batches. Finally, we discuss how we can account for user changes with these metrics to make our metrics fireproof.
Food for thought:
What are some metrics that you would add to the user_clickstream?
How would you use these metrics to get insights about user behavior on my blog?
How does metrics_benchmark help us with performance?
How would the data flow look like for calculating metrics driven by user changes?
Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data. If you don’t want to miss my next post, consider subscribing here, it’s free and you’ll be notified when I post my next blog.