Analytics with SQL: Hunt for the nonexistent
~ Finding the subscriber acquisition source with SQL
Backstory:
As you know from my previous posts, I’ve started a mailing list to promote my blog. In the last post in this series, we designed a way to travel back in time with data by building a history table to analyze my mailing list behavior, you can find that here. This time the plan is to extend it by adding some flavor that would enable us to answer questions like: How do folks encounter my mailing list? What drives them to subscribe? If they ever re-subscribe, what caused it? What we’ll be doing today would seem like throwing darts into the dark in hopes of hitting the bulls-eye, pretty similar to life if you ask me. So sharpen your aim and let’s dive in!
Problem:
So last time we were able to build ourselves the subscriber_feed_history table which gives us insight into a user’s mailing_list_status by day. Now while this is helpful to determine if a user was opted in or out on a particular day, it doesn’t tell me much about what drove the user to the particular action. To start with I’d like to know which source is leading users to subscribe to my mailing list. This will help me choose the right platform for promotion.
Data:
We’re starting with the subscriber_feed_history table we built here, this looks as follows:
+------+---------+---------------------+
| date | user_id | mailing_list_status |
+------+---------+---------------------+
| 1/23 | 777 | I |
| 1/15 | 123 | I |
| 1/8 | 456 | I |
| 1/4 | 789 | I |
+------+---------+---------------------+
I was also able to use a similar approach to build a source_history table from clickstream data for acquisition clicks(similar to what we worked with here). This table is a history of users' clicks with the source where the action was recorded. Here’s what the table looks like:
+-------------+---------+-----------+
| action_date | user_id | source |
+-------------+---------+-----------+
| 1/15 | 123 | LinkedIn |
| 1/7 | 456 | Medium |
| 1/3 | 456 | Direct |
| 1/3 | 789 | Instagram |
| 1/1 | 789 | Twitter |
| 1/1 | 777 | Twitter |
+-------------+---------+-----------+
The aim is to use this source_history to enhance the subscriber_feed_history with source info for the opt-in, i.e. what led the user to subscribe or how they encountered my blog. Also, there may be different reasons a user chooses to subscribe so we’ll do our best to capture this effect.
+------+---------+--------+---------------+
| date | user_id | status | opt_in_source |
+------+---------+--------+---------------+
| 1/23 | 777 | I | ?? |
| 1/15 | 123 | I | ?? |
| 1/8 | 456 | I | ?? |
| 1/4 | 789 | I | ?? |
+------+---------+--------+---------------+
STOP! Before reading ahead here’s where I want you to take some time and think about the problem yourself before reading my approach.
My approach:
A good practice when working with user data is to think about the customer journey. In our case, here’s a scenario, a user comes across my blog on a tweet and decides to check it out. From here the user either bounces or finds my blog interesting and decides to subscribe to my mailing list. Here’s where since “Twitter” led the user to my blog, it’ll be considered the source of the opt-in.
Now to implement this with data, the easiest way would be to perform a JOIN on the user_id and date between the subscriber_feed_history and source_history. This works perfectly for user_id=123 in the data above and now we know the source for that opt-in was LinkedIn.
+------+---------+---------------------+----------+
| date | user_id | mailing_list_status | source |
+------+---------+---------------------+----------+
| 1/15 | 123 | I | LinkedIn |
+------+---------+---------------------+----------+
However, this doesn’t help us for the others. So let’s think about when this approach would fail. I can think about a couple of scenarios like when the user encounters the blog on a particular day but decides to subscribe to the mailing list on a later day. Here’s another one: a user encounters the blog through multiple sources on the same day and then subscribes later. In these cases what’s happening is the JOIN on the date field is causing problems as the date may not be an exact match in both these tables which looks to be the case for most of the users above.
To account for this rather than JOINing on the date field which expects an exact match, we should find the closest date which recorded an opt-in corresponding to a subsequent visit. For example, for user_id=456 who opted-in on 1/8, here’s the JOIN between the two tables:
+---------+-------------+--------+-------------+---------+
| user_id | opt_in_date | status | action_date | source |
+---------+-------------+--------+-------------+---------+
| 456 | 1/8 | I | 1/7 | Medium |
| 456 | 1/8 | I | 1/3 | Direct |
+---------+-------------+--------+-------------+---------+
Here’s where we can clearly say the opt-in on 1/8 should be attributed to Medium since there was a corresponding visit on 1/7. If you think about what made you take this call, it’s the difference between the opt_in_date and the action_date (delta) which drove your decision. So effectively you’re looking for a rank based on this delta which will help you with the matching.
+---------+--------+--------+-------------+---------+-------+------+
| user_id | opt_in | status | action_date | source | delta | rank |
+---------+--------+--------+-------------+---------+-------+------+
| 456 | 1/8 | I | 1/7 | Medium | 1 | 1 |
| 456 | 1/8 | I | 1/3 | Direct | 5 | 2 |
+---------+--------+--------+-------------+---------+-------+------+
Before you run off with this solution, let’s think about where this could break. Let’s consider user_id=777:
+---------+--------+--------+-------------+--------+-------+------+
| user_id | opt_in | status | action_date | source | delta | rank |
+---------+--------+--------+-------------+--------+-------+------+
| 777 | 1/23 | I | 1/1 | ?? | 22 | 1 |
+---------+--------+--------+-------------+--------+-------+------+
In this case, the user clicked on a blog on the 1st and then eventually subscribed to the mailing list after almost a month. With our previous approach the delta, in this case, becomes 22 days, and given it is the only visit, it gets a rank=1. So what you’re saying here is that the user subscribed because of the interaction with the blog from a month ago. Now I’m not sure about your memory but I can’t recall stuff I saw on the internet from yesterday let alone a month. Here’s where we’d like to restrict the delta to prevent incorrect matching.
So let’s look at how we’d put this down in code:
-- All opt-ins from subscriber history
WITH subscriber_history AS (
SELECT date AS opt_in_date, user_id
FROM `subscriber_feed_history`
WHERE mailing_list_status = "I"
GROUP BY date, user_id
),
-- User actions with source
source_history AS (
SELECT action_date, user_id, source
FROM source_history
WHERE source <> "Direct"
),
-- Delta calculation: (Opt-In date - Click date)
subscriber_source_history AS (
SELECT opt_in_date, user_id, source,
DATE_DIFF(opt_in_date,action_date, DAY) AS delta
FROM subscriber_history
LEFT JOIN source_history
USING(user_id)
WHERE delta BETWEEN 0 AND 1
)
-- Getting the best match for the source of opt-in
SELECT * EXCEPT(rnk)
FROM (
SELECT * EXCEPT(delta),
RANK() OVER(PARTITION BY opt_in_date, user_id ORDER BY delta) AS rnk
FROM subscriber_history_source
) WHERE rnk = 1
If you’re wondering why Direct sources were left out, read here.
Now you can stop here if you’re happy with the source match rate for opt-ins. If you want a better match rate, you can play around with the delta above. The broader the window, the more the match rate but the lesser the reliability of the match. Or you can keep on reading for a different approach.
Now I’d like to talk about an alternative approach to the same problem which doesn’t rely on the dates at all. Let’s consider a user who’s not sure if they really want to receive emails from me and hence end up re-subscribing multiple times. Now for every re-subscription, the user interacted with the blog in some way that led to it. From the clickstream, we know how many times the user interacted with the blog irrespective of the date. So rather than matching on the date, we can purely match on the occurrence of the event.
Here’s an example for user_id=999:
subscriber_feed_history:
+------+---------+--------+------------+
| date | user_id | status | opt_in_num |
+------+---------+--------+------------+
| 2/4 | 999 | I | 5 |
| 1/30 | 999 | I | 4 |
| 1/20 | 999 | I | 3 |
| 1/8 | 999 | I | 2 |
| 1/4 | 999 | I | 1 |
+------+---------+--------+------------+
source_history:
+-------------+---------+-----------+-----------------+
| action_date | user_id | source | interaction_num |
+-------------+---------+-----------+-----------------+
| 1/31 | 999 | Instagram | 5 |
| 1/26 | 999 | Twitter | 4 |
| 1/13 | 999 | Medium | 3 |
| 1/5 | 999 | Twitter | 2 |
| 1/1 | 999 | Organic | 1 |
+-------------+---------+-----------+-----------------+
In this case, we know the user interacted with the blog 5 times and correspondingly subscribed to the mailing list 5 times as well. So with this approach, if there’s a match in the number of opt-ins with the number of interactions, we match the source with the opt-in.
Here’s what this looks like in SQL:
-- All opt-ins with opt-in number
WITH subscriber_history AS (
SELECT date AS opt_in_date, user_id,
RANK() OVER(PARTITION BY user_id ORDER BY date) AS optin_num
FROM `subscriber_feed_history`
WHERE mailing_list_status = "I"
GROUP BY date, user_id
),
-- User actions with source and interaction number
source_history AS (
SELECT user_id, source,
RANK() OVER(PARTITION BY user_id ORDER BY action_date) AS intrcn_num
FROM source_history
WHERE source <> "Direct"
)
-- Matching available user opt-ins on occurrence
SELECT opt_in_date, user_id, source
FROM subscriber_history sub
LEFT JOIN source_history src
ON sub.user_id = src.user_id
AND sub.optin_num = src.intrcn_num
WHERE src.intrcn_num IS NOT NULL
Note: We do use the dates here to calculate the occurrence but not for the actual matching. So now the dates don’t have as much importance as before.
Finally, why choose one when you can have both? The only caveat here is you’d want to give the first approach a higher priority over the latter for when there’s a source matched with both these.
Here’s how you do that in SQL:
WITH subscriber_history AS (
SELECT date AS opt_in_date, user_id,
RANK() OVER(PARTITION BY user_id ORDER BY date) AS optin_num
FROM `subscriber_feed_history`
WHERE mailing_list_status = "I"
GROUP BY date, user_id
),
source_history AS (
SELECT action_date, user_id, source,
RANK() OVER(PARTITION BY user_id ORDER BY action_date) AS intrcn_num
FROM source_history
WHERE source <> "Direct"
),
subscriber_source_delta AS (
SELECT opt_in_date, user_id, source,
DATE_DIFF(opt_in_date,action_date, DAY) AS delta
FROM subscriber_history
LEFT JOIN source_history
USING(user_id)
WHERE delta BETWEEN 0 AND 1
),
-- Approach #1: With dates
subscriber_source_history_delta AS (
SELECT * EXCEPT(rnk), 1 AS solution_rank
FROM (
SELECT * EXCEPT(delta),
RANK() OVER(PARTITION BY opt_in_date, user_id ORDER BY delta) AS rnk
FROM subscriber_history_source
) WHERE rnk = 1
),
-- Approach #2: With occurrence
subscriber_source_history_occurrence AS (
SELECT opt_in_date, user_id, source, 2 AS solution_rank
FROM subscriber_history sub
LEFT JOIN source_history src
ON sub.user_id = src.user_id
AND sub.optin_num = src.intrcn_num
WHERE src.intrcn_num IS NOT NULL
),
-- Combining the two with priority for the same opt-in
subscriber_source_history_combined AS (
SELECT * EXCEPT(solution_rank),
RANK() OVER(PARTITION BY user_id, opt_in_date ORDER BY solution_rank) AS final_solution_rank
FROM (
SELECT * FROM subscriber_source_history_delta
UNION ALL
SELECT * FROM subscriber_source_history_occurrence
)
)
-- Final source with priority #1 over #2
SELECT opt_in_date, user_id, source AS opt_in_source
FROM subscriber_source_history_combined
WHERE final_solution_rank = 1
Food for thought:
What other things can you look at to enhance the historical data? What can we do for the source of opt-outs?
For the first approach, what if there were two visits from different sources on the same day? How do you break the tie?
Why would you not rely on the dates in the data?
What are some drawbacks of the second approach?
In both approaches, why is the JOIN between the subscriber and source history a LEFT JOIN?
What other ways can you match the subscriptions?
Let me know how you’d approach it. If you found this helpful share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data.