Analytics with SQL: Marketing is difficult!
~Analyzing subscriber trends with SQL
Backstory:
To promote my recently launched blog, I decided to start a mailing list to reach more people. When users visit my blog there's an option at the bottom which lets them subscribe to the mailing list. Whenever a new blog releases the users on the list get an email with a link to the blog. Now if a user decides that they're not really interested in my blogs(☹) they have an option to unsubscribe from the mailing list at the bottom of the email. I decided to do some analysis on user behavior by looking at this data.
Problem:
I want to look at how many users are joining my mailing list on a daily basis and how many are unsubscribing to see what's working and what's not. However, the data I have is a daily feed of all users on my mailing list. Whenever someone subscribes, they're added to the mailing list and they show up in the following day's feed. Whenever a user unsubscribes, they vanish from the data feed the next day. I want to get to the point where I have three things for all the users on my mailing list: the date when they joined or left the mailing list, the user_id of the user, and the mailing_list_status, which is whether they joined(I) or left(O) my mailing list. Here's another catch, there are some days for which I don't have this feed because of things not working.
Data:
Below is what the mailing_list feed looks like. Remember this is an almost daily feed with a list of all the users on my mailing list. For example, 2021/1/1 will have 50 records with user_ids and 2021/1/2 will have 55 records which may mean 10 users joined and 5 left, which is what I want to get to.
+ - - - - -+ - - - -+
| date | DATE |
+ - - - - -+ - - - -+
| user_id | INT |
+ - - - - -+ - - - -+
Here's the subscriber_feed I want:
+--------------------------+-------+
| date | DATE |
+--------------------------+-------+
| user_id | INT |
|-----------------------------------
| mailing_list_status(I/O) | CHAR |
+--------------------------+-------+
STOP! Before reading ahead here's where I want you to take some time and think about the problem yourself before reading my approach.
My approach:
Now with everything in data, there are multiple correct approaches here but here's what I did. Also, you can solve the same problem using either SQL, python, R, or any other tool. Starting with the simple case where I have the data from today and yesterday and I want to find out who's in and who's out.
WITH present_day AS (
SELECT date, user_id
FROM mailing_list
WHERE date = "{present_day}"
),
day_before AS (
SELECT date, user_id
FROM mailing_list
WHERE date = "{previous_day}"
),
joined AS (
SELECT "{present_day}" AS date, pd.user_id, “I” AS mailing_list_status
FROM present_day pd LEFT JOIN day_before db
ON pd.user_id = db.user_id
WHERE db.user_id IS NULL
),
departed AS (
SELECT "{present_day}" AS date, db.user_id, “O” AS mailing_list_status
FROM day_before db LEFT JOIN present_day pd
ON pd.user_id = db.user_id
WHERE pd.user_id IS NULL
)
SELECT *
FROM joined
UNION ALL
SELECT *
FROM departed
This will get me what I need. But there's still something missing here. In my query I use the "{present_day}" and "{previous_day}" variables, how do I get this?
A simple solve will be to set:
present_day = CURRENT_DATE()
previous_day = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
However, because the feed is missing for some days, the previous_day may not always be the day before. Here's where rather than guessing the dates, you can get it from the data itself by doing something like this.
SELECT date AS present_day, LAG(date) OVER(ORDER BY date) AS previous_day
FROM (
SELECT date
FROM mailing_list
GROUP BY date
)
ORDER BY date
Tying it all together, you can use the dates from the second query to feed the first query, and boom you have what you need. There are many ways to do this, what I did was store the results of the second query in a python list and call the first query with the dates as arguments.
Food for thought:
What happens if there were multiple blogs and we had an additional column for blog_name, how would that modify the approach?
How do you initialize the subscriber_feed where the previous_day is NULL? How would you store this data efficiently?
Let me know how you'd approach it. If you found this helpful share it. If you're into this, you can find me on Twitter @abhishek27297 where I talk data.