Analytics with SQL: Performance report
Backstory:
As you know from my last post, I started a mailing list to promote my new blog. In that post we designed a solution to capture the data needed to analyze the daily trends of my mailing list; if you haven't read it already, check it out here, as I recommend reading it first for context. Now that I have a table of my mailing list's activity, I want to build a report on top of it so I can judge my mailing list's, and thus my blog's, performance.
Problem:
I want to leverage the subscriber_feed table that we built last time. While this table does have the daily numbers, it's difficult to tell what's up with my mailing list just by looking at it, so I want to build a report that shows how my blog is performing. For each day in the past week, the report should show the number of users that joined and left my mailing list, and how that affected the total number of users on the list. Here's the thing though: when it comes to the total, I only have the number for the present day, i.e. how many users are currently on my mailing list. You can assume I'm deriving this from the mailing_list table.
Data:
Here's the subscriber_feed table that we built last time, with the date, the user_id of the subscriber, and the mailing_list_status, which indicates whether the subscriber joined (I) or unsubscribed (O) from my mailing list.
+--------------------------+------+
| date                     | DATE |
+--------------------------+------+
| user_id                  | INT  |
+--------------------------+------+
| mailing_list_status(I/O) | CHAR |
+--------------------------+------+
Here's an example of the performance_report I want:
+----------+-------+--------+----------+
| date     | total | joined | departed |
+----------+-------+--------+----------+
| 2021/1/2 |   170 |     25 |        5 |
| 2021/1/1 |   150 |     10 |        3 |
+----------+-------+--------+----------+
STOP! Before reading ahead, here's where I want you to take some time and think about the problem yourself before reading my approach.
My approach:
So given the subscriber_feed, the joined and departed numbers aren't that complicated to get; the only thing to watch out for is that we want the count of unique user_ids. Below is a way you can get this.
SELECT
    date,
    COUNT(DISTINCT CASE WHEN mailing_list_status = 'I' THEN user_id END) AS joined,
    COUNT(DISTINCT CASE WHEN mailing_list_status = 'O' THEN user_id END) AS departed
FROM subscriber_feed
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY date
ORDER BY date DESC
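To sanity-check this aggregation, here's a minimal sketch in Python using an in-memory SQLite database (my assumption for a quick sandbox, not part of the original setup; SQLite has no DATE_SUB, so the seven-day filter is dropped, and the sample rows are made up):

```python
import sqlite3

# Toy subscriber_feed with made-up rows; note user 4's duplicate 'I' event,
# which COUNT(DISTINCT ...) should count only once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriber_feed (date TEXT, user_id INT, mailing_list_status CHAR);
INSERT INTO subscriber_feed VALUES
  ('2021-01-01', 1, 'I'), ('2021-01-01', 2, 'I'), ('2021-01-01', 3, 'O'),
  ('2021-01-02', 4, 'I'), ('2021-01-02', 4, 'I'), ('2021-01-02', 2, 'O');
""")

rows = conn.execute("""
SELECT
    date,
    COUNT(DISTINCT CASE WHEN mailing_list_status = 'I' THEN user_id END) AS joined,
    COUNT(DISTINCT CASE WHEN mailing_list_status = 'O' THEN user_id END) AS departed
FROM subscriber_feed
GROUP BY date
ORDER BY date DESC
""").fetchall()
print(rows)  # [('2021-01-02', 1, 1), ('2021-01-01', 2, 1)]
```

The CASE expression with no ELSE yields NULL for non-matching rows, and COUNT ignores NULLs, so each branch counts only the relevant status.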
Now coming to the total, that's a little tricky given we only have the total for the present day. One possible way is to store every day's total in a table and use that. While this would get us what we want, it'd be overkill, since this is something we can derive analytically from the subscriber_feed.
Here's how I approached this: to get the total for a particular day, I need the present day's total and the net change in the total between that day and the present day. I already have the daily change (joined - departed); what I need now is the cumulative change between the present day and the day in question. I can do this in SQL using the SUM function over a window, where the window runs from the present day back to the respective day. Here's what this gets me.
+----------+--------+----------+-----------------+
| date     | joined | departed | SUM(net_change) |
+----------+--------+----------+-----------------+
| 2021/1/2 |     25 |        5 |              20 |
| 2021/1/1 |     10 |        3 |              27 |
+----------+--------+----------+-----------------+
While this gets me the changes in the total, each day's cumulative net_change impacts the total of the preceding day, not its own. For example, if the total on 2021/1/2 is 100 and the net_change on that day is 20, then the total on 2021/1/1 must be 80, since 2021/1/2 added a net 20 subscribers. So to perform this in SQL, I basically need each row to carry the SUM(net_change) of the day after it, which I can get with the window function LEAD. That gets me something like this.
+----------+--------+----------+-----------------------+
| date     | joined | departed | LEAD(SUM(net_change)) |
+----------+--------+----------+-----------------------+
| 2021/1/2 |     25 |        5 |                  null |
| 2021/1/1 |     10 |        3 |                    20 |
+----------+--------+----------+-----------------------+
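The two steps above, the descending running sum and then LEAD to shift it onto the preceding day, can be reproduced with a short sketch in Python using in-memory SQLite (my sandbox assumption; SQLite 3.25+ is needed for window functions). I seed the daily aggregates directly with the example numbers from the tables above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (date TEXT, joined INT, departed INT)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [("2021-01-02", 25, 5), ("2021-01-01", 10, 3)])

rows = conn.execute("""
SELECT date, joined, departed, cumulative_change,
       -- shift each day's running total onto the preceding day
       LEAD(cumulative_change) OVER (ORDER BY date) AS effective_change
FROM (
    -- running net change from the present day back to each day
    SELECT *, SUM(joined - departed) OVER (ORDER BY date DESC) AS cumulative_change
    FROM daily
)
ORDER BY date DESC
""").fetchall()
print(rows)
# [('2021-01-02', 25, 5, 20, None), ('2021-01-01', 10, 3, 27, 20)]
```

Note that LEAD is computed in a separate outer query: window functions can't be nested directly, so the running sum has to be materialized in a sub-query first.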
The only piece left is to get the present-day total and use it to back-calculate the totals for the previous days by subtracting our effective_change, which is LEAD(SUM(net_change)). Remember, our present-day total comes from the mailing_list table.
Putting it all together, the final query would look something like this.
SELECT
    date,
    present_day_total - (IFNULL(effective_joined, 0) - IFNULL(effective_departed, 0)) AS total,
    joined,
    departed
FROM (
    SELECT *,
        LEAD(cumulative_joined) OVER (ORDER BY date) AS effective_joined,
        LEAD(cumulative_departed) OVER (ORDER BY date) AS effective_departed
    FROM (
        SELECT *,
            SUM(joined) OVER (ORDER BY date DESC) AS cumulative_joined,
            SUM(departed) OVER (ORDER BY date DESC) AS cumulative_departed
        FROM (
            SELECT
                date,
                COUNT(DISTINCT CASE WHEN mailing_list_status = 'I' THEN user_id END) AS joined,
                COUNT(DISTINCT CASE WHEN mailing_list_status = 'O' THEN user_id END) AS departed,
                (SELECT COUNT(DISTINCT user_id) FROM mailing_list) AS present_day_total
            FROM subscriber_feed
            WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
            GROUP BY date
        ) AS daily_changes
    ) AS cumulative_changes
) AS effective_changes
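As a quick check that the whole thing hangs together, here's a runnable sketch in Python with in-memory SQLite (again my sandbox assumption; SQLite lacks DATE_SUB, so the date filter is dropped). Rather than seeding hundreds of raw subscriber_feed rows, I start from the innermost sub-query's output, the daily aggregates, plus a 170-member mailing_list to match the example report:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Output of the innermost sub-query, seeded directly with the example numbers.
conn.execute("CREATE TABLE daily (date TEXT, joined INT, departed INT)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [("2021-01-02", 25, 5), ("2021-01-01", 10, 3)])
# 170 current subscribers, matching the example's present-day total.
conn.execute("CREATE TABLE mailing_list (user_id INT)")
conn.executemany("INSERT INTO mailing_list VALUES (?)", [(i,) for i in range(170)])

rows = conn.execute("""
SELECT
    date,
    present_day_total
        - (IFNULL(effective_joined, 0) - IFNULL(effective_departed, 0)) AS total,
    joined,
    departed
FROM (
    SELECT *,
        LEAD(cumulative_joined) OVER (ORDER BY date) AS effective_joined,
        LEAD(cumulative_departed) OVER (ORDER BY date) AS effective_departed
    FROM (
        SELECT *,
            SUM(joined) OVER (ORDER BY date DESC) AS cumulative_joined,
            SUM(departed) OVER (ORDER BY date DESC) AS cumulative_departed,
            (SELECT COUNT(DISTINCT user_id) FROM mailing_list) AS present_day_total
        FROM daily
    )
)
ORDER BY date DESC
""").fetchall()
print(rows)  # [('2021-01-02', 170, 25, 5), ('2021-01-01', 150, 10, 3)]
```

The IFNULL calls handle the most recent day, whose LEAD values are null, so its total falls straight out as the present-day total.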
Now, normally I would use a CTE for such a big query, but in this case I feel the sub-queries do justice to the flow of data as described above. A tip here: when you encounter such a query, never start reading from the top; go to the deepest sub-query and then work your way up to follow the flow of the data through the query.
Food for thought:
Instead of a single blog what if there were multiple blogs and a column indicating the blog_name? Using the PARTITION BY clause would be key here.
What if I also wanted to track how many users re-subscribe to my mailing list? How would I do this?
Finally, now that I have the query for the report how can I best present this? In a table, view, a dashboard, or something else?
Let me know how you'd approach it. If you found this helpful share it. If you're into this, you can find me on Twitter @abhishek27297 where I talk data.