Analytics with SQL: Going back in time
Backstory:
As you know from my previous posts, I’ve started a mailing list to promote my new blog. In my very first post in this series, we designed a solution for me to analyze on a given day, who’s joined and who’s left my mailing list, you can find that here. In a corresponding post, we built a report around this giving me insights into how my mailing list is changing, this can be found here. This time the plan is to further extend this and make it a more robust solution. This would help me get a better picture of my mailing list performance, help me answer questions with the right data, and come up with better strategies for growing my blog. There’s going to be some time traveling involved so buckle up for the ride.
Problem:
The specific question I’m looking to answer is on a particular day(in the past), what was the mailing_list_status for a particular user, in other words, were they signed up for my mailing list or had opted out of the same. The challenge is I only have this info available for the present day, that is I know if someone is signed up or opted out of my mailing list as of today.
Data:
Here’s the subscriber_feed with the daily changes that we built in the first post that I want to use as a starting point.
+--------------------------+-------+
| date | DATE |
+--------------------------+-------+
| user_id | INT |
|-----------------------------------
| mailing_list_status(I/O) | CHAR |
+--------------------------+-------+
Here’s an example of a question I’m trying to answer:
Was user_id=123 signed up (mailing_list_status=I) or unsubscribed (mailing_list_status=O) from my mailing list on 1st Jan 2021?
STOP! Before reading ahead here’s where I want you to take some time and think about the problem yourself before reading my approach.
My approach:
Here’s where the concept of a current and a history table comes into the picture. These are related to a concept called “Slowly changing dimensions” in a data warehouse, read more here.
So let me go over these in simple terms, a current table is a table with the most up-to-date or current version of the data. As opposed to a current table that has no knowledge of the past event or how the data got to the current state, a history table as you may have guessed maintains a record of all these changes. There are multiple ways of going about it, but it is a table with a row for all changes in the data and only the most recent change is reflected in the current.
If the current table for user_id=123 looks something like this:
+------+---------+---------------------+
| date | user_id | mailing_list_status |
+------+---------+---------------------+
| 1/30 | 123 | I |
+------+---------+---------------------+
Then the subscriber_feed_history may look something like this:
+------+---------+---------------------+
| date | user_id | mailing_list_status |
+------+---------+---------------------+
| 1/30 | 123 | I |
| 1/25 | 123 | O |
| 1/18 | 123 | I |
| 1/10 | 123 | O |
| 1/1 | 123 | I |
+------+---------+---------------------+
So how do we build this? Let’s start with how are you getting the updates, this may be a file from the source system with all the changes or a table with all changes stored. If you recollect from my first post we built a table with all changes in the subscriber feed for a given date. I highly recommend giving it a read for context in this post. Now if we were to store this data for every day in a separate table, we’ll have ourselves a history table.
Now if you think about the actual data for a second, let’s imagine we have a total of 100 users. This means the current table will have 100 rows for each user and their corresponding status(In or Out). What about the history table? This will have every change for a given user, so if a user subscribes to my mailing list and later opts out that’s 2 rows right there. Now imagine this for all users and you see how this table may get really big with the growing number of users. Hence a key point here is to use partitions, read more here.
We can do this as below. Here the subscriber_feed is the table with the daily updates or changes. Let’s refer to the history table as subscriber_feed_history. This table should be updated on a daily basis with automated pipelines:
-- Initialize
CREATE TABLE `subscriber_feed_history`
PARTITION BY date
AS
SELECT *
FROM `subscriber_feed`
WHERE date = <first_date>
-- Update
INSERT INTO `subscriber_feed_history`
SELECT *
FROM `subscriber_feed`
WHERE date = CURRENT_DATE()
Now that we have a history table let’s design the current table and define a process to update it with new data. If you think about how we can initialize this table it would be similar to the history table as both tables will start with an initial list of users that are subscribed to the mailing list(since no unsubscribes yet). The big difference is the update mechanism. While for the history we’re adding a new row for the change, the current table updates the existing row with the change.
In SQL, this can be done as follows. A few things to note here, first we’re updating the entire current table with only the latest updates from the history table(derived using the date column). Second, we use a merge statement, if you’re not familiar with these read more here. These basically let you perform INSERT, UPDATE, and DELETE operations on a target table that you’re trying to update with a source table which is the source of the updates. All this in one statement which is pretty powerful if you ask me. These have three straightforward conditions which dictate what to do when there’s a match(between source and target) and otherwise.
WITH latest_update AS (
SELECT *
FROM `subscriber_feed_history`
WHERE date = CURRENT_DATE()
)
MERGE `subscriber_feed_current` AS current
USING latest_update
ON current.user_id = latest_update.user_id
-- Scenario: There's an update in status for an existing user
-- Outcome: We update the status in the current table
WHEN MATCHED BY TARGET THEN
UPDATE SET current.mailing_list_status = latest_update.mailing_list_status
AND current.date = latest_update.date
-- Scenario: There's a new user not in the current table, first subscription
-- Outcome: We add this user to the current table
WHEN NOT MATCHED BY TARGET THEN
INSERT(date, user_id, mailing_list_status)
VALUES(latest_update.date, latest_update.user_id, latest_update.mailing_list_status)
The “WHEN NOT MATCHED BY SOURCE” condition is not needed in our case as this is the case where there’s no change in a user’s mailing_list_status, so we do nothing.
Now, you can easily leverage the subscriber_feed_history table to get what we’re looking for but let’s simplify this further to make our lives a little easier. If you think about how you can get the mailing_list_status for a particular user on a given day, you’ll know it’s not super straightforward with the table as of now. Let’s take the above example for user_id=123 referring to the subscriber_feed_history table.
Here for user_id=123 if you want to know the status as of 1/15, it’s unsubscribed(O) as you see the user unsubscribed on 1/10 and re-subscribed on 1/18, so for any date between these, the user was unsubscribed. To make this easier to look for we can add two columns as below.
+------------+------------+---------+---------------------+
| start_date | end_date | user_id | mailing_list_status |
+------------+------------+---------+---------------------+
| 1/30 | 12/31/9999 | 123 | I |
| 1/25 | 1/29 | 123 | O |
| 1/18 | 1/24 | 123 | I |
| 1/10 | 1/17 | 123 | O |
| 1/1 | 1/9 | 123 | I |
+------------+------------+---------+---------------------+
What we did here is we added a date range where this status was applicable. Now to look for any particular day all we can do is look for the record where:
date >= start_date AND date <= end_date
The end_date=12/31/9999 indicates the present row with an infinite end date.
There can be many ways to implement this, the easiest is to create a view on top of your history table with these two columns. This can be done as follows.
CREATE VIEW `subscriber_feed_history_view`
AS
SELECT
date AS start_date,
CASE WHEN LEAD(date) OVER(PARTITION BY user_id ORDER BY date) IS NULL
THEN "12/31/9999"
ELSE
DATE_SUB(LEAD(date) OVER(PARTITION BY user_id ORDER BY date), INTERVAL 1 DAY)
END AS end_date,
user_id,
mailing_list_status
FROM `subscriber_feed_history`
The end_date calculation here first checks if there exists a following change for the user with the LEAD function. If there’s none, this means the change is the latest change(the one present in the current table) and it assigns an infinite end date for this row. On the other hand, if there is a change that follows, we set the end_date as a day prior to the following change with the DATE_SUB function.
And voila! We finally have a way to travel back in time for our data using SQL.
Let me know how you’d approach it. If you found this helpful share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data.