From Mass Emails to Targeted Campaigns: Building Data for Machine Learning
Introduction:
It’s time for another episode of Data Detective! As you may remember, we’ve been digging into the nitty-gritty details of user behavior on my blog. Last time, we cracked the case on email bounces and cleared my name of any scamming suspicions (see here). Today, we’re diving into the intriguing world of personalized marketing — that uncanny feeling we all get when ads or social media posts seem to know exactly what’s been on our minds. The plan is to use this to go from sending mass emails to my subscribers and hoping for the best, to sending them only the blogs I know they’ll read. It’s a tough case, but we’re up for the challenge.
Learning objective: Explore an end-to-end ML problem for personalized marketing. You’ll learn step-by-step how to construct a model that predicts user engagement for specific topics. Additionally, you’ll gain valuable knowledge on aggregating user attributes to generate the training data for the model. By the end, you’ll be equipped with the skills to implement targeted marketing based on user preferences and optimize marketing efforts.
The Data:
Like a detective arriving at a crime scene, let’s begin by examining the available evidence — the data that will aid us in our investigation. Starting with subscriber_feed_history, which shows a subscriber’s mailing list history: (Source)
+------------+----------+---------+--------+----------+
| start_date | end_date | user_id | status | source |
+------------+----------+---------+--------+----------+
| 1/1 | - | 123 | I | Twitter |
| 3/2 | - | 456 | O | - |
| 2/1 | 3/1 | 456 | I | Medium |
| 4/2 | - | 789 | O | - |
| 2/2 | 4/1 | 789 | I | LinkedIn |
| ... | ... | ... | ... | ... |
+------------+----------+---------+--------+----------+
Next, there is user_clickstream, which ties clickstream events to users: (Source)
+---------------+---------+--------+-----------------+----------+-----+
| date_time |visit_id |user_id | url | referrer | ... |
+---------------+---------+--------+-----------------+----------+-----+
|1/15/2023 12:00| gft79 | 123 | ./back-in-time | email | ... |
|2/15/2023 2:15 | hgq98 | 456 | ./new-returning | email | ... |
|2/15/2023 10:10| hce20 | 123 | ./new-returning | web | ... |
|3/15/2023 7:00 | buo90 | 789 | ./metrics | email | ... |
| ... | ... | ... | ... | ... | ... |
+---------------+---------+--------+-----------------+----------+-----+
STOP! Before reading ahead, here’s where I want you to take some time and think about the problem first.
My approach:
Let’s start by phrasing the problem as a data problem. Essentially, our objective is to predict who is more likely to read a particular blog before I send them an email about it. There are two ways to go about this. A descriptive approach would be to run statistics on past blogs to figure out user preferences; however, this cannot personalize predictions. A second approach would be to build a predictive model that takes user behavior into account and predicts the probability of a user reading a particular blog. We’ll pick this approach today.
To streamline our efforts, I’ve categorized all my blogs into three distinct categories: “email,” “customer,” and “data”. By categorizing them, we can assess the likelihood of users reading specific topics, allowing for a broader prediction. Here’s the blogs table with this info:
+--------------+-----------------+----------+
| publish_date | url | category |
+--------------+-----------------+----------+
| 1/15 | ./back-in-time | email |
| 2/15 | ./new-returning | customer |
| 3/15 | ./metrics | data |
| ... | ... | ... |
+--------------+-----------------+----------+
Next question: how do I build such a model? There are many ways to do this, but today we’ll use supervised learning. This entails building a model by showing it sample data of what’s expected.
So what should this data look like? Let’s reverse engineer the solution for a second. The outcome we’re looking for is: given a user — or in other words, given a set of user attributes — I want to determine their likelihood of reading a certain topic. One thing to note is that I’m looking to make this prediction at a user level, hence the model should see examples at a user level as well. To achieve this, we need a labeled dataset comprising examples of users who have chosen to read a specific topic and others who haven’t. Here’s an example of how this labeled data might look:
+------+---------+----------+------+
| date | user_id | category | read |
+------+---------+----------+------+
| 1/15 | 123 | email | Y |
| 2/15 | 123 | customer | N |
| 2/15 | 456 | customer | Y |
| 3/15 | 123 | data | N |
| 3/15 | 789 | data | Y |
| ... | ... | ... | ... |
+------+---------+----------+------+
Before we get there, let’s define what we consider a user reading a topic. This can be broad, as many users go back and read multiple blogs. We will focus on those who read a blog after seeing it in the email. Additionally, to restrict the scope, we’ll consider only those reads that occurred before the following email was sent out.
So to find out who read and who didn’t, I need to know who was sent an email, who clicked on it, who didn’t click, and who clicked too late. Let’s think about how to use the above data to get there. As we know, my current strategy is to email everyone on my list, so all users who were subscribed on the day I published a blog should’ve been sent the email. Further, I can use the clickstream data to find the first time they opened the blog, using referrer = "email" to identify email clicks. Let’s write this in SQL:
-- Mailing list history for users
WITH subscribed_users AS (
  SELECT start_date, IFNULL(end_date, "12/31/9999") AS end_date, user_id
  FROM subscriber_feed_history
  WHERE status = "I" -- Opted-In
),
-- Blog details (Remember, publish date = email sent date)
blog AS (
  SELECT publish_date AS sent_date, url, category,
    IFNULL(
      LEAD(publish_date) OVER (ORDER BY publish_date),
      "12/31/9999"
    ) AS next_sent_date
  FROM blogs
),
-- All users sent an email (Criteria: Subscribed when blog was published)
email_sent AS (
  SELECT sent_date, next_sent_date, url, category, user_id
  FROM subscribed_users u
  JOIN blog b
    ON u.start_date <= b.sent_date
   AND b.sent_date < u.end_date
),
-- Users who opened the email
email_open AS (
  SELECT user_id, url, MIN(DATE(date_time)) AS open_date -- first open date
  FROM user_clickstream
  WHERE referrer = "email" -- source: "email"
  GROUP BY user_id, url
),
-- Read criteria: sent the email + opened the email (in time)
read AS (
  SELECT s.sent_date, s.user_id, s.category, 1 AS read
  FROM email_sent AS s
  JOIN email_open AS o
    ON s.user_id = o.user_id
   AND s.url = o.url
   AND s.next_sent_date > o.open_date
),
-- Not read criteria: sent the email + (didn't open OR opened late)
not_read AS (
  SELECT s.sent_date, s.user_id, s.category, 0 AS read
  FROM email_sent AS s
  LEFT JOIN email_open AS o
    ON s.user_id = o.user_id
   AND s.url = o.url
  WHERE o.user_id IS NULL -- didn't open
     OR s.next_sent_date <= o.open_date -- opened late
)
SELECT *
FROM read
UNION ALL
SELECT *
FROM not_read
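For readers who prefer to prototype in Python, here’s a minimal pandas sketch of the same labeling logic. The table and column names mirror the SQL above, but the sample rows are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for subscriber_feed_history, blogs, and user_clickstream
subs = pd.DataFrame({
    "user_id": [123, 456],
    "start_date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
    "end_date": [pd.NaT, pd.Timestamp("2023-03-01")],
})
blogs = pd.DataFrame({
    "publish_date": pd.to_datetime(["2023-01-15", "2023-02-15"]),
    "url": ["./back-in-time", "./new-returning"],
})
clicks = pd.DataFrame({
    "date_time": pd.to_datetime(["2023-01-15 12:00", "2023-02-15 02:15"]),
    "user_id": [123, 456],
    "url": ["./back-in-time", "./new-returning"],
    "referrer": ["email", "email"],
})

# Open-ended subscriptions run "forever"
subs["end_date"] = subs["end_date"].fillna(pd.Timestamp.max)
# Next publish date bounds the read window (LEAD equivalent)
blogs = blogs.sort_values("publish_date")
blogs["next_sent_date"] = blogs["publish_date"].shift(-1).fillna(pd.Timestamp.max)

# email_sent: cross join, then keep users subscribed on the publish date
sent = subs.merge(blogs, how="cross")
sent = sent[(sent["start_date"] <= sent["publish_date"])
            & (sent["publish_date"] < sent["end_date"])]

# email_open: first email-referred click per user and url
opens = (clicks[clicks["referrer"] == "email"]
         .assign(open_date=lambda d: d["date_time"].dt.normalize())
         .groupby(["user_id", "url"], as_index=False)["open_date"].min())

# Label: read iff the email click happened before the next email went out
labels = sent.merge(opens, on=["user_id", "url"], how="left")
labels["read"] = ((labels["open_date"].notna())
                  & (labels["open_date"] < labels["next_sent_date"])).astype(int)
print(labels[["publish_date", "user_id", "read"]])
```

The `LEFT JOIN` plus the `notna` check plays the role of the `read`/`not_read` split: users sent an email with no qualifying open fall out as 0.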
Now let’s think about what user attributes would help us answer this question. Depending on how much data you’re working with, it makes sense to let the model figure out what works and what doesn’t; we can use our intuition to kickstart this process. The user attributes can be split into two buckets — categorical and numerical. While categorical variables tell us more about the user itself, numerical variables tell us about the user’s behavior. Below is a list of variables that will get us started:
Categorical
mailing_list_status: Is the user subscribed?
opt_in_source: User acquisition source
Numerical
page_views: Total number of page views by a user
time_spent: Total time spent on the blog
A key thing to note when working with user attributes is that they keep changing. A user’s mailing_list_status can change at any minute, and the total page_views can differ depending on when you calculate it. With all this variance, it’s time to apply constraints to make this a deterministic problem. Now, in most cases, a subscription is earned after multiple interactions. Hence, it makes sense to aggregate the above variables over a period. For example, for the email sent on 1/15/23, I need the past user attributes that contributed to the read response. To determine the appropriate aggregation interval, it’s necessary to experiment with different time frames. In this scenario, let’s assume we’ll aggregate attributes over the past month.
Aggregating attributes looks different for categorical and numerical features. While for numerical features it’s merely adding up the page_views, for categorical features like mailing_list_status it’s subjective, as a user can subscribe and unsubscribe multiple times in a month. Hence, we need to consider what we want to capture with each feature. In this case, the focus is on whether the user was subscribed at any point during the last month, so we capture whether the user was ever opted in during that period. Similarly, for the source feature, we select the first acquisition source within the interval.
-- Blog: Aggregate over past 1 month from publish_date
WITH blog_data AS (
  SELECT publish_date AS end_date,
    DATE_SUB(publish_date, INTERVAL 1 MONTH) AS start_date,
    url, category
  FROM blogs
),
-- Mailing list history to determine status & source for a period
mailing_list AS (
  SELECT start_date, IFNULL(end_date, "12/31/9999") AS end_date,
    user_id, status, source
  FROM subscriber_feed_history
),
-- Aggregate status for every start and end date, prioritizing Opt-Ins
aggregated_mailing_list_status AS (
  SELECT blog.end_date, blog.start_date, list.user_id,
    MIN(list.status) AS agg_status -- to choose "I" over "O"
  FROM blog_data AS blog
  JOIN mailing_list AS list
    -- include start date
    ON (list.start_date <= blog.start_date AND blog.start_date <= list.end_date)
    -- include end date
    OR (list.start_date <= blog.end_date AND blog.end_date <= list.end_date)
    OR ( -- capture everything in between
      blog.start_date < list.start_date
      AND list.end_date < blog.end_date
    )
  GROUP BY blog.end_date, blog.start_date, list.user_id
),
-- Acquisition Source
source AS (
  SELECT blog.end_date, blog.start_date, list.user_id, list.source,
    RANK() OVER (
      PARTITION BY blog.end_date, blog.start_date, list.user_id
      ORDER BY list.start_date
    ) AS source_rank -- Rank sources for each period
  FROM blog_data AS blog
  JOIN mailing_list AS list
    ON (list.start_date <= blog.start_date AND blog.start_date <= list.end_date)
    OR (list.start_date <= blog.end_date AND blog.end_date <= list.end_date)
    OR (
      blog.start_date < list.start_date
      AND list.end_date < blog.end_date
    )
),
-- Aggregated source: First source picked for each period
aggregated_source AS (
  SELECT end_date, start_date, user_id, source AS agg_source
  FROM source
  WHERE source_rank = 1 -- First source for every period
)
-- Combine all categorical variables
SELECT end_date AS date,
  user_id,
  agg_status AS status,
  agg_source AS source
FROM aggregated_mailing_list_status
JOIN aggregated_source
USING (end_date, start_date, user_id)
The JOIN between blog_data and mailing_list above associates with each period — the month leading up to a publish date — every user status that overlaps it. To feed these features to a model, we’ll need to encode them numerically, e.g. as one-hot vectors:
df['status'] = df['status'].replace({'I': 1, 'O': 0})
df_source_onehot = pd.get_dummies(df['source'], drop_first=True)
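To make the encoding concrete, here’s a small self-contained sketch on made-up rows showing what the mapping and get_dummies produce (the `prefix` argument is my addition, just for readable column names):

```python
import pandas as pd

# Toy feature frame mirroring the status/source columns above
df = pd.DataFrame({
    "status": ["I", "I", "O"],
    "source": ["Twitter", "Medium", "LinkedIn"],
})

# Map opt-in status to a binary flag
df["status"] = df["status"].map({"I": 1, "O": 0})

# One-hot encode source; drop_first drops one redundant category
# (here LinkedIn, the alphabetically first) to avoid collinearity
source_onehot = pd.get_dummies(df["source"], prefix="source", drop_first=True)
df = pd.concat([df.drop(columns="source"), source_onehot], axis=1)
print(df.columns.tolist())
```

A LinkedIn user is now the row with zeros in both source columns — exactly the `[0 0]` vector you’ll see in the feature table further down.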
Next, let’s look at numerical variables:
-- Blog: Aggregate over past 1 month from publish_date
WITH blog_data AS (
  SELECT publish_date AS end_date,
    DATE_SUB(publish_date, INTERVAL 1 MONTH) AS start_date,
    url, category
  FROM blogs
),
-- Visit stats from clickstream
visit_stats AS (
  SELECT DATE(date_time) AS event_date, user_id, visit_id, url,
    COUNT(1) AS page_view, -- total clicks
    TIMESTAMP_DIFF(
      MAX(date_time), MIN(date_time), MINUTE
    ) AS time_spent -- Difference between first and last click
  FROM user_clickstream
  GROUP BY DATE(date_time), user_id, visit_id, url
),
-- Aggregate visit stats for blogs
agg_visits AS (
  SELECT start_date, end_date, user_id, category,
    SUM(page_view) AS agg_page_view,
    SUM(time_spent) AS agg_time_spent
  FROM blog_data AS blog
  JOIN visit_stats
    ON blog.start_date <= visit_stats.event_date
   AND visit_stats.event_date <= blog.end_date
   AND blog.url = visit_stats.url
  GROUP BY start_date, end_date, user_id, category
)
-- Numerical features
SELECT start_date, end_date, user_id, category,
  agg_page_view AS page_views,
  agg_time_spent AS time_spent
FROM agg_visits
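The same visit-level rollup can be sketched in pandas. This is a hypothetical toy example — two visits for one user — showing the per-visit stats (clicks per visit, first-to-last-click span) and the sum across visits:

```python
import pandas as pd

# Toy clickstream: two visits for one user on one blog
clicks = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2023-01-10 12:00", "2023-01-10 12:05", "2023-01-10 12:12",
        "2023-01-12 09:00",
    ]),
    "user_id": [123, 123, 123, 123],
    "visit_id": ["a1", "a1", "a1", "b2"],
    "url": ["./back-in-time"] * 4,
})

# Per-visit stats: clicks per visit and first-to-last-click span in minutes
visit_stats = (clicks
    .groupby(["user_id", "visit_id", "url"])
    .agg(page_view=("date_time", "size"),
         first=("date_time", "min"),
         last=("date_time", "max"))
    .reset_index())
visit_stats["time_spent"] = (
    (visit_stats["last"] - visit_stats["first"]).dt.total_seconds() // 60
)

# Aggregate across visits per user and url (the one-month window filter
# against blog publish dates would be applied here, as in the SQL above)
agg = (visit_stats.groupby(["user_id", "url"], as_index=False)
       .agg(page_views=("page_view", "sum"),
            time_spent=("time_spent", "sum")))
print(agg)
```

Note that a single-click visit contributes 0 minutes of time_spent — a known limitation of measuring dwell time as the gap between first and last click.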
Combining the above two results gets us our features as below:
+------+---------+----------+--------+--------+------------+------------+
| date | user_id | category | status | source | page_views | time_spent |
+------+---------+----------+--------+--------+------------+------------+
| 1/15 | 123     | email    | 1      | [1 0]  | 5          | 10         |
| 2/15 | 123     | customer | 1      | [1 0]  | 7          | 14         |
| 2/15 | 456     | customer | 1      | [0 1]  | 3          | 4          |
| 3/15 | 123     | data     | 1      | [1 0]  | 13         | 12         |
| 3/15 | 456     | data     | 1      | [0 1]  | 5          | 7          |
| 3/15 | 789     | data     | 1      | [0 0]  | 2          | 8          |
| ...  | ...     | ...      | ...    | ...    | ...        | ...        |
+------+---------+----------+--------+--------+------------+------------+
Finally, this gives us a glimpse into the training dataset: each row represents a user at the time a blog was published, along with their aggregated attributes, which — joined with the read labels — will help us learn a user’s likelihood of reading a blog.
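To close the loop, here’s one way the training step might look, sketched with scikit-learn’s LogisticRegression on a hand-made feature matrix shaped like the table above. Any classifier that outputs probabilities would do; the numbers and labels are illustrative, not real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training matrix mirroring the feature table:
# [status, source_Medium, source_Twitter, page_views, time_spent]
X = np.array([
    [1, 0, 1, 5, 10],
    [1, 0, 1, 7, 14],
    [1, 1, 0, 3, 4],
    [1, 0, 1, 13, 12],
    [1, 1, 0, 5, 7],
    [1, 0, 0, 2, 8],
])
y = np.array([1, 0, 1, 0, 1, 1])  # read labels from the labeled dataset

# A single model across categories for simplicity; one model per
# category is the alternative discussed in Food for thought
model = LogisticRegression().fit(X, y)

# Probability that each user reads the next blog, given their attributes
probs = model.predict_proba(X)[:, 1]
print(probs.shape)
```

Ranking subscribers by these probabilities — and emailing only those above a chosen threshold — is the targeted campaign the whole pipeline builds toward.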
TL;DR
Let’s recap what we did today:
Explored the idea of personalized marketing for my blogs and its ability to enhance user engagement.
Delved into how to build a predictive model using user data to predict the likelihood of user engagement with specific blog topics, allowing me to send targeted emails to my users.
Identified potential features, defined the criteria and constraints to capture their essence, and developed an approach to consolidate features such as total page views and subscription status over a specific period to create training data for the predictive model.
Food for thought
A drawback of our approach is that it keeps subscribers from discovering new topics they might like. How do we combat this? Recommending new blogs can help here.
What are some other variables you can use to predict likelihood? Think about incorporating factors such as user demographics or browsing history.
How do we test for aggregations over different time periods? A cross-validation approach is needed with our solution.
Would you build multiple models for each category or just one? Why the preference? Things to consider here are the size of your dataset, variance across categories, and trade-offs between model complexity and interpretability.
Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data.
If you don’t want to miss my next post, consider subscribing here, it’s free and you’ll be notified when I post my next blog.