How to bounce-proof your mailing list using Data
Backstory:
As you know from my previous posts, I’ve been developing data solutions for my blog to gain insights from user behavior on my blog. In the last post, we made my blog’s clickstream data useful by building user metrics, see here. Now along with my blog, I also have a mailing list that I do data things with it. Previously, we built a subscriber history table to analyze user interactions. This time we’ll revisit it as there’s a problem that may lead to me being a scammer. What we’ll be doing this time will make you relate with the dark knight, where you do good but end up becoming the villain in people’s eyes. Time for redemption, so let’s jump in.
The Problem:
“The math ain’t mathing” for my mailing list, every time I try to send an email to all my subscribers, the data shows that it was delivered to only a fraction of them, turns out some of my emails “bounced”. Now I want a way to handle these bounces so I don’t end up sending emails that will never reach the inbox and inflate my ego with fake subscribers. Another major concern besides my ego is that a high bounce rate affects my reputation with the email servers, which will end up me being labeled a “scammer”.
The Data:
Here’s my mailing list table, subscriber_feed_current, that we’ve built:
+---------+----------------+-------------+--------------+----------+
| user_id | mailing_status | opt_in_date | opt_out_date | source |
+---------+----------------+-------------+--------------+----------+
| 123 | I | 1/3 | - | LinkedIn |
| 456 | I | 1/7 | - | Medium |
| 789 | I | 1/3 | - | Twitter |
| 999 | O | 1/29 | 1/30 | Twitter |
+---------+----------------+-------------+--------------+----------+
Let’s assume the delivery response is stored in email_delivery_history:
+------+---------+----------------+
| date | user_id | delivered(Y/N) |
+------+---------+----------------+
| 1/1 | 123 | Y |
| 1/1 | 456 | Y |
| 1/1 | 789 | N |
+------+---------+----------------+
STOP! Before reading ahead here’s where I want you to take some time and think about the problem first.
My approach:
Let’s start by understanding what “bounce” means regarding emails. Any time an email is not delivered as intended it’s regarded as a “bounce”. Here are some common causes for email bounces — invalid email, a full inbox, server issues, content, etc. Having that out of the way, we still have to figure out a way to handle these to save my reputation.
Now the easiest thing is unsubscribing these emails, however, given a full inbox can also cause a bounce, this seems drastic like being expelled from school for forgetting your homework. Hence it makes sense to pardon some bounces. A common approach is to classify bounces as “soft” and “hard”. A soft bounce is caused by trivial reasons and is reversible, whereas a hard bounce may be permanent, caused due to an invalid email or something similar.
Unfortunately, the data I have only tells me if the email was delivered or not. However, using this I can develop my own way to classify bounces. Let’s consider every bounce to be a “soft” bounce to start with and if the bounces persist, after a certain number I can graduate it to a “hard” bounce and then remove the email from my mailing list. Let’s design this solution in terms of data, first, we’ll need a record of the number of bounces for each user_id. This bounce_score would be a separate table derived from email_delivery_history and look like this:
+---------+--------------+
| user_id | bounce_count |
+---------+--------------+
| 123 | 2 |
| 456 | 1 |
| 789 | 3 |
+---------+--------------+
We can build using the latest info about users that bounced and update the table by either inserting a record for new bounces or updating the bounce_count if there’s a match. Here’s how we build this with SQL:
MERGE INTO bounce_score AS target
USING (
SELECT *
FROM email_delivery_history
-- latest users that bounced
WHERE date = CURRENT_DATE()
AND delivered = "N"
) AS source
ON source.user_id = target.user_id
WHEN MATCHED
THEN UPDATE
-- increment bounce_count
SET target.bounce_count = target.bounce_count + 1
WHEN NOT MATCHED
-- add to table
THEN INSERT(user_id, bounce_count)
VALUES(source.user_id, 1)
Now that we have the bounce_count for each user_id, I leverage this to label each user indicating their state — soft/hard bounce. As this will be an identifying trait for users, I’d want to add it as a new dimension in our subscriber_feed_current. Additionally, for users with a “hard bounce” status, I’d want to unsubscribe them myself as there’s no point emailing them now. From our example, here’s how the subscriber_feed_current will look after the email send:
See how bounce_status was added and user_id = 789 was unsubscribed.
+--------+--------+-------------+--------------+--------+-------------+
| user_id| status | opt_in_date | opt_out_date | source |bounce_status|
+--------+--------+-------------+--------------+--------+-------------+
| 123 | I | 1/3 | - |LinkedIn| soft bounce |
| 456 | I | 1/7 | - | Medium | soft bounce |
| 789 | O | 1/3 | 2/1 | Twitter| hard bounce |
| 999 | O | 1/29 | 1/30 | Twitter| - |
+--------+--------+-------------+--------------+--------+-------------+
So let’s do this with SQL now, we can use a MERGE statement that sets the bounce_status using the bounce_count for users and updates the mailing_list_status to unsubscribe users with “hard bounce”. A caveat here is we don’t want to make any updates if the user is already unsubscribed.
MERGE INTO subscriber_feed_current AS target
USING bounce_score AS source
ON target.user_id = source.user_id
WHEN MATCHED
THEN UPDATE
SET
-- calculate bounce_status using bounce_count
target.bounce_status = CASE WHEN source.bounce_count >= 3
THEN "hard bounce"
WHEN source.bounce_count BETWEEN 1 AND 2
THEN "soft bounce"
ELSE CAST(NULL AS STRING)
END,
-- unsubscribe users with "hard bounce" if not already
target.status = CASE WHEN source.bounce_count >= 3
AND target.status = "I"
THEN "O"
ELSE target.status
END,
target.opt_out_date = CASE WHEN source.bounce_count>=3
AND target.status = "I"
THEN CURRENT_DATE()
ELSE target.opt_out_date
END
Now that we have our solution, let’s tie it all together with the data flow.
Before we call it a day, a drawback of this approach is that it’s remedial as we’re taking steps after some damage is already done. Can we do better and build something preventative? A major factor leading to email bounces is invalid emails. To combat this we can add checks to see if the emails captured are valid before sending an email. A few examples of these checks are — checking for an “@” symbol in the email, checking if the domain is amongst a list of valid domains, and so on. If an email doesn’t pass these checks, we can mark it as “blacklisted” and not add it to the email list in the first place.
TL;DR
Let’s recap what we did today, we started by understanding email bounces which are leading to discrepancies between my subscriber and email deliverability counts and posing the threat of me being tagged as a scammer. Then we designed a solution using SQL that maintains a count of users’ bounces and labels them accordingly potentially unsubscribing them if needed. Finally, we talked about the drawbacks and discussed preventative approaches to avoid email bounces.
Food for thought
How can we revive a user after a “hard bounce”? This will be needed if the “hard bounce” was attained due to erroneous factors.
How would you design the data flow? What are some steps you’d add?
Even though we don’t add “blacklisted” emails to our list, how do we capture the fact that the user attempted to subscribe with a fake email?
Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data. If you don’t want to miss my next post, consider subscribing here, it’s free and you’ll be notified when I post my next blog.