Analytics with A/B testing: How do I annoy my customers less?
~ Using A/B testing to test an email marketing strategy
Backstory:
As you know from my previous posts, I’ve started a mailing list to promote my new blog. In my last post, we did some time traveling with SQL and designed an approach for historical analysis of data; if you missed that, you can check it out here. This time we take a walk down a different realm of analytics: A/B testing. If you’re not familiar with A/B testing, it’s essentially a way to test a new idea and measure its impact in a controlled way; you can read more here. Now, an honest disclaimer: unlike with SQL, I’m not an expert in A/B testing. On the contrary, I’m relatively new to the field, but I find it fascinating because of its capabilities.
Problem:
If you check out my blog, you’ll notice I write about more than one topic, and the mailing list is meant to promote the entire blog. However, I have subscribers who signed up for a particular topic; for example, some readers are only interested in getting notified about my “data” blogs. The question I’m looking to answer is: are these readers also interested in the other topics I write about? The way I plan on testing this is by emailing these subscribers about blogs on other topics, like “food”. Now I know this sounds like a bad idea to start with, but you never know until you test it, right? Hence the plan is to first try this idea in a controlled manner with A/B testing before going berserk.
STOP! Before reading ahead, take some time to think about the problem yourself; then compare your thinking with my approach.
My approach:
Who to test?
So the first step is to think about who to test. We’re trying to test folks who are subscribed to my email list specifically for “data” blogs to see if they’re interested in other topics too. The way A/B testing works is you split the target population into two groups, an experiment group and a control group, where the former gets exposed to the change and the latter doesn’t. So in our case, we’ll be emailing folks in the experiment group about other topics while the control group will be emailed only data blogs.
Now that we have a clearer picture of the two groups, how do we select candidates for them? Our goal with the selection is to avoid any “bias” so the groups are comparable and the results are not driven by some other factor. A naïve way to do this would be a random sample of the mailing list, like a coin toss for each subscriber to decide their group. However, even with random selection, we may end up with biased groups because we didn’t account for the factors that create bias in the first place. Hence, we opt for a stratified sample, where the population is first split into different strata and we then pick randomly from each, aiming for balance across them. For example, we know different people read a blog with varying levels of attention. To avoid ending up with all the fervent readers in one group, which would skew the results because that group innately cares more than the other, we can stratify the sample by historical average reading time. Some other variables we may want to account for are demographics (age group, gender, etc.), activity, and recency of visit.
If you’re curious how one can actually do this, it can be achieved easily with the train_test_split method as follows:
from sklearn.model_selection import train_test_split
import pandas as pd

# Bin continuous variables first so each stratum has enough subscribers in it
mailing_list["reading_time_bin"] = pd.qcut(mailing_list["reading_time"], q=4)
mailing_list["recency_bin"] = pd.qcut(mailing_list["recency"], q=4)

# 50/50 split, stratified on the combination of these columns
experiment_group, control_group = train_test_split(
    mailing_list,
    test_size=0.5,
    stratify=mailing_list[["reading_time_bin", "age_group", "gender", "recency_bin"]],
)
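While we’re here, it’s worth sanity-checking that the split actually came out balanced. A minimal sketch, assuming the same hypothetical columns as above:
# Proportions of each stratum should be nearly identical across the two groups
for col in ["age_group", "gender", "reading_time_bin"]:
    print(experiment_group[col].value_counts(normalize=True))
    print(control_group[col].value_counts(normalize=True))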
Also while we’re at coding, this would be the time to clean your data which would involve things like removing outliers. For example, in my case, since my parents are also on my mailing list and they’ll read all my blogs irrespective(❤)️, it would make sense to leave them out of this experiment to avoid skewing the results.
What to test?
Now that we have an idea about who we’ll be testing, let’s think about what we’re actually testing. In A/B testing we usually pick one or more metrics to track across both groups and base our decision on. Here are some points to think about before choosing a metric:
It should be close to the change so that it can capture the effect of your change more effectively. For example, if you’re testing a change on an e-commerce website’s button, something like a click-through rate would be closer to your change as opposed to the sales conversion rate.
You’re also looking for something that is specifically affected by your change and remains fairly constant otherwise to rule out other factors affecting your results. This is why it’s beneficial to have an invariant and an evaluation metric. The former is something that you don’t expect to change and will serve as an anchor for your test to make sure your change is not bleeding into things it shouldn’t. Whereas the evaluation metric is one that would be sensitive to your change.
In our case, possible evaluation metrics are the number of views, opens, clicks, or anything that makes a user go from the email to the actual blog. However, since I’m more concerned that this change may annoy my readers and lead them to unsubscribe (unsub) from my mailing list, the metric I’ll choose is the number of unsubs, to test whether the change is potentially harming my mailing list. The individual subscriber, on whom we observe an unsub or not, is what’s called the unit of analysis.
For the invariant metric, on the other hand, we can choose something like the average reading time (for data blogs). Given our groups are sampled correctly, the average reading time for new data blogs is expected to stay roughly constant across the groups.
Evaluation Metric = Number of unsubscribes
Invariant Metric = Average reading time
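Once the test has run, computing both metrics per group is straightforward. A sketch assuming hypothetical unsubscribed (boolean) and reading_time columns collected during the test:
for name, group in [("experiment", experiment_group), ("control", control_group)]:
    # Evaluation metric: unsubs (count and rate)
    unsubs = group["unsubscribed"].sum()
    print(name, "unsubs:", unsubs, "rate:", round(unsubs / len(group), 4))
    # Invariant metric: average reading time on data blogs should stay comparable
    print(name, "avg reading time:", round(group["reading_time"].mean(), 2))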
What’s my theory?
Now that we’ve determined the metric, we have a better picture of what we plan to test. So let’s put pen to paper and come up with the hypothesis for our test which is a way to set expectations for the test. If we think about our goal, we’re trying to measure if the change is causing increased unsubs in the experiment group as opposed to the control. So along those lines, our Null Hypothesis(H0) which is something we deem to be true is that there’s no difference between the two groups. Unless proved otherwise through the test in which case the alternate hypothesis(H1), which is something we’re looking to test, is TRUE. In our case since we’re only concerned with the negative impact of our change, specifically if unsubs are higher in the experiment than control, which makes it a one-sided test.
Null Hypothesis(H0):
No significant difference in the number of unsubs across both the experiment and control group.
Alternate Hypothesis(H1):
Experiment group has higher unsubs as opposed to the control group.
How many to test?
Now that we have a clear picture of what we’re looking for, we’re ready to jump into the logistics of the test and address questions like: how many subscribers should we test? An easy way out is to test the entire mailing list, half in each group. However, that’s not the best idea, because this is still a test and in the worst case I may end up losing half my mailing list, which won’t be very nice. Another thing you can do is slowly ramp up the exposure of the test to the experiment group to prevent adversities. So I want to test no more subscribers than I need to be confident in my decision. How do I get this magic number?
Enter Evan’s Awesome A/B Tools (can’t go wrong with “awesome” in the name). This is a set of tools that do a lot of the required stats behind the scenes so you can make a decision. Here, the sample size calculator will answer the “how many”. However, it does need to know a little about our test setup through some parameters we can set, so let’s get into those. Starting with the “baseline conversion rate”, which, as the name suggests, is the conversion rate without the change and sets the baseline for the test. This is the value of the metric observed historically, so in our case it becomes the unsub rate that has been observed so far. The higher the baseline (up to 50%), the bigger the sample you’ll need, because the metric gets noisier relative to the same absolute change. Let’s assume this number is 13% in our case.
The next thing you need is the “Minimum Detectable Effect” (Dmin), which is basically how much change you would consider significant. In our case, would a 2% increase in unsubs be enough for me to call it quits? This has to be one of the most subjective parts of A/B testing. The smaller the change you want to detect, the bigger the sample size you’ll need, and vice versa. This is a real struggle, because for most things we test we expect only a small change in the metrics; if it were otherwise, would you even need to test it? For our test, let’s settle for a 2% change. Now if I plug these in, I get a sample size of 4,523 for each of the groups, so I’ll need to test this change on ~9k subscribers. This will also determine how long I need to run the test to expose the change to this many subscribers.
If you look carefully, there are also two more parameters at the bottom which more or less indicate the amount of error you’re okay with, viz. alpha and beta. Alpha is the probability that you reject the null hypothesis when it was actually true, and beta is the probability that you fail to reject the null hypothesis when it should’ve been rejected.
Baseline conversion rate for unsubs = 13%
Minimum Detectable Effect(Dmin) = 2%
alpha = P[Reject H0 | H0 is TRUE] = 5%
beta = P[Fail to reject H0 | H0 is FALSE] = 20%
Required sample size = 4,523/group
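If you’d rather compute this in code than in the calculator, here’s a sketch using statsmodels. It uses a slightly different formula than Evan’s tool, so expect a number in the same ballpark rather than exactly 4,523:
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.13   # observed unsub rate
dmin = 0.02       # minimum detectable effect (absolute)

# Standardized effect size (Cohen's h) between baseline and baseline + Dmin
effect_size = proportion_effectsize(baseline + dmin, baseline)

# Sample size per group at alpha = 5% and power = 80% (i.e., beta = 20%)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
    ratio=1.0,
)
print(round(n_per_group))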
The aim of hypothesis testing is to make conclusions about the population with merely a sample of it. The larger the sample, the closer its distribution gets to normal, the lower the variability, and the better the picture of the population. If the required sample size is more than what you can afford, you can reconsider the Dmin and settle for detecting only a larger, more pronounced change, since the baseline rate is out of your control for now.
How to test:
In our case, since the metric we’re testing is a “rate” (of unsubs) across the groups, we’ll perform the “Chi-Squared Test” to test our hypothesis. Again, there’s a section in the tool that does this for you (see here). If the metric were a continuous variable like reading time, we’d opt for the t-test instead. Additionally, we need to specify the confidence level for our test. A level of 95% means we accept only a 5% chance of declaring a difference when none actually exists (the alpha from before). So the higher the level, the more confident you are about the results, but the harder it is to show a significant difference.
If we play along with our example and expect the control group to have a 13% unsub rate (588/4,523), then if the experiment group has more than 652 unsubs, the difference becomes significant, showing that the change is indeed making me lose subscribers.
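If you want to run the test in code instead of the tool, here’s a minimal sketch with scipy, using the illustrative counts from above:
from scipy.stats import chi2_contingency

n = 4523
control_unsubs = 588
experiment_unsubs = 652

# 2x2 contingency table: [unsubscribed, stayed] for each group
table = [
    [experiment_unsubs, n - experiment_unsubs],
    [control_unsubs, n - control_unsubs],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)  # a p-value below 0.05 would flag a significant difference
Since our hypothesis is one-sided, you could alternatively use a one-sided two-proportion z-test (e.g., statsmodels’ proportions_ztest with alternative="larger"), which is a bit more powerful for this setup.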
Now what?
The way I look at it, A/B testing is not the endpoint but the starting point where you begin to ask questions about the result. Rather than implementing the change directly (if your test suggests it), it’s important to think about the results first. Were there other factors that may have influenced the test? In my case, a blog going viral might diminish the number of unsubs. Another thing to look at is the tradeoff between the “lift” brought by the change vs. the cost of implementing it. If everything looks good and the results favor the change, then let it rip!
Food for thought:
What other metrics can you think of for the change (both invariant and evaluation)? How would you define them?
How does the unit of analysis affect the test results?
What are the tradeoffs around the sample size? How could you work with a smaller sample?
If the test results favor my change, what other tests or analyses can I do to validate it?
If the test results favor my change but there’s an additional cost to send emails about multiple blog topics, how would you evaluate if it’s worth it?
Let me know how you’d approach it. If you found this helpful share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data.