50 shades of a customer: Building a 360 view of customer with SQL
Backstory:
As you know from my previous posts, I’ve been analyzing data from my blog to provide my users with a better experience. In the last post in this series, we designed an approach to differentiate between new and returning users with an aim to get accurate counts of total users on my blog. If you haven’t checked it out already, you can find that here. This time we’ll work on enhancing our user_profile table (introduced in the last post) to account for profiles from multiple sources and tackle the issues that come with it. What we’ll be doing today will feel like being shown 3 photos of the same person: at a Halloween party, at the beach with shades, and in an early morning class with an aim to figure out if it’s the same person. So let’s revive your ‘spot the difference’ skills and dive in!
Problem:
If you recollect the user_profile table has all attributes for a user, we designed it with only clickstream events as a source. However, I want to extend this table to account for different sources like my mailing list and a possible eCommerce option(if I decide to put on the entrepreneur hat). I still want to be able to identify unique users, so if a user visits my blog(clickstream) and later signs up for my mailing list, I should be able to link these events to denote the same user rather than creating a new user_id which leads to an overestimation of user counts, great for my ego but bad for business.
Data:
We’re working with the user_profile table which looks a little like this:
+---------+--------------+-------------+------------+-----+
| user_id | type | web.ip | web.mac_id | ... |
+---------+--------------+-------------+------------+-----+
| 1 | WEB | 101.102.1.5 | 97tg2r | ... |
| 2 | WEB | 101.102.6.6 | oufg2q | ... |
| ... | ... | ... | ... | ... |
+---------+--------------+-------------+------------+-----+
STOP! Before reading ahead here’s where I want you to take some time and think about the problem yourself before reading my approach.
My approach:
Let’s start by considering my mailing list as another source of capturing user profiles. Now if I want to extend the current user_profile table, I can add columns for fields I want to capture. For example, the email address is a field that I capture when a user signs up for my mailing list, so I’ll add a column for that. The profile_type column will help me capture the source of the profile, “MAIL” in this case. Now if you think about it, adding a new user_id for the mailing list profile will only work if the user hasn’t visited my blog and directly signs up for my mailing list which is a hard sell. For the majority of the users, they’ll first encounter my blog and then sign up. This way they’ll make it to the user_profile table first with the clickstream event and then with the mailing list. If I merely add a user_id, it’ll be counting the same user twice. I need two things to solve this, first a way to decouple the profile from a user such that a single user can have multiple profiles for different sources, and second a way to identify profiles that fall under the same user.
Taking care of the first point is fairly straightforward, to decouple the profile from a customer, I merely need to address them with different IDs. This way I can have multiple profiles that belong to the same customer. Although a single unique profile can only belong to one customer. Here’s what this would look like:
+---------+------------+------+-------------+-------------------+
| user_id | profile_id | type | web_details | mail_list_details |
+---------+------------+------+-------------+-------------------+
| 1 | 1 | WEB | ... | ... |
| 1 | 2 | MAIL | ... | ... |
| 2 | 1 | WEB | ... | ... |
| ... | ... | ... | ... | ... |
+---------+------------+------+-------------+-------------------+
Next, we need to come up with an approach to identify profiles that belong to the same user. Let’s start by thinking about the customer journey, when a user visits my blog, I capture this event with clickstream data and create a record in the user_profile table assigning a new user_id. Now if the same user finds my blogs interesting and decides to sign up for my mailing list (if that’s true for you too, you can do that here), I capture this event in my subscriber_feed_history table which then also feeds into the user_profile table to create a record. The task at hand is to determine that these two users are indeed the same person. To do so we need to define rules to match the two different profiles to conclude they belong to the same user. Now you can make these rules really elaborate as you learn more about your users but the genesis should be identifying common attributes that help us make this decision.
+---------+------------+------+-------------+-------------+-----+
| user_id | profile_id | type | web.ip | mail.ip | ... |
+---------+------------+------+-------------+-------------+-----+
| 1 | 1 | WEB | 101.102.1.5 | - | ... |
| 2 | 2 | MAIL | - | 101.102.1.5 | ... |
| ... | ... | ... | ... | ... | ... |
+---------+------------+------+-------------+-------------+-----+
For the above example, let’s assume the common fields are the IP address and device ID. Now I can define a rule that when these fields have the same value for two different profiles, I will consider them as the same user. You may argue that matching profiles based on just these attributes doesn’t sound correct, this is exactly why big social media apps collect all the data they can, to know you better. Since what we do here is more benign, we’ll stick with just the two. Once you’ve identified the profiles to match (profile_id 1 & 2 here) you’ll need to decide which user_id should you pick, the one associated with the WEB profile or MAIL profile? Now you can make a hard call here like to go with the WEB user_id all the time which works but a data-driven approach is to come up with rules again. For example, we can pick users that have more associated clicks in the user_clickstream table which ties a clickstream event to the best user given different profiles. If there’s a tie, we can pick the user based on the profile type, like MAIL precedes WEB. Let’s see what this looks like in SQL:
-- Web profiles(derived from clickstream)
WITH web_profiles AS (
SELECT user_id, profile_id, web.ip AS ip,
web.device_id AS device_id
FROM user_profile, UNNEST(web) AS web
WHERE type = "WEB"
),
-- Mail profiles(derived from mailing list)
mail_profiles AS (
SELECT user_id, profile_id, mail.ip AS ip,
mail.device_id AS device_id
FROM user_profile, UNNEST(mail) AS mail
WHERE type = "MAIL"
),
-- Web & Mail profiles that match on the criteria
web_x_mail_matches AS (
SELECT wp.user_id AS web_user_id,
mp.user_id AS mail_user_id,
ARRAY_CONCAT_AGG([wp.profile_id,mp.profile_id])
AS profiles_to_merge
FROM web_profiles wp
JOIN mail_profiles mp
ON wp.ip = mp.ip
AND wp.device_id = mp.device_id
WHERE wp.user_id <> mp.user_id
GROUP BY wp.user_id, mp.user_id
),
-- Visit counts for users
visits AS (
SELECT user_id, COUNT(1) AS visit_count
FROM user_clickstream
GROUP BY user_id
)
-- Deciding final user_id for both profiles
SELECT
CASE WHEN v1.visit_count > v2.visit_count
THEN wp.user_id
WHEN v2.visit_count > v1.visit_count
THEN mp.user_id
ELSE mp.user_id -- both are equal
END AS final_user_id,
profiles_to_merge
FROM web_x_mail_matches wmp
LEFT JOIN visits v1
ON wmp.web_user_id = v1.user_id
LEFT JOIN visits v2
ON wmp.mail_user_id = v2.user_id
Let’s break down what we have here, first we start by matching the two different profiles on the earlier criteria. Then we pick pairs that match but have different user_ids. Finally, to figure out which user_id wins, we take into account the visit count for each user and choose the user_id for the profile with the most, if there’s a tie(ELSE statement) we pick the mailing list user_id as default. As a result, we have the final user_id and an array of profiles that should be under this user, like this:
+---------------+-------------------+
| final_user_id | profiles_to_merge |
+---------------+-------------------+
| 1 | [1,2] |
| ... | ... |
+---------------+-------------------+
Before you run off with this solution, let’s think about what more can we do to match profiles. What we did so far works great to match different profiles, but what if a user signs up twice for my mailing list with two different emails? Although there are two emails used, they both belong to the same user. Here’s where we can extend our process by doing intra-profile matching, basically converging multiple profiles of the same type into one. This matching can be a little tricky but not impossible and having some trivial checks is better than nothing. Finally, we don’t need to change the customer creation process designed last time, but we can create a new subsidiary process that performs this matching and sends the results back to update our database with the desired changes.
Food for thought:
What can be some other sources of user acquisition?
Can you think of some intra-matching rules for the web & mail profiles?
How can I generalize the matching rules? For when I add more profiles in the future like eCommerce.
How will I match these eCommerce profiles with the rest?
What would the user_id correction process look like?
Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data. If you don’t want to miss my next post, consider subscribing here. It’s free and you’ll be notified about when I post my next blog.