How to play a Data Detective
From my 3 years in a data role so far, I’d say one of the most essential skills you need across any data role be it an analyst, engineer, or a scientist, is being a good data detective. With data being data, one thing you’ll soon realize when working with it is that it’s never what you expect it to be. This could be numbers not matching in your report, bad data in your pipeline, or even bad feature data for your models. As the data detective, you will be tasked with the question of “WHY” it’s not what everyone expects it to be. So on that note, today I’ll be going over some tips that have helped me be a good data detective with an aim that these may come in handy for you too.
Know Your Data
I know at this point I sound like a broken record but I'll say it again, knowing your data is the most important thing when working with data. By "know your data", I mean comprehending it from both business and engineering perspectives. Let's say you're working with a table, here are some things you should know:
Business context:
What use case was it built for?
How is this data used today?
What business impact does it have?
What does it mean if this data is stale? Is it still good or unusable?
What are the different fields in the table and what are their business definitions? Note: Often looking at the field name is not enough.
Engineering context:
What's the source of this data?
What is the granularity or the key identifier for the table?
Is this an aggregated table?
If so, what was the source table? What was its granularity and how was the data aggregated?
How frequently is the data in this table updated?
How are the different fields generated? Are they passed directly from the source or are they calculated? What are the data types and how is missing data handled?
Only after you've answered most of these questions, are you in a position to use the table in any capacity. This small exercise will save you a lot of headaches later on and the knowledge gained here will come in handy whenever you revisit the particular table. Now you'll know how to use this table in more context beyond your current use-case.
Look at the Source and its Source
Great, now that you know the data well, you can jump into the data issue at hand. This issue could range from incorrect or missing data to data that hasn't been updated or doesn't align with your expectations. In most of these cases to get to the bottom of the issue, you'll need to go deeper into the data and that means going into its source to see how the data ends up in the table in the first place.
Now the source of a table can be different things like a file from a vendor, a Kafka topic that streams data from an application, a database, a data pipeline, another table in your data warehouse, or a combination of all these. In most of the cases, your data issue can be traced back to the source. Here, you'll define the next steps for resolution, and here are some potential scenarios to consider:
Issue Originates at the Source: If you determine that the data issue originates at the source, it may be necessary to engage the relevant authority responsible for the source. Resolving the issue at the source can prevent it from recurring downstream.
Issue Lies in Data Usage: Sometimes, the source data is accurate, but the problem arises from how the data is used downstream, in which case see below.
Deeper Investigation Required: If the solution isn't immediately apparent, you may need to dig even deeper by exploring the source of the source. This recursive process can lead you down a similar investigative path. In my experience, it's often when you go two to three layers deep that you make significant breakthroughs, so patience is key!
Practice the Data Point of View (POV)
Now in most cases, your data undergoes various transformations and aggregations as it moves from the source to its destination. Hence the next line of action is to dive into this transformation logic and the best way to do this is by breaking it down. The goal is to understand how data flows by stepping through all the steps and identifying the specific part that may be causing the issue. This approach, which I like to call practicing the data POV, involves the following key strategies:
Don't just eyeball it: Avoid merely eyeballing the SQL query because even if you think you know how the data flows through your queries, data can surprise you. Here's one example of this: While working on a data issue, I encountered SQL that made sense in every way however its result didn't. After breaking it down, I narrowed down the issue to a JOIN which looked correct as it was being done on the same field, same data types, same valu... Ah! turns out one of the values passed in the JOIN was slightly different in both the tables ("text" and "txt") which was breaking the whole query.
Look at the complete picture with a narrow lens: Run through all the data transformation steps using sample data. This could be picking a specific user_id and only observing how the data changes for this user. However, a common trap to watch out for is if your logic behaves differently for different types of users. This is where knowing your data again comes in handy yet again.
Freeze it before: Another thing I practice while doing the data POV is freezing the data to narrow the scope. This could be done in multiple ways for example copying over data to a test table while you look into it. This makes sure the data you're looking at is static and the results of your analysis are repeatable. It's hard to diagnose issues if your data is changing.
Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on X @abhishek27297 where I talk data.
While in school, my understanding of where #data lives was only limited to CSV files. However, I soon realized the data landscape in organizations is far more complex.
Check out my latest blog, I delve into the two primary realms where data resides. https://t.co/kQgbE4Yze8— Abhishek Singh (@abhishek27297) September 17, 2023
If you don’t want to miss my next post, consider subscribing here, it’s free and you’ll be notified when I post my next blog.