Catching Data Issues Before They Catch You
~ Why data issues often go unnoticed and what you can do about it.
A key difference between working on traditional software applications and working on data products is how issues get identified and resolved. Data issues are often discovered in hindsight, either by users interacting with the data or by an analyst reviewing historical data. Why aren’t data issues caught earlier? Let’s take a deeper look.
To start with, traditional applications have an expected ideal behavior. For example, hitting “post” on Instagram should publish your post. When you’re dealing with data, however, there is no ideal state. If you’re reporting on sales, there are expectations of how the data should look, but these are just educated guesses based on past data or business context. Not having this “ideal” state makes it harder to determine whether the data is accurate or whether there’s an issue.
Another key component is feedback. Even in traditional software, some issues slip past developers into production. But once the app reaches users and isn’t doing what it’s supposed to, the issue becomes very clear. Data users, on the other hand, rely on the data to make decisions and don’t really know what to expect, so the feedback signal is either weak or missing entirely.
To add to the complication, some data issues are only visible from a macro perspective. For example, a small drop in daily conversion rates for an e-commerce store might not seem significant: a weak signal. A decline sustained over weeks or months, however, becomes a strong signal. The challenge is knowing which signals to act on and which to ignore.
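One way to separate a sustained decline from day-to-day noise is to compare rolling averages over time. Here’s a minimal sketch in Python; the window size and the strict “every window is lower than the last” rule are illustrative assumptions, not a standard method:

```python
def sustained_decline(values, window=7):
    """Flag a sustained decline: True when every rolling-window
    average is lower than the one before it."""
    averages = [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
    # A single bad day barely moves the averages; a persistent
    # slide shows up across every consecutive window.
    return all(later < earlier for earlier, later in zip(averages, averages[1:]))
```

A one-off dip won’t trip this check, but a steady slide will, which is exactly the weak-versus-strong distinction above.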
There are also data issues that are only visible from a temporal perspective. For example, an upward trend in sales over a few months can turn out to be a downward trend when viewed year-over-year. The temporal component adds a frame of reference for identifying the issue, but only in hindsight. So what’s the ideal timeframe for analyzing data?
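Computing the year-over-year view is simple once you keep enough history around. A hypothetical sketch, assuming sales are stored as twelve monthly totals per year:

```python
def yoy_change(monthly_sales, year, month):
    """Fractional year-over-year change for one month.
    `monthly_sales` maps a year to a list of 12 monthly totals
    (an assumed data shape for illustration)."""
    current = monthly_sales[year][month]
    prior = monthly_sales[year - 1][month]
    return (current - prior) / prior

# A recent upward trend (100 -> 105 -> 110) can still be
# negative when compared against the same month last year:
sales = {2023: [120] * 12, 2024: [100, 105, 110] + [0] * 9}
print(yoy_change(sales, 2024, 2))
```

Same numbers, two frames of reference, two opposite stories.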
Many things that seem to be an issue can be brushed off with “It’s just data!”. In some cases that’s valid: not every data point has an explanation, and knowing every factor that influences the data is impossible. So how do you know what to trust? When should you act, and when should you hold off?
Here are some tips that help answer the above questions:
Guardrails for Key Metrics: Set up thresholds for important metrics to help diagnose data issues earlier. Monitor these metrics regularly to catch abnormalities before they escalate.
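A guardrail can be as simple as a band check that runs after each pipeline load. A minimal sketch, where the metric name and the thresholds are made-up examples:

```python
def check_guardrail(metric, value, lower, upper):
    """Return an alert string when `value` leaves the [lower, upper]
    band, or None when the metric looks healthy."""
    if value < lower:
        return f"{metric} below guardrail: {value} < {lower}"
    if value > upper:
        return f"{metric} above guardrail: {value} > {upper}"
    return None

# Illustrative thresholds; in practice you'd derive them from
# historical data or business context.
alert = check_guardrail("daily_conversion_rate", 0.011, 0.02, 0.08)
```

Wire a check like this into your pipeline and route any non-None result to wherever your team already looks: Slack, email, a monitoring dashboard.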
Data Quality Checks: Start with basic checks like ensuring there are no duplicates and validating data types. As you learn more about the data, you can add more advanced checks, such as monitoring standard deviations.
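The basic checks can be a few lines of plain Python before you need anything heavier. A sketch with hypothetical column names:

```python
def basic_checks(rows, key, expected_types):
    """Collect duplicate-key and type-mismatch issues
    from a list of row dicts."""
    issues, seen = [], set()
    for row in rows:
        if row[key] in seen:
            issues.append(f"duplicate {key}: {row[key]}")
        seen.add(row[key])
        for column, expected in expected_types.items():
            if not isinstance(row[column], expected):
                issues.append(f"unexpected type in {column}: {row[column]!r}")
    return issues

# Hypothetical order data with one duplicate id and one bad type:
orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 1, "amount": "25.0"},
]
print(basic_checks(orders, "order_id", {"amount": float}))
```

Once the basics are stable, the same loop structure extends naturally to range checks or the standard-deviation monitoring mentioned above.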
Periodic Reviews: Get your users to look at the data on a regular basis. This can be facilitated via a dashboard of key business metrics that decision makers review weekly, ensuring major issues are addressed promptly.
Set Expectations: Having a frame of reference helps put things in perspective. One way to do this is using models to forecast trends. Two things to note: focus on the confidence intervals rather than the point predictions, and don’t rely on long-term forecasts, as they tend to be unreliable.
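Even without a forecasting model, you can approximate an expectation band from recent history: mean plus or minus a few standard deviations. This is a crude stand-in for a model’s confidence interval, using only the Python standard library; the two-sigma width is an assumption, not a rule:

```python
import statistics

def expectation_band(history, sigmas=2):
    """Expected range for the next value: mean +/- `sigmas`
    standard deviations of the recent history."""
    mean = statistics.mean(history)
    spread = sigmas * statistics.stdev(history)
    return mean - spread, mean + spread

def within_expectations(history, new_value, sigmas=2):
    """True when the new value falls inside the expectation band."""
    low, high = expectation_band(history, sigmas)
    return low <= new_value <= high
```

The point isn’t the exact width of the band; it’s that a new value landing outside it deserves a look rather than a shrug.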
Have more ideas? Let me know below.