4 Key Takeaways from 4 Years in Data Engineering in 4 Minutes
It’s been 4 years since I became a data engineer, and today I want to summarize my learnings for folks who are considering it as a career option, or for anyone who’s just curious about what data engineers do. So here are my 4 key takeaways from 4 years in data engineering, summarized in 4 bullets that’ll take you 4 minutes to read.
Writing Production SQL
Let’s be honest: everyone thinks they know SQL. After your very first database class or online course, you think you’re good to go. Now, part of this is true; the beauty of SQL is its simplicity. It’s also why SQL has stood the test of time, as opposed to other data paradigms that Hive-d and fizzled out.
However, while the SQL you know can get you to a point where you can analyze data and answer questions, as a data engineer, you’re not merely answering questions with data but building pipelines that play with data at scale. This requires a bit more skill than merely knowing how to do a ‘SELECT *’. A data engineer understands how data flows through the query and writes SQL that’s also performant, something that can run on a daily or even hourly basis while getting the job done.
You’re considering things like the volume of the data and the scalability of your solution. You’re dealing with terabytes or petabytes of data, so the sub-query or window function that worked for you once will now time out due to sheer volume. This is where you have to get creative with your SQL queries. The price you or your employer pays for bad SQL is not just long run times but also high cloud costs.
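To make this concrete, here’s a minimal sketch of the kind of change that takes a query from “works on my laptop” to production-ready. The table, columns, and date math are made up for illustration, and the exact syntax varies by warehouse; the point is to prune to the partition you actually need and select only the columns you need instead of scanning everything.

```sql
-- Expensive pattern: full scan plus a window function over all of history
-- SELECT *,
--        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) AS rn
-- FROM orders;

-- Cheaper pattern for a daily run: touch only yesterday's partition,
-- pull only the columns downstream steps actually use
SELECT order_id,
       customer_id,
       order_total,
       order_ts
FROM orders
WHERE order_date = CURRENT_DATE - 1      -- partition filter does the heavy lifting
  AND order_status = 'COMPLETED';
```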
Check out my post about how to make your SQL production-ready here.
Being a Data Detective
A big proportion of your time as a data engineer is spent answering the simple question: “Why doesn’t this match that?” This usually pertains to data discrepancies that arise when the same data is looked at from two different sources. For instance, sales data from your internal CRM might differ from what’s reported by an external payment processor. The minute the data leaves the source, it’s subject to change, others can interpret it differently, and so on. As a data engineer, you’re responsible for figuring out all of this mess!
This entails looking at what’s going on inside the pipeline, how the data is transformed, how it’s sent to different destinations, how they deal with it, and so on. This requires an ability to go deep into things. Being a detective is such a big part of being a data engineer that I’d go as far as to say that if you’re not someone who likes going deep on problems, you might not enjoy data engineering as much. It requires patience, because you’re jumping into rabbit holes with no end in sight. I guess conspiracy theorists might have what it takes to make good data engineers!
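To make the CRM-versus-payment-processor example concrete, here’s the kind of reconciliation query that kicks off a lot of these investigations. The table and column names are hypothetical, and the NULL-safe comparison syntax varies by warehouse; the idea is to line both sources up side by side and surface only the days that disagree.

```sql
-- Compare daily sales totals from the internal CRM against what the
-- payment processor reports, keeping only the days that don't match.
WITH crm AS (
    SELECT sale_date, SUM(amount) AS crm_total
    FROM crm_sales
    GROUP BY sale_date
),
processor AS (
    SELECT settlement_date AS sale_date, SUM(amount) AS processor_total
    FROM processor_settlements
    GROUP BY settlement_date
)
SELECT COALESCE(crm.sale_date, processor.sale_date) AS sale_date,
       crm.crm_total,
       processor.processor_total,
       crm.crm_total - processor.processor_total AS diff
FROM crm
FULL OUTER JOIN processor
  ON crm.sale_date = processor.sale_date
WHERE crm.crm_total IS DISTINCT FROM processor.processor_total  -- NULL-safe "not equal"; syntax varies by warehouse
ORDER BY sale_date;
```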
Here’s a step-by-step approach to being a good data detective.
Designing Pipelines
With AI now more than capable of writing code, if you ever wonder what a data engineer still does, the simple answer is design. As Sahil suggests, the new wave is being a “Design Engineer”. Designing the architecture for data pipelines is where the human touch is still needed (at least for now :p). This also happens to be the most creative aspect of the role. Here’s how it usually goes down: you’re notified about a new source of data coming in and how it has to be used. As a data engineer, you’re now thinking about how it will be ingested into your data warehouse, building a data model to store it efficiently, making it available to users, and helping them derive value out of it.
Here are a few decisions you’re making in this process: Should I process this data sequentially or in parallel? Should I bother writing intermediate results to staging tables or just use CTEs? Should this be a batch or a streaming process? What should the frequency be? How do I process only the delta records? How do I deal with PII data? How do I make the pipeline scalable for higher volumes? How do I make sure the data is up-to-date? How often should I refresh the data to keep it from going stale? How much historical data should I retain?
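One of those decisions, processing only the delta records, might look something like the sketch below. The tables, columns, and hash function are made up for illustration, and MERGE syntax differs slightly across warehouses; the idea is to upsert only what changed since the last run and to avoid landing raw PII in the warehouse.

```sql
-- Incremental upsert: take only the records that arrived since the last load,
-- hash the PII column on the way in, and merge them into the target table.
MERGE INTO dim_users AS target
USING (
    SELECT user_id,
           SHA2(email, 256) AS email_hash,   -- store a hash, not the raw PII value
           country,
           updated_at
    FROM raw_events
    WHERE updated_at > (SELECT MAX(updated_at) FROM dim_users)  -- delta records only
) AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET
    email_hash = source.email_hash,
    country    = source.country,
    updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (user_id, email_hash, country, updated_at)
VALUES (source.user_id, source.email_hash, source.country, source.updated_at);
```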
The answers to these questions can make or break your data infrastructure, which is why good design skills matter. After a certain point, you get the hang of getting things done with good SQL, and that’s when the focus shifts to honing design skills and working alongside other engineers to help them get up to speed.
Here’s a problem to get you started on your data design journey.
Bridging Data and Decisions
Data is of no use if it’s not driving business outcomes, whether that’s via a report used by executives, a model, or a source for another application. As a data engineer, you are the bridge between the data and its users. This requires you to know not only the whereabouts of the data but also how it’s used and the necessary business context. You need to be able to switch from writing SQL to explaining it in business terms to a stakeholder.
Being the “owner” of the data, you become the go-to contact when things go wrong with it. Only you see what’s going on behind the scenes, which makes you the lens through which others look at the data. With this power comes the responsibility to represent the data and the underlying business logic accurately.
It’s your responsibility to make life easy for the users who bridge the gap from data to business outcomes. Because no matter how well you build it, if it isn’t used correctly, or worse, isn’t used at all, it all adds up to nothing. Connecting data with business outcomes is also worth it for you: some things only make sense when you know the complete picture, a perfect give-and-take between you and the users.