Where does data live? Database v/s Data Warehouse
Introduction
While in school, my understanding of where data lives was only limited to CSV files. Little did I know that the data landscape in big organizations is far more complex and dynamic. Today we’re delving into the two primary realms where data resides: the traditional Database and the powerful Data Warehouse, how they differ from each other, and what you need to know to get started with both. Let’s explore where data finds its home!
The Application
A Hub for Transactions: Databases are mainly used for transactional purposes alongside software applications as a store for data. Imagine an application that needs particular information like a user’s zip code while processing a transaction or it generates specific information like a user’s email which needs to be stored. An API is used for the interaction between the database and the app.
Insights for Analytics: In contrast, a Data warehouse is used for analytical purposes. Rather than feeding an application, the data here is used for generating reports or predictive models. Unlike a database, you won’t expect to get data for a real-time application. So no APIs will be hitting these, what will be hitting these are SQL queries to fetch data for analytics.
State of data
Present and Performance: Databases usually only house current or latest data, beyond this little to no historical data is stored. This keeps the amount of data comparatively less making it easier to yield real-time results keeping the focus on performance. For example, if a user relocates, you’d typically only store their latest zip code here.
Bridging Past and Present: All the history not stored in a database flows into the data warehouse, which will have both current and historical data. The volume of data here isn’t an issue, especially given the storage costs for cloud solutions. On the contrary, you do need more data for better analytics. For seamless data continuity, the database converses with the data warehouse, ensuring historical data is seamlessly woven.
Data Sources
Early Data Landing: A database plays a pivotal role as the initial touchpoint for incoming data. Directly engaging with the source of data, like a sales or marketing system, it becomes the first home for this freshly generated information.
The Data Odyssey: For a data warehouse, as it accumulates history, the main source of data is the database. However, if your data is purely factual in nature, for example, purchases on an e-commerce platform, it could skip the database altogether, given the volume and its strategic use. Here, the data warehouse establishes a direct rapport with the source. It’s important to note that not all organizational data needs to reside in the database, but the data warehouse eventually aspires to encompass it all.
Type of (SQL) Queries
Short and sweet: While interacting with a database, your SQL queries are designed to be short and sweet and do a very specific task like an INSERT or a LOOKUP. Don’t expect to use your fancy SQL skills here as this data is mainly meant to be used by applications. Queries here should be limited to basic SELECTs and UPDATEs. An example could be where your application is looking for a user’s zipcode to verify their purchase:
SELECT zipcode FROM user_profile WHERE user_id = <passed from app>
SQL’s wild west: In the data warehouse world things get exciting and complicated. Here the aim is not to merely fetch data but to transform or aggregate it to make it useful. This is where SQL shows its real strength for manipulating data. Expect things like dealing with multiple tables, WINDOW functions, and complex aggregations. An example could be getting the average of the user’s last five purchases:
SELECT user_id,
AVG(purchase_amount) OVER(
PARTITION BY user_id
ORDER BY purchase_date DESC
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
) AS avg_last_5_purchases
FROM sales
Performance Optimization
Accelerating Responsiveness: To get the best performance out of a database you do things like indexing on a particular column which makes the lookup queries faster for that column like a well-organized library guiding you to your desired book. This optimization ensures speedy lookups that are crucial for applications to deliver seamless user experiences. Slow wait times? Not in the database world.
Divide and conquer: For a data warehouse, where the amount of data is way more, you think about things like partitions. Think about dividing an ocean into manageable pools for specific swimmers, allowing queries to focus solely on the necessary data, boosting efficiency. Other query optimization techniques like avoiding correlated queries and paying attention to JOINs ensure smooth navigation in this universe.
Data Quality
Precision is Paramount: Quality of data is a bigger concern in a database, let me explain. If you recollect from the sample queries, databases often fetch very particular data. The results here drive real-time decisions and customer interactions. For example, Facebook misjudging a birthday may cost you friends. The stakes are high, and each detail counts.
The Big Picture: On the other hand, data in a data warehouse is usually aggregated hence the accuracy of individual records isn’t a big concern here. You’re rarely going to look at individual data points here nor should you. Think of these as paintings with broad strokes — accuracy in every individual pixel is less crucial than the vividness of the whole canvas. That being said you still want this to be close to the truth for overall data hygiene and to not misinterpret trends.
Support effort
Unwavering Uptime: Support needed to deliver your data is critical when it comes to databases. As applications can query data anytime, even a flicker of downtime could disrupt operations. Picture it as a live TV broadcast; a momentary blackout spoils the experience. Thus, ensuring zero to minimal downtime is paramount in the database world.
Timing is key: Support for the data warehouse takes a different route. While real-time demands ease up, timing still matters. Imagine running a report for an executive, only to find the latest data missing due to downtime. However, when the world sleeps, the data warehouse maintenance matters less, as no one’s watching. The focus shifts from constant uptime to strategic timing.
Users and technologies
Engineer’s Realm: In the database world, software engineers and database admins are the stars of the show. The former use it as a part of their applications while the latter maintain it. The tech landscape revolves around SQL and NoSQL solutions like MySQL and MongoDB. The data pipelines here are often Java-based micro-services that talk to the application and the database.
Home for Insights: Anyone in the org who wants to interact with data should be pointed to the data warehouse. The audience includes data engineers, scientists, analysts, or business users. Cloud-based services employing a Big Data paradigm — think BigQuery, Snowflake — form the infrastructure backbone. Data pipelines, facilitated by tools like Airflow and dbt, orchestrate the flow, utilizing the colossal potential of data.
TL;DR
Metric Database(DB) Data Warehouse(DW) Application Real-time data for applications Analytical hub for reports and predictive models Interaction API calls SQL queries Data State Latest data, minimal history Current and historical data aggregation Data Source Directly interacts with true data source Relies on databases or interacts directly with source SQL Queries Short, specific queries for applications Complex queries involving data transformation SQL Operations Simple SELECT, INSERT, and DDL Aggregations, Window functions, Multi-JOINs Performance Indexing for faster lookups Partitions and query optimization Data Quality Precision for real-time decisions and app use Aggregate accuracy, overall data hygiene Support Zero to minimal downtime for apps Timing-sensitive support; less real-time pressure Users Engineers, DB Admins Data engineer, scientists, analysts; Business users Technologies MySQL, MongoDB, micro-services BigQuery, Snowflake, Airflow, dbt
Let me know how you’d approach it. If you found this helpful, share it. If you’re into this, you can find me on X @abhishek27297 where I talk data.
If you don’t want to miss my next post, consider subscribing here, it’s free and you’ll be notified when I post my next blog.