Today I’ll be going over the steps I follow when trying to optimize a SQL query. If you work with data in any capacity, chances are you’re writing SQL queries to retrieve or manipulate data somehow. So why should you optimize your queries? The answer is simple: performance. If your query is meant for a production environment like a data pipeline or an enterprise dashboard, optimization is crucial.
Running an unoptimized query in production can lead to multiple issues: your query may block resources needed by other queries, or it may simply take too long to execute, either of which affects the users relying on the data. Beyond that, unoptimized queries also burn a lot of compute, driving up costs. Hence, to avoid a big cloud bill amidst the terrors of a looming recession, it’s time to optimize your queries. So let’s jump in.
Make it Readable
This first step has nothing to do with performance but has everything to do with your and others’ sanity while working with the query. Just leaving comments about what the query is doing or a link to the documentation isn’t enough. So let’s talk about what it means to make a query readable.
Use Common Table Expressions (CTEs): CTEs are a low-hanging fruit I’m a fan of; they inherently make a query readable by breaking it into chunks. The best way to split a query is along its business logic, with each CTE handling one logical step. So no more nested sub-queries😝.
Format your query: Another easy win is to use tabs or spaces to section out the query and make it easy to scan; nobody likes to read a one-liner query🙄.
CAPITALIZE SQL KEYWORDS: This one is a little extra but I still recommend it. This makes it easier to spot tables, columns, and filters at a glance.
Maintain functionality: Something to keep in mind when making a query readable, especially when refactoring, is to ensure the functionality is intact. You don’t want a readable query that doesn’t do what it’s supposed to do.
This step might take time, especially with larger queries, but it’s an investment that pays off in the long run. Here is an example of a well-formatted, easy-to-read query.
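Something along these lines, a minimal sketch using made-up orders and customers tables (Postgres-flavored syntax):

```sql
-- Hypothetical tables: orders(order_id, customer_id, order_date, order_total)
--                      customers(customer_id, region, is_active)
WITH active_customers AS (
    -- Business step 1: limit to customers we actually report on
    SELECT
        customer_id,
        region
    FROM customers
    WHERE is_active = TRUE
),

recent_orders AS (
    -- Business step 2: only keep this year's orders
    SELECT
        customer_id,
        order_total
    FROM orders
    WHERE order_date >= DATE '2024-01-01'
)

SELECT
    ac.region,
    SUM(ro.order_total) AS total_revenue
FROM active_customers AS ac
JOIN recent_orders AS ro
    ON ac.customer_id = ro.customer_id
GROUP BY
    ac.region;
```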
Avoid redundant table scans
Once we have our query all nice and sectioned out using CTEs, it’s time to look at which tables are used and how frequently. Table scans can really slow down a query, especially if you’re working with large tables or with many of them.
A good approach is to use a particular table at most one time in your query. This avoids the redundancy of going to the same table multiple times to get different data. It gets especially challenging when you need data at different levels of granularity from the same table, like weekly and monthly data. To handle this, I recommend grabbing data at the finest grain and then aggregating it up, as in the sketch below.
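Here’s a hedged sketch of that idea, assuming a made-up sales(sale_date, amount) table and Postgres-style date functions: the table is scanned once at the daily grain, and the weekly and monthly numbers are rolled up from that single CTE.

```sql
WITH daily_sales AS (
    -- Scan the sales table exactly once, at the finest grain we need
    SELECT
        sale_date,
        SUM(amount) AS daily_amount
    FROM sales
    WHERE sale_date >= DATE '2024-01-01'
    GROUP BY sale_date
),

weekly_sales AS (
    SELECT
        DATE_TRUNC('week', sale_date) AS period_start,
        SUM(daily_amount)             AS amount
    FROM daily_sales
    GROUP BY DATE_TRUNC('week', sale_date)
),

monthly_sales AS (
    SELECT
        DATE_TRUNC('month', sale_date) AS period_start,
        SUM(daily_amount)              AS amount
    FROM daily_sales
    GROUP BY DATE_TRUNC('month', sale_date)
)

SELECT 'weekly' AS grain, period_start, amount FROM weekly_sales
UNION ALL
SELECT 'monthly' AS grain, period_start, amount FROM monthly_sales;
```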
Another thing to be cautious of is full table scans (reading the entire table), which can be resource-intensive. The next tip will cover how to avoid these.
Filters, partitions & indexes
Filters: The major principle here is that your query should only see the data it absolutely NEEDS to see and no more; every other row or column should be excluded using filters. I recommend using filters in every CTE to limit the data that flows through the query later on (there’s a sketch after this list).
Partitions: Whereas filters limit the data your query works with, partitions take it a step further by dividing your data into chunks, making it easier to deal with. If a table is partitioned, you MUST use the partition column in your query. If you aren’t sure how narrow you can safely go, err on the side of caution and scan a more generous range, e.g., months instead of weeks, or years instead of months.
Indexes: While more common in the OLTP realm, indexes can also be leveraged to speed up search queries. If you know your users frequently filter on a particular column, adding an index on it might be worthwhile.
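To make this concrete, here’s a small sketch assuming a hypothetical events table partitioned on event_date, with frequent lookups on user_id (Postgres-flavored; warehouses differ on indexing):

```sql
-- Filter on the partition column so only the needed partitions are scanned,
-- and on everything else the query doesn't actually need.
SELECT
    user_id,
    COUNT(*) AS event_count
FROM events
WHERE event_date >= DATE '2024-01-01'   -- partition pruning happens here
  AND event_type = 'purchase'           -- regular filter to drop unneeded rows
GROUP BY user_id;

-- If users frequently filter on user_id (more of an OLTP concern),
-- an index on that column can speed up those lookups.
CREATE INDEX idx_events_user_id ON events (user_id);
```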
Revisit the logic
What if the business logic itself requires a very broad partition or, worse, requires you to use all available data? Let’s take an example: you’re trying to find the last time a customer opened a marketing email, and you have email clicks data that goes back to 2010. A customer might have opened an email a week ago, a year ago, or a decade ago; there are no limits! How do you optimize the query in this case?
Well, at this point the question you really need to ask is “Is this really needed?” and make a tradeoff between data accuracy and performance. In our example, that could mean only looking at the last 6 months of data. Is this truly accurate? No. But does it get the job done realistically? Yes. The decision depends on your specific use case, but it’s a discussion worth having.
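As a rough sketch, assuming a made-up email_clicks(customer_id, opened_at) table, the trade-off could look like this:

```sql
-- Instead of scanning opens all the way back to 2010, accept a small accuracy
-- hit and only look at the last 6 months.
SELECT
    customer_id,
    MAX(opened_at) AS last_opened_at
FROM email_clicks
WHERE opened_at >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY customer_id;
```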
Use the best data source
When revisiting the business logic, another great question to ask is “Am I using the best source (table) for this data?”
Example 1: If your query is pulling a single attribute, like a customer’s address, from a large denormalized table, you might consider switching to a smaller, normalized table that contains only the relevant information.
Example 2: Conversely, if your query needs multiple customer attributes, sticking with a denormalized table might be better, as it avoids the need for complex JOINs across several normalized tables.
In both scenarios, although the data remains the same, picking the better source can greatly impact performance.
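To illustrate, here’s a hedged sketch with made-up tables: customer_wide as the large denormalized table and customer_addresses as the small normalized one.

```sql
-- Example 1: a single attribute -> the small, normalized table is enough.
SELECT customer_id, address
FROM customer_addresses
WHERE customer_id = 42;

-- Example 2: many attributes -> the denormalized table avoids a pile of JOINs.
SELECT customer_id, address, email, loyalty_tier, signup_date
FROM customer_wide
WHERE customer_id = 42;
```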
Break it down
Now if you do all the above and still run into performance issues, the problem may be that you’re trying to do too much in a single query. There’s rarely a constraint that forces you to get to the outcome in a single query. It’s time to break it down!
Section your query into smaller chunks, write intermediate results to staging tables, and then put everything together. With this approach, you reduce the load on the final query. Memory usage also drops, since intermediate results are now read from static tables rather than being calculated in memory.
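Here’s what that could look like, as a sketch with hypothetical staging and source tables (exact DDL varies by warehouse):

```sql
-- Step 1: materialize the heavy intermediate result once.
CREATE TABLE staging_customer_orders AS
SELECT
    customer_id,
    COUNT(*)         AS order_count,
    SUM(order_total) AS lifetime_value
FROM orders
GROUP BY customer_id;

-- Step 2: the final query reads a small, pre-aggregated table
-- instead of recomputing everything in memory.
SELECT
    c.region,
    AVG(s.lifetime_value) AS avg_lifetime_value
FROM staging_customer_orders AS s
JOIN customers AS c
    ON c.customer_id = s.customer_id
GROUP BY c.region;
```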
Addressing failures
Finally, I want to leave you with some tips if you’re optimizing a query as a result of a failure with an existing query. A common failure caused by an unoptimized query is “Resources exceeded during query execution”; in other words, your query is taking too long or doing too much. All the previously mentioned steps should help, and I’d start there.
The only thing I’ll add to those is to narrow down the cause of the issue. If you’ve used CTEs, try to figure out which one is taking the longest to run and is therefore making the whole query wait for it. This might be the weak link in your query, and addressing it may fix your problem.
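One low-tech way to do this, sketched here with a hypothetical CTE, is to run each CTE on its own and compare runtimes (or bytes scanned, or the output of EXPLAIN) to find the slow one:

```sql
WITH suspect_cte AS (
    -- Copy one CTE out of the big query and run it in isolation
    SELECT
        customer_id,
        SUM(order_total) AS lifetime_value
    FROM orders
    GROUP BY customer_id
)
SELECT *
FROM suspect_cte
LIMIT 100;   -- just enough output to confirm it runs and see how long it takes
```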
A common area that causes this problem is JOINs: they can significantly slow down a query, so revisit them to ensure they’re necessary and optimized. Another common culprit is window functions. A window that’s too narrow becomes a problem on large datasets, and one that’s too broad becomes a problem for complex operations. These are harder to resolve and may require further data restriction or a re-evaluation of the logic they’re trying to implement.
With all this knowledge you should now be a query-optimizing wizard. Check out the blog below where I talk about how to approach data issues.