My SELECT * FROM Users WHERE LastName = ‘Martinez’ will use a lookup operation, which is:

1. A seek on the secondary index B-tree to obtain the clustering key (PK) of the matching Martinez row.
2. A seek on the clustered index (the table itself) using that clustering key (PK) to retrieve the full row.

Selectivity is important here. If the secondary index scan returns a small number of rows, the random IO of multiple index seeks is worth it, but if the scan returns a large number of rows, it might actually be quicker to perform a clustered index scan (a full table scan). Imagine the table had 1 million rows and the distribution of the Country column matched the figure above, that is to say roughly 800,000 rows had ‘Spain’ in the Country column. 800,000 lookups would be far less efficient than one table scan. The selectivity threshold at which lookups via a secondary index beat a clustered index scan also depends on whether the database is running on an HDD or an SSD, with SSDs tolerating more random IO than HDDs. It is up to the query optimizer to decide.

Secondary indexes are ideal for high-selectivity queries, but also for queries that can be serviced by the secondary index alone. For example, if I only wanted to perform a count aggregation on Country, then the database could scan the Country secondary index, counting the number of rows with each country, and never need to touch the much larger clustered index. We’ve successfully reduced the amount of IO needed to service the query and therefore sped it up.

But we can also add “covering columns” to secondary indexes. Imagine the query SELECT Nationality FROM Users WHERE Country = ‘Spain’. If the Users table is really wide and really large, then we can speed up the query by adding the Nationality column to the Country secondary index as a covering column. The query optimizer will see that the Country secondary index includes everything it needs and decide not to touch the clustered index at all. Covering columns make the index larger, which is bad, but can remove the need to seek the clustered index, which is good.

These secondary indexes trade read performance for space, write cost and maintenance overhead. Each write must update the secondary indexes synchronously, and sometimes these indexes get fragmented and need rebuilding. For an OLTP database, this cost can be worth it, and the right secondary index or indexes on a table can help support diverse queries on the same table.

Column statistics

Just because I created a secondary index doesn’t mean the query optimizer will use it. Likewise, just because a secondary index exists doesn’t mean the query optimizer should choose it either. I have seen SQL Server get hammered because of incorrect secondary index usage. The optimizer believed the secondary index would help, so it chose the “lookup” strategy, but it turned out that there were tens of thousands of matching rows (instead of the handful predicted), leading to tens of thousands of individual clustered index seeks. This can turn into millions of seeks if the Users table appears in several joins.

Table statistics, such as cardinalities and histograms, are very useful here to help the optimizer decide which strategy to use. For example, if every row contains ‘Spain’ for Country, the cardinality estimate will tell the optimizer not to use the secondary index for a SELECT * FROM Users WHERE Country = ‘Spain’ query, as the predicate is not selective enough. Best to do a table scan.
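As a rough sketch of how this might look in SQL Server (the index name IX_Users_Country and the dbo schema are hypothetical, assumed from the examples above):

```sql
-- Secondary (non-clustered) index on Country, with Nationality as a covering
-- (included) column, so SELECT Nationality FROM Users WHERE Country = 'Spain'
-- can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_Users_Country
    ON dbo.Users (Country)
    INCLUDE (Nationality);

-- Inspect the statistics the optimizer keeps for this index: row counts,
-- densities and the per-value histogram it uses for cardinality estimates.
DBCC SHOW_STATISTICS ('dbo.Users', IX_Users_Country);
```

If the histogram shows that almost every row has ‘Spain’ in Country, the optimizer will estimate a large row count for that predicate and prefer a clustered index scan over the lookup strategy.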
Histograms are useful when the value distribution is skewed, such that some values have large numbers of rows while others have relatively few.

RDBMS summary

There is, of course, far more to write about this topic, many more nuances of indexing, plus alternatives such as heap-based tables, other structures such as LSM trees, and alternative data models such as documents and graphs. But for the purpose of this post, we keep our mental model to the following:

- Workload: optimized for point lookups, short ranges, joins of a small number of rows, and frequent writes/updates. Queries often touch a handful of rows, not millions.
- Data organization: tables are row-based; each page holds complete rows, making single-row access efficient.
- Clustered index: the table itself is a B-tree sorted by the primary key; lookups and joins by key are fast and predictable (O(log n)).
- Secondary (non-clustered) index: separate B-trees that map other columns to rows in the clustered index. Useful for high-selectivity predicates and covering indexes, but costly to maintain.
- Statistics: cardinalities and histograms guide the optimizer to choose between using an index or scanning the table.
- Trade-off: indexes cut read I/O but add write and maintenance overhead. In OLTP this is worth it, since selective access dominates.

Perhaps the main point for this post is that secondary indexes allow one table to support diverse queries. This is something that we cannot easily say of the open table formats, as we’ll see next.

Data organization, indexes and auxiliary structures in the open table formats

Analytical systems have a completely different workload from OLTP. Instead of reading a handful of rows or updating individual records, queries in the warehouse or lakehouse typically scan millions or even billions of rows, aggregating or joining them to answer analytical questions. Where row-based storage in a sorted B-tree structure makes sense for OLTP, it is not efficient for analytical workloads. So instead, data is stored in a columnar format in large contiguous blocks with a looser global structure.

Secondary indexes aren’t efficient either. Jumping from index to row for millions of matches would already be massively inefficient on a file system, but given that open table formats (OTFs) are hosted on object storage, secondary indexing becomes even less useful. So the analytical workload makes row-based B-trees and secondary indexes the wrong tool for the job.

Columnar systems flipped the model: store data in columns rather than rows, grouped into large contiguous blocks. Instead of reducing IO via B-tree traversal by primary key, or pointer-chasing through secondary indexes, IO reduction comes from effective data skipping during table scans. The columnar storage allows whole columns to be skipped and, depending on a few other factors, entire files can be skipped as well. Most data skipping happens during the planning phase, where the execution engine decides which files need to be scanned, and is often referred to as pruning. There are different types of pruning, as well as different levels of pruning effectiveness, which all depend on how the data is organized, what metadata exists and what other auxiliary search optimization data structures might exist.
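To make this concrete, here is a hedged Spark SQL sketch against a hypothetical Iceberg table; the catalog, table and column names are illustrative and not from this post.

```sql
-- Hypothetical Iceberg table, partitioned by day; all names are illustrative.
CREATE TABLE lake.db.events (
    event_time TIMESTAMP,
    country    STRING,
    amount     DOUBLE
)
USING iceberg
PARTITIONED BY (days(event_time));

-- During planning, the engine uses the table metadata to prune:
--   * partition pruning: only data files belonging to the 2024-06-01
--     partition are considered;
--   * file-level min/max column stats: files whose country range cannot
--     contain 'Spain' are skipped without being read.
SELECT country, SUM(amount) AS total
FROM lake.db.events
WHERE event_time >= TIMESTAMP '2024-06-01'
  AND event_time <  TIMESTAMP '2024-06-02'
  AND country = 'Spain'
GROUP BY country;
```

Both forms of skipping happen at planning time, before any data file is opened, which is the pruning described above.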
Let’s look at the main ingredients for effective pruning: the table structure itself and the auxiliary structures.

The table structure: metadata + data files

Let’s focus on Iceberg for now to keep ourselves grounded in a specific table format design. I wrote an in-depth description of Iceberg internals based on v2, which is still valid today, if you want to dive in deep. To keep things simple, we’ll look at the organization of an Iceberg table at one point in time.