
Beyond indexes: How open table formats optimize query performance


My SELECT * FROM Users WHERE LastName = 'Martinez' will use a lookup operation, which consists of two steps:

1. A seek on the secondary index B-tree to obtain the clustering key (PK) of the matching Martinez row.
2. A seek on the clustered index (the table itself) using that clustering key (PK) to retrieve the full row.
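The article is about SQL Server, but the same two-step lookup can be sketched with SQLite and a hypothetical Users table (in SQLite, the rowid table plays the role of the clustered index, and the table/index names are made up for the illustration):

```python
import sqlite3

# Illustrative only: SQLite rather than SQL Server, but the lookup shape
# is the same. The rowid table stands in for the clustered index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, LastName TEXT, Country TEXT)")
con.execute("CREATE INDEX ix_users_lastname ON Users (LastName)")
con.execute("INSERT INTO Users (LastName, Country) VALUES ('Martinez', 'Spain'), ('Smith', 'UK')")

# SQLite reports the index seek (step 1) plus the row lookup by key (step 2)
# as a single SEARCH ... USING INDEX line in its plan output.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Users WHERE LastName = 'Martinez'"
).fetchall()
for row in plan:
    print(row[-1])
```

The plan line names the secondary index, confirming the optimizer chose the lookup strategy rather than a full scan.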

Selectivity is important here. If the secondary index scan returns a small number of rows, this random IO of multiple index seeks is worth it, but if the scan returns a large number of rows, it might actually be quicker to perform a clustered index scan (full table scan).

Imagine the table had 1 million rows and the distribution of the Country column matched the figure above, that is to say roughly 800,000 rows had 'Spain' in the Country column. 800,000 lookups would be far less efficient than one table scan. The selectivity threshold at which secondary-index lookups beat a clustered index scan also depends on whether the database is running on an HDD or an SSD, with SSDs tolerating far more random IO than HDDs. It is up to the query optimizer to decide.
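A back-of-envelope cost model makes the gap concrete. All the numbers below except the row counts are hypothetical (page density, B-tree depth), chosen only to show the shape of the trade-off:

```python
# Back-of-envelope cost model in "page reads". Row counts come from the
# example above; rows_per_page and btree_depth are made-up assumptions.
total_rows = 1_000_000
matching_rows = 800_000   # rows where Country = 'Spain'
rows_per_page = 100       # hypothetical clustered-index leaf density
btree_depth = 3           # pages touched per seek (root -> leaf)

# Lookup plan: a secondary-index seek plus a clustered-index seek per match.
lookup_cost = matching_rows * (btree_depth + btree_depth)

# Scan plan: read every page of the clustered index once, sequentially.
scan_cost = total_rows // rows_per_page

print(lookup_cost, scan_cost)
```

Under these assumptions the lookup plan touches hundreds of times more pages than the scan, and nearly all of them are random reads rather than sequential ones.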

Secondary indexes are ideal for highly selective queries, and also for queries that can be serviced by the secondary index alone. For example, if I only wanted to perform a count aggregation on Country, the database could scan the Country secondary index, counting the number of rows for each country, and never touch the much larger clustered index. We've successfully reduced the amount of IO needed to service the query and therefore sped it up.
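This index-only aggregation can be demonstrated in SQLite too (again a hypothetical schema; the wide Bio column stands in for the bulk of the row that the query never has to read):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, Country TEXT, Bio TEXT)")
con.execute("CREATE INDEX ix_users_country ON Users (Country)")
# Bio is a wide column the aggregation never needs to touch.
con.executemany("INSERT INTO Users (Country, Bio) VALUES (?, ?)",
                [("Spain", "x" * 100)] * 8 + [("UK", "y" * 100)] * 2)
con.commit()

# The count can be answered entirely from the Country index.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT Country, COUNT(*) FROM Users GROUP BY Country"
).fetchall()
for row in plan:
    print(row[-1])
```

The plan scans the small Country index instead of the table, which is exactly the IO saving described above.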

But we can also add "covering columns" to secondary indexes. Imagine the query SELECT Nationality FROM Users WHERE Country = 'Spain'. If the Users table is really wide and really large, then we can speed up the query by adding the Nationality column to the Country secondary index as a covering column. The query optimizer will see that the Country secondary index includes everything the query needs and decide not to touch the clustered index at all. Covering columns make the index larger, which is bad, but they can remove the need to seek the clustered index, which is good.
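In SQL Server, covering columns are added with the INCLUDE clause of CREATE INDEX. SQLite has no INCLUDE, so in the sketch below a composite index on (Country, Nationality) stands in for a covering column; the planner even labels the result a COVERING INDEX:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Users (
    Id INTEGER PRIMARY KEY, Country TEXT, Nationality TEXT, Bio TEXT)""")
# No INCLUDE clause in SQLite: a composite index approximates a Country
# index with Nationality as a covering column.
con.execute("CREATE INDEX ix_users_country_nat ON Users (Country, Nationality)")
con.execute("INSERT INTO Users (Country, Nationality, Bio) "
            "VALUES ('Spain', 'Spanish', 'a very long bio')")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT Nationality FROM Users WHERE Country = 'Spain'"
).fetchall()
print(plan[0][-1])
```

Because every column the query touches lives in the index, the plan never visits the base table at all.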

Secondary indexes trade space, write cost and maintenance overhead for read performance. Each write must update the secondary indexes synchronously, and sometimes these indexes become fragmented and need rebuilding. For an OLTP database this cost can be worth it, and the right secondary index or indexes can help support diverse queries on the same table.

Column statistics

Just because I created a secondary index doesn't mean the query optimizer will use it. Likewise, just because a secondary index exists doesn't mean the query optimizer should choose it. I have seen SQL Server get hammered because of incorrect secondary index usage. The optimizer believed the secondary index would help, so it chose the lookup strategy, but it turned out there were tens of thousands of matching rows (instead of the handful predicted), leading to tens of thousands of individual clustered index seeks. This can balloon into millions of seeks if the Users table appears in several joins.

Table statistics, such as cardinalities and histograms, are very useful here to help the optimizer decide which strategy to use. For example, if every row contains 'Spain' in the Country column, the cardinality estimate will tell the optimizer not to use the secondary index for a SELECT * FROM Users WHERE Country = 'Spain' query, as the selectivity is not low enough. Best to do a table scan.
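SQLite offers a small window into this machinery: ANALYZE gathers per-index cardinalities into the sqlite_stat1 table, which the planner then consults when costing plans. A sketch with a deliberately skewed (hypothetical) table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, Country TEXT)")
con.execute("CREATE INDEX ix_users_country ON Users (Country)")
# Every row is 'Spain': the predicate below matches the whole table.
con.executemany("INSERT INTO Users (Country) VALUES (?)", [("Spain",)] * 1000)

con.execute("ANALYZE")  # populates sqlite_stat1 with per-index cardinalities
stats = con.execute(
    "SELECT idx, stat FROM sqlite_stat1 WHERE tbl = 'Users'"
).fetchall()
print(stats)  # stat is "rows avg-rows-per-Country-value", here both 1000

# With statistics available, the plan reflects the (lack of) selectivity.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Users WHERE Country = 'Spain'"
).fetchall()
print(plan[0][-1])
```

The stat row says the average Country value matches every row in the table, which is precisely the cardinality signal that steers a cost-based optimizer away from the lookup strategy.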
