A SQL Heuristic: Ors Are Expensive

A SQL Heuristic: ORs Are Expensive Query planning is hard. Sometimes. Queries often have more than one filter (using an and clause). But the developer can often end up in situations requiring an or clause: select count(*) from application where submitter_id = :user_id or reviewer_id = :user_id; But this is slow! With 1,000,000 applications and 1000 users uniformly distributed on both columns, it takes over 100ms.1 If we rewrite it with only and s select ( select count(*) from application a where a.reviewer_id = :user_id ) + ( select count(*) from application a where a.submitter_id = :user_id ) - ( select count(*) from application a where a.submitter_id = :user_id and a.reviewer_id = :user_id ); This takes less than 1ms; Over 100 times faster! 12 This is surprising — we have indexes on the filtered columns. But why? First, we need to develop some intuition about how and when indexes are used. Basic Query Planning Intuition Say you have a table with two columns and an index on each column. create table example_table ( text_column varchar(2) not null, num_column int8 not null ); Doing a simple lookup via a column is similar in time complexity to doing a lookup with a BTreeMap or BTreeMap . And if we use a compound index — (text_column, num_column) , it's like doing a lookup with BTreeMap> . So filtering with an and (or filtering by the first, then sorting by the latter) is natural with a compound index. Now say we don't have a compound index, only individual indexes, then we have to be clever. We can use database statistics. If we want to check where text_column = 'ES' and num_column = 1 , there's two options: lookup 'ES' , then filter the results by the num_column lookup 1 , then filter by text_column A couple problems to note here: If there's many 'ES' s relative to the number of 1 s, or vice-versa, and we pick the wrong plan, we're reading more data from disk than we need. Remember, indexes are stored in chunks or pages on disk and that i/o is usually the bottleneck. s relative to the number of s, or vice-versa, and we pick the wrong plan, we're reading more data from disk than we need. Remember, indexes are stored in chunks or pages on disk and that i/o is usually the bottleneck. CPU and memory matter too! Loading large amounts of data into memory and processing it can be expensive. This gets worse as we start joining to other tables. To help us figure out the order, statistics can tell us how often each value shows up, helping us pick between the two options. So now, our query plan will more strongly depend on the relative distribution of the two columns, not only our schema and scale.3 Say we have a lot of 1 s, but not many 'ES' s. We look at our statistics, which tell us that 1 is the most common value for the num_column . But we don't see 'ES' in our list of common values for the text_column , based on that column's statistics.45 All we need now is an index on the more specific side. Why ORs are Expensive Doing this with an or is a bit trickier. Assume we're doing text_column = 'ES' or num_column = 1 . Two options: Do each filter separately, then merge them, effectively turning this into a union distinct . . Iterate through the all the data (i.e. a sequential scan), filtering the rows in memory. Statistics can still help when 'ES' and 1 are both rare. We can do a lookup in the index (a page read) and read those rows (another page read), which is better than scanning the entire table (n page reads). Doing this twice for each filter and merging them with a Bitwise Or, we avoid the sequential scan, though it's still expensive — see our very first example. Importantly, this idea of merging with Bitwise Or can be brittle. If we filter all the data with an extra condition (an and or a sort on a third column), then we need a compound index for each column (with that new column second). Otherwise, we'll be iterating through everything with a sequential scan. Try rewriting the query as a union (i.e. the or on the outside). (x or y) and z = (x and z) or (y and z) .6 Also, an and (or intersection) decreases the amount of data, whereas an or (a union) will increase the amount of data returned, increasing the cost.7 On a practical note, the most common operations in a relational database tend to be individual lookups or and s. E.g., Faceted search in an e-commerce site. Any good query planner need focus on optimizations for and s. Practical Tips There are two common issues in schema design that I see often: A table with multiple columns with the same type and constraints, usually a foreign key to the same table Representing sum types as different tables Both of which are variations on the same concept: put the same data together (even if you have to give up some constraints). Just Add Another Table Going back to the first example: create table application ( application_id int8 not null, submitter_id int8 not null, reviewer_id int8 not null ); To query all application s that attached to a given user, we can create a secondary table to hold those users in a one-to-many relationship: create table application_user ( user_id int8 not null, application_id int8 not null, user_type enum ('submitter', 'reviewer') not null ); We can rewrite our query to follow the indexes like this: :user_id -> application_user -> user . select * from application a join application_user au using (application_id) where au.user_id = :user_id This is linear, making it very easy for the query planner. Now it's faster than the Bitwise Or that was used in our first example. Inheritance and Keyset Pagination Say we have two tables with a shared set of columns: create table post ( user_id int8 not null, title text not null, body text not null ); create table comment ( user_id int8 not null, body text not null, parent_id int8 not null ); If we want to find everything written by a user, the best we can do is rewrite it with a proper union . But if we restructure our tables, we can "inherit" or "extend" from a base table, having one index for each shared column. create table writing ( writing_id int8 not null, user_id int8 not null, body text not null ); create table post ( writing_id int8 not null, title text not null, foreign key writing_id references writing ); create table comment ( writing_id int8 not null, parent_id int8 not null, foreign key writing_id references writing ); This isn't strictly necessary for all use cases; we can still do a union to get all the matches. For instance, keyset pagination: select * from ( select p.post_id, null comment_id, p.title, p.body, null parent_id from post p where p.user_id = :user_id and p.post_id > :post_id_start order by p.post_id limit :page_size union select null post_id, c.comment_id, null, c.body, c.parent_id from comment c where c.user_id = :user_id and c.comment_id > :comment_id_start order by c.comment_id limit :page_size ) keyset order by keyset.post_id, keyset.comment_id limit :page_size However, this is way more complex and easy to get subtly wrong. So, I always opt for the extension tables. The Moral Schema design is all about access patterns: What searches/multi-row scans are being done Read-heavy vs write-heavy Same tables, could there be contention? etc. If you're never going to do an or , there's no need to worry about this! Alas, requirements change... Addendum Questions:

A SQL Heuristic: Ors Are Expensive

Share this article

Related Articles