Skip to content
Tech News
← Back to articles

Good CTE, Bad CTE

read original more articles
Why This Matters

This article highlights the evolving role of Common Table Expressions (CTEs) in SQL query optimization, emphasizing how recent changes in PostgreSQL 12 have improved their performance by allowing inlining and better integration with query planning. Understanding these nuances is crucial for developers and database administrators aiming to write efficient, high-performing queries in modern SQL environments.

Key Takeaways

The Common Table Expression, or CTE, is often the first feature developers reach for beyond basic SQL, and often the only one. You write a subquery after WITH , give it a name, and use it in the rest of your query. It only exists for the duration of that query.

But the popularity of CTEs usually has less to do with modernizing code and more to do with the promise of imperative logic. For many, CTE acts as an easy to understand remedy for 'scary queries' and way how to force execution order on the database. The way how many write queries is as if they tell optimizer "first do this, then do that".

This creates a problem. CTEs handle query decomposition, recursion and multi statement DDLs. Planner treats them differently depending how you write and use them though. For long time (prior PostgreSQL 12) CTEs acted as optimization fence. The planner couldn't push predicates into them, couldn't use indexes on the underlying tables. Couldn't do anything that materialize them and scan through the result.

PostgreSQL 12 changed this. CTEs now get inlined, materialized, or something in between, depending on how you write them.

Sample schema

We will use the same schema as in the article PostgreSQL Statistics: Why queries run slow.

CREATE TABLE customers ( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY , name text NOT NULL ); CREATE TABLE orders ( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY , customer_id integer NOT NULL REFERENCES customers(id), amount numeric ( 10 , 2 ) NOT NULL , status text NOT NULL DEFAULT 'pending' , note text , created_at date NOT NULL DEFAULT CURRENT_DATE ); CREATE TABLE orders_archive ( LIKE orders INCLUDING ALL EXCLUDING IDENTITY ); INSERT INTO customers ( name ) SELECT 'Customer ' || i FROM generate_series ( 1 , 2000 ) AS i; INSERT INTO orders (customer_id, amount, status , note, created_at) SELECT (random() * 1999 + 1 ):: int , (random() * 500 + 5 ):: numeric ( 10 , 2 ), ( ARRAY ['pending','shipped','delivered','cancelled'])[floor(random()*4+1)::int], CASE WHEN random() < 0 . 3 THEN 'Some note text here for padding' ELSE NULL END , '2022-01-01' :: date + (random() * 1095 ):: int FROM generate_series ( 1 , 100000 ); ANALYZE customers; ANALYZE orders;

For recursive examples later on we'll also need an employees table with a self-referencing hierarchy:

CREATE TABLE employees ( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY , name text NOT NULL , manager_id integer REFERENCES employees(id), department text NOT NULL ); INSERT INTO employees ( name , manager_id, department) VALUES ( 'Alice' , NULL , 'Engineering' ), ( 'Bob' , 1 , 'Engineering' ), ( 'Charlie' , 1 , 'Engineering' ), ( 'Diana' , 2 , 'Engineering' ), ( 'Eve' , 2 , 'Engineering' ), ( 'Frank' , 3 , 'Sales' ), ( 'Grace' , 3 , 'Sales' ), ( 'Hank' , 6 , 'Sales' ), ( 'Ivy' , 6 , 'Sales' ); ANALYZE employees;

The optimization fence era (pre-PG 12)

... continue reading