There is common misconception that troubles most developers using PostgreSQL: tune VACUUM or run VACUUM, and your database will stay healthy. Dead tuples will get cleaned up. Transaction IDs recycled. Space reclaimed. Your database will live happily ever after.
But there are couple of dirty "secrets" people are not aware of. First of them being VACUUM is lying to you about your indexes.
The anatomy of storage
When you delete a row in PostgreSQL, it is just marked as a 'dead tuple'. Invisible for new transactions but still physically present. Only when all transactions referencing the row are finished, VACUUM can come along and actually remove them - reclamining the space in the heap (table) space.
To understand why this matters differently for tables versus indexes, you need to picture how PostgreSQL actually stores your data.
Your table data lives in the heap - a collection of 8 KB pages where rows are stored wherever they fit. There's no inherent order. When you INSERT a row, PostgreSQL finds a page with enough free space and slots the row in. Delete a row, and there's a gap. Insert another, and it might fill that gap - or not - they might fit somewhere else entirely.
This is why SELECT * FROM users without an ORDER BY can return rows in order initially, and after some updates in seemingly random order, and that order can change over time. The heap is like Tetris. Rows drop into whatever space is available, leaving gaps when deleted.
When VACUUM runs, it removes those dead tuples and compacts the remaining rows within each page. If an entire page becomes empty, PostgreSQL can reclaim it entirely.
And while indexes are on surface the same collection of 8KB pages, they are different. A B-tree index must maintain sorted order - that's the whole point of their existence and the reason why WHERE id = 12345 is so fast. PostgreSQL can binary-search down the tree instead of scanning every possible row. You can learn more about the fundamentals of B-Tree Indexes and what makes them fast.
But if the design of the indexes is what makes them fast, it's also their biggest responsibility. While PostgreSQL can fit rows into whatever space is available, it can't move the entries in index pages to fit as much as possible.
... continue reading