Mar 11, 2026 · 1223 words · 6 minute read
My friend Henrietta Dombrovskaya pinged me on Telegram. Her production cluster had just been killed by the OOM killer after eating 2 TB of RAM. work_mem was set to 2 MB.
Something didn’t add up.
Hetty, like me, likes playing with monster hardware. 2 TB of RAM is not unusual in her world. But losing the whole cluster to a single query during peak operations is a very different kind of problem from a 3am outage. When the OOM killer strikes at the worst possible moment, you need answers fast.
One important detail: the memory log I’ll show below is not from the production incident. Hetty reproduced the behavior on a separate server to investigate. She stopped the query before the OOM killer struck that time. The production cluster was not so lucky.
I want to point this out right away: this is the kind of problem you solve faster with a good network than with a good search engine. Hetty is a brilliant Postgres expert. We puzzled through this together. I’m writing it up because you’ll run into it too, and because the behavior of Postgres memory management is genuinely surprising.
The tool that saved the day
Before we dig into the “why”, let me introduce you to a function I didn’t know existed until that conversation: pg_log_backend_memory_contexts.
Pass it a PID. Postgres will dump the full memory context tree of that backend into the logs. Every allocation. Every context. Sizes and chunk counts included.
select pg_log_backend_memory_contexts(299392);
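In practice you first need the PID of the backend you want to inspect. A sketch of one way to do that, using the standard pg_stat_activity view (the filter and ordering here are just illustrative, not from the original incident):

```sql
-- Find candidate backends worth inspecting: active sessions, oldest first.
-- pg_stat_activity and pg_log_backend_memory_contexts are standard Postgres;
-- the WHERE clause is just one plausible way to narrow the list.
SELECT pid, usename, state, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY backend_start;

-- Then dump that backend's memory context tree into the server log
-- (299392 stands in for whatever PID the query above returned):
SELECT pg_log_backend_memory_contexts(299392);
```

The function returns immediately; the actual context dump lands in the server log of the target backend, so that's where you go to read the result.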