Our team members built and operated five managed PostgreSQL services over the past 15 years. Across all of them, one configuration has remained constant: strict memory overcommit. In this blog post, we will explain how strict memory overcommit protects your database from catastrophic OOM (out of memory) kills. We will also share how a three-character kernel bug forced us to temporarily disable this setting. Finally, we will explain our heuristic for determining the right memory overcommit limit. Hopefully, this will help you find the right setting for your workloads.
For most processes, handling an OOM kill is simple: the process restarts, reconnects, and picks up where it left off. PostgreSQL is different. PostgreSQL's postmaster (its main supervisor process) forks a backend process for each connection. These backends share memory segments that hold shared buffers, WAL buffers, lock tables, and other shared state. The OOM killer doesn't understand this architecture. It simply picks a process based on an heuristic (usually the process that uses the most memory) and terminates it. If that backend was modifying a shared memory segment, the segment may be left in an inconsistent state. Shared memory has no transactional guarantees at the OS level. A half-written page in shared buffers means silent data corruption. PostgreSQL's postmaster knows this. When it detects that any of its child processes has been killed, it assumes the worst: shared memory may be corrupted. When shared memory is corrupted, there is a risk of corrupting the stored data as well. To prevent this, the postmaster terminates all remaining backends. Every active connection is dropped. Every in-flight transaction is aborted. On its next start, the database goes through crash recovery. This is the correct behavior. PostgreSQL is protecting your data. But it means a single OOM kill doesn't just affect one connection. It takes down every connection on the server. On top of that, if the write volume was high, replaying all WAL files for crash recovery can take a long time. This means a single out of memory case can cause long outages.
Linux allows processes to allocate more virtual memory than what is physically available. When a process allocates memory, for example with malloc(), the kernel reserves virtual address space for it. However, the kernel does not immediately back that space with physical memory. Physical pages are only consumed when the process actually touches the memory. The kernel relies on the assumption that not all allocated memory will be actively used at the same time. Usually, this assumption holds. When it doesn’t, the kernel invokes the OOM killer to free memory by terminating a process.
When allocation fails with ENOMEM error code. PostgreSQL handles this gracefully. A backend that cannot allocate memory reports an error to the client, cancels the transaction, and continues. The postmaster stays up. Other connections remain unaffected. This is a routine error, not a catastrophe. The trade-off is that strict overcommit converts late, destructive failures into early, graceful ones. This trade-off works best when the machine is dedicated to PostgreSQL and a small set of known sidecar processes. In that scenario, the committed memory profile is predictable and the limit can be tuned with confidence. On shared machines running diverse workloads, committed memory becomes harder to predict. An unrelated process can use up the commit budget. This can make PostgreSQL get an ENOMEM error, even if the database load is fine.
Under strict overcommit, the kernel has two knobs to set CommitLimit: overcommit_kbytes and overcommit_ratio. The CommitLimit is calculated as:
It is possible to configure how the kernel behaves when processes ask for memory. Linux provides three overcommit policies via vm.overcommit_memory:
A Kernel Bug and 648 GB of Phantom Memory
We always favored strict overcommit for PostgreSQL. We used it in previous managed PostgreSQL services we built and also in Ubicloud PostgreSQL. However, after enabling it this time, we quickly ran into trouble. A few weeks after we turned on strict memory overcommit, we started to get failures on some of the databases. They showed out of memory errors, even though there was plenty of free physical memory on the machines. We disabled strict memory overcommit and started investigating.
Discovery The first clue came from a routine check of /proc/meminfo on one of our servers with 8 GB memory: $ > cat /proc/meminfo | grep "Committed_AS" Committed_AS: 683547672 kB 651 GB of committed memory on an 8 GB machine! For comparison, a healthy server of the same size showed: $ > cat /proc/meminfo | grep "Committed_AS" Committed_AS: 2703940 kB The counter was off by orders of magnitude.
... continue reading