At Recall.ai we run an unusual workload. We record millions of meetings every week. We send meeting bots to calls so our customers can automate everything from meeting notes, to keeping the CRM up to date, to handling incidents, to providing live feedback on the call, and more.
Processing TB/s of real-time media streams is the thing we get asked about most. However, an often-overlooked property of meetings is their unusual synchronization: meetings start on the hour or the half hour, with the vast majority on the full hour. It sounds obvious when said aloud, but the implications of this have rippled through our entire media processing infrastructure.
This is a picture of our load pattern. The y-axis is the number of EC2 instances in our fleet. Those large spikes are the bursts of meetings that we need to capture. And when the meeting starts, the compute capacity must be ready to process the incoming data, or it will be lost forever.
The extreme gradient of these spikes has led us into bottlenecks at almost every layer of the stack, from ARP to AWS. This is the story of a stubbornly mysterious issue that led us to deeply examine postgres internals (again) and uncover an often-overlooked postgres bottleneck that only rears its head at extremely high scale.
TL;DR
Every postgres server starts and ends with the postmaster process. It is responsible for, amongst other things, spawning and reaping the child processes that handle connections and run parallel workers. The postmaster runs a single-threaded main loop. With high worker churn, this loop can consume an entire CPU core, slowing down connection establishment, parallel queries, signal handling and more. This caused a rare, hard-to-debug issue where some of our EC2 instances would get delayed by 10-15s, waiting on the postmaster to fork a new backend to handle their connection.
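To make that concrete: a quick way to check whether the postmaster itself is the bottleneck is to sample its CPU usage directly. The sketch below does this with Python and psutil, reading the postmaster's PID from the first line of postmaster.pid; the data directory path is a placeholder, not our setup.

```python
# Sketch: sample the postmaster's CPU usage to check whether its
# single-threaded main loop is saturating a core.
# The data directory path below is a placeholder; adjust for your setup.
import psutil

PGDATA = "/var/lib/postgresql/data"

# The first line of postmaster.pid holds the postmaster's PID.
with open(f"{PGDATA}/postmaster.pid") as f:
    postmaster_pid = int(f.readline().strip())

postmaster = psutil.Process(postmaster_pid)

# cpu_percent() is measured against a single core, and the postmaster is
# single-threaded, so a reading near 100% means its main loop is pegged.
while True:
    usage = postmaster.cpu_percent(interval=1.0)
    print(f"postmaster CPU: {usage:.0f}% of one core")
```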
Slow connections to postgres
Months ago we got alerted to a large spike of delayed EC2 instances. We immediately investigated, only to find that all of them were actually ready and waiting. We initially suspected a slow query had caused the delay, but we ruled this out. Eventually we traced the delay to additional time spent connecting to postgres.
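Separating "slow query" from "slow connection" comes down to timing the two phases independently. Here is a minimal sketch of that measurement using psycopg2, with placeholder connection parameters:

```python
# Sketch: time connection establishment separately from query execution
# to tell a slow connection apart from a slow query.
# Connection parameters are placeholders.
import time
import psycopg2

start = time.monotonic()
conn = psycopg2.connect(host="db.internal", dbname="app", user="app", password="...")
connected = time.monotonic()

with conn.cursor() as cur:
    cur.execute("SELECT 1")
    cur.fetchone()
queried = time.monotonic()

print(f"connect: {connected - start:.3f}s, query: {queried - connected:.3f}s")
conn.close()
```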
Postgres has its own binary wire protocol. The client sends a startup message, to which the server responds with an auth request.
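To see where that time goes, you can speak the first step of the protocol by hand: open a TCP socket, send a protocol 3.0 StartupMessage, and time the TCP connect and the server's first reply separately. A minimal sketch, with placeholder host, user and database names:

```python
# Sketch: time the TCP handshake and the startup-message round trip
# separately, to see which phase is slow.
# Host, port, user and database are placeholders.
import socket
import struct
import time

HOST, PORT = "db.internal", 5432

# Build a protocol 3.0 StartupMessage: Int32 length, Int32 version (196608),
# then null-terminated key/value pairs, ending with an extra null byte.
params = b"user\x00app\x00database\x00app\x00\x00"
body = struct.pack("!i", 196608) + params
startup = struct.pack("!i", len(body) + 4) + body

t0 = time.monotonic()
sock = socket.create_connection((HOST, PORT))
t1 = time.monotonic()  # TCP connection established

sock.sendall(startup)
first_reply = sock.recv(1)  # first byte of the server's auth request ('R')
t2 = time.monotonic()

print(f"TCP connect: {t1 - t0:.3f}s, startup response: {t2 - t1:.3f}s")
sock.close()
```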
What we observed was truly bizarre: the client would successfully establish a TCP connection to postgres, but the startup message would only receive a response after 10s. Here is an example of what we saw: