At Recall.ai, we run an unusual workload. We record millions of hours of meetings every month. Each of these meetings generates a large amount of data we need to reliably capture and analyze. Some of that data is video, some of it is audio and some of it is structured data – transcription, events and metadata.
The structured data gets written to our Postgres database by tens of thousands of simultaneous writers. Each of these writers is a “meeting bot”, which joins a video call and captures the data in real-time.
We love Postgres and it lives at the heart of our service! But this extremely concurrent, write-heavy workload resulted in a stalled-out Postgres. This is the story of what happened, how we ended up discovering a bottleneck in the LISTEN/NOTIFY feature of Postgres (the event notifier that runs based on triggers when something changes in a row), and what we ended up doing about it.
TL;DR
When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits. Under many concurrent writers, this results in immense load and major downtime. Don’t use LISTEN/NOTIFY if you want your database to scale to many writers.
How we discovered it
Between the dates 2025-03-19 to 2025-03-22 our core Postgres database experienced three periods of downtime. Each of these downtimes had a similar set of symptoms:
Massive spikes in database load and active awaiting sessions
Database query throughput drops massively
Database CPU, disk I/O and network traffic all plummeted
... continue reading