In my opinion, the worst bugs are the ones you decide not to look further into.
Maybe the problem only manifests 2% of the time, or when you run the code on a particular brand of toaster. Maybe the customer doesn't supply enough information to begin narrowing down the cause. Maybe you just don't have enough time to track the issue down fully.
And everyone eventually moves on. But when you see the same bug strike again months later, you quietly mutter to yourself and wish you'd been persistent the first time. This is one of those stories.
The ClickPipes team had encountered a bug with logical replication slot creation on Postgres read replicas: a query that normally completes in a few seconds ran for hours, and it couldn't be terminated by any of the usual methods in Postgres, frustrating customers and putting the stability of their production databases at risk. In this blog post, I'll walk through how I investigated the problem and ultimately traced it to a Postgres bug, how we fixed it, and what our experience was like working with the Postgres community.
At ClickPipes, our charter is simple: allow customers to easily move lots of data from their existing databases into ClickHouse.
For the uninitiated, change data capture (CDC) is the process of continuously tracking database changes, allowing ClickPipes to replicate them to ClickHouse in near real-time. Postgres supports CDC primarily through logical replication slots, which decode changes from the write-ahead log (WAL) and stream them to consumers for replay. They act as an intermediary layer, transforming raw WAL data into consumable change events. The Postgres ClickPipe is built around logical replication, creating slots and using them to read from hundreds of customer databases.
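To make that concrete, here is a minimal sketch of the SQL involved in working with a logical replication slot. The slot name is made up, and it uses the built-in test_decoding plugin (rather than pgoutput, which ClickPipes uses) purely so the decoded changes are readable from a plain SQL session:

-- Create a logical replication slot; Postgres starts retaining WAL from this point.
SELECT * FROM pg_create_logical_replication_slot('example_slot', 'test_decoding');

-- After some writes, peek at the decoded change events without consuming them
-- (pg_logical_slot_get_changes would consume them and advance the slot).
SELECT * FROM pg_logical_slot_peek_changes('example_slot', NULL, NULL);

-- Drop the slot when it's no longer needed; abandoned slots keep WAL from being recycled.
SELECT pg_drop_replication_slot('example_slot');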
When it all went awry, seconds before an unkillable query
It all began when one of our biggest PeerDB customers pinged us, letting us know that they had set up a pipe to replicate data from their new Postgres read replica, but it appeared to have become "stuck." Keeping a few possibilities in mind, I pulled up their instance and saw nothing wrong—just a single active connection that was doing… something.
SELECT pid, backend_start, xact_start, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE query LIKE '%pg_create_logical_replication_slot%'
  AND pid != pg_backend_pid();
-[ RECORD 1 ]---+---------------------------------------------------------------------------
pid             | 5074
backend_start   | 2025-06-27 14:57:01.458979+05:30
xact_start      | 2025-06-27 14:57:47.819348+05:30
wait_event_type |
wait_event      |
state           | active
query           | SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'pgoutput');
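For reference, the standard way to stop a runaway backend like this is to cancel or terminate it by PID (5074 in the output above). As mentioned earlier, neither had any effect on this stuck slot creation:

-- Ask the backend to cancel its current query (sends SIGINT).
SELECT pg_cancel_backend(5074);

-- Terminate the backend entirely (sends SIGTERM).
SELECT pg_terminate_backend(5074);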