We Made Postgres Writes Faster, but it Broke Replication

By Stu Hood, Ming Ying, Mathew Pregasen, and Olive Ratliff on June 30, 2025

When we built pg_search, the Postgres extension for search and analytics, write throughput was a priority. To be an effective alternative to Elasticsearch, we needed to support high ingest workloads in real time. This is because many Elasticsearch use cases — like real-time dashboards, e-commerce search, and recommendation systems — involve continuous writes that must be indexed and made searchable immediately.

In a vanilla Postgres implementation, full-text search is backed by a B-tree or GIN (Generalized Inverted Index) structure. These indexes offer relatively fast lookups, but they are not optimized for writes. We opted for a data structure optimized for writes: a Log-Structured Merge (LSM) tree.

That was great for write throughput, but it broke Postgres replication! Specifically, it broke physical replication, one of the two mechanisms that allow Postgres to replicate data from a primary node across one or more read replicas. The other mechanism is logical replication, which sends individual row changes to replicas instead of copying the database byte-for-byte.

It turned out that Postgres's out-of-the-box support for physical replication, built on Write-Ahead Log (WAL) shipping, isn't quite enough for an advanced data structure like an LSM tree to be replication-safe. We were surprised to learn this, so we decided to write up our experience and describe how we fixed the problem. In this post, we'll do a deep dive into:

- What is an LSM tree?
- What it means for an LSM tree to be replication-safe
- How Postgres' WAL shipping guarantees physical consistency
- Why atomic logging was necessary for physical consistency
- How we leveraged a little-known but powerful Postgres setting called hot_standby_feedback

What is an LSM Tree?

A Log-Structured Merge Tree (LSM tree) is a write-optimized data structure commonly used in systems like RocksDB and Cassandra. The core idea behind an LSM tree is to turn random writes into sequential ones. Incoming writes are first stored in an in-memory buffer called a memtable, which is fast to update. Once the memtable fills up, it is flushed to disk as a sorted, immutable segment file (often called an SSTable).

These segment files are organized by size into layers or levels. Newer data is written to the topmost layer. Over time, data is gradually pushed down into lower levels through a process called compaction, where data from smaller segments is merged, deduplicated, and rewritten into larger segments.
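To make the write path concrete, here is a minimal sketch of an LSM tree in Rust. It is not pg_search's implementation; the types and field names are invented for illustration, and real systems add write-ahead logging, bloom filters, and leveled compaction on top of this skeleton.

```rust
use std::collections::BTreeMap;

/// A sorted, immutable run of key-value pairs (an "SSTable" / segment).
type Segment = Vec<(String, String)>;

struct LsmTree {
    memtable: BTreeMap<String, String>, // in-memory buffer, fast to update
    memtable_limit: usize,
    segments: Vec<Segment>, // flushed segments, newest last
}

impl LsmTree {
    fn new(memtable_limit: usize) -> Self {
        Self { memtable: BTreeMap::new(), memtable_limit, segments: Vec::new() }
    }

    /// Writes always go to the memtable; random writes become sequential
    /// I/O when the memtable is flushed as one sorted segment.
    fn put(&mut self, key: &str, value: &str) {
        self.memtable.insert(key.to_string(), value.to_string());
        if self.memtable.len() >= self.memtable_limit {
            self.flush();
        }
    }

    /// Flush the memtable to a new immutable, sorted segment.
    fn flush(&mut self) {
        let segment: Segment = std::mem::take(&mut self.memtable).into_iter().collect();
        self.segments.push(segment);
    }

    /// Compaction: merge all segments into one, keeping only the newest
    /// value for each key (deduplication).
    fn compact(&mut self) {
        let mut merged = BTreeMap::new();
        // Iterate newest-to-oldest so the first value seen for a key wins.
        for segment in self.segments.iter().rev() {
            for (k, v) in segment {
                merged.entry(k.clone()).or_insert_with(|| v.clone());
            }
        }
        self.segments = vec![merged.into_iter().collect()];
    }

    /// Reads consult the memtable first, then segments from newest to oldest.
    fn get(&self, key: &str) -> Option<&String> {
        if let Some(v) = self.memtable.get(key) {
            return Some(v);
        }
        for segment in self.segments.iter().rev() {
            if let Ok(i) = segment.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
                return Some(&segment[i].1);
            }
        }
        None
    }
}
```

The key property is visible in put(): every write is a cheap in-memory insert, and disk I/O happens only in large, sequential flushes and compactions.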
What Do We Mean by Replication Safety?

A reliable distributed datastore (one which guarantees “replication safety”) must demonstrate both physical and logical consistency across database replicas. Physical consistency means the replica contains structurally valid data — each page or block on disk is well-formed and corresponds to a state that did exist on the primary at some point. Logical consistency ensures that the data on the replica reflects a coherent and stable view of the database, something that could have been seen by a transaction on the primary.

A physically consistent state is not always a logically consistent state. Specifically, if you take a snapshot of a physically consistent replica while replicating an in-flight transaction, it may not be logically consistent.

A good analogy is to imagine replicating a book. Physical consistency is like copying every page exactly, even if you're in the middle of a chapter — you're guaranteed to have real pages, but you might end up with half a sentence or a missing footnote. Logical consistency is like waiting until the chapter is finished before copying it, ensuring the result makes sense to a reader.

WAL Shipping: How Postgres Guarantees Physical Consistency

In a primary-standby physical replication setup, a primary server is paired with a standby server that acts as a read-only copy of its leader. The servers remain in sync by using a Write-Ahead Log (WAL) to record all binary-level changes to storage blocks on the primary server before they occur. Changes to this append-only WAL are then streamed to the standby (a process called “log shipping”) and applied in the order received. This process enables near-realtime data synchronization between the two servers, hence the phrase “hot standby”.

Why Atomicity is a Requirement for Physical Consistency

Atomicity is a requirement for physical consistency because Postgres locks are not replayed on replica servers. Replaying every lock taken on the primary would require strict timing synchronization, significantly impacting performance and hindering the ability of the standby to serve reads. Instead, WAL replay uses per-buffer locks to apply edits incrementally, in the order they appear in the log: it acquires an exclusive lock on the buffer (the in-memory representation of a block), makes its change, and then releases it.

The problem arises when modifying data structures that span many Postgres buffers. Without the guarantee that operations are atomic over the entire structure, these modifications can lead to structural corruption.

For example: pg_search uses an unrolled linked list of Postgres buffers where each node holds the read validity of a batch of segments in our LSM tree. To ensure that the primary can never observe a broken linked list, we use hand-over-hand locking (also known as lock coupling) to guarantee that the list remains physically consistent on the primary. After each buffer in the list is modified, its WAL entry becomes visible atomically on the replica.

But what happens when we want to edit multiple entries in the list “at once” (atomically), such as when a newly compacted segment replaces multiple old segments? If only the primary server mattered, we could protect the logical integrity of multiple list nodes by applying a global lock on the list itself, ensuring that the contents of the list were only visible in a valid state. But replicas don't have access to global locks, so there's no way to coordinate edits across multiple nodes (and multiple buffers) at once. Instead, for multi-node operations pg_search uses a Copy-on-Write (CoW) clone of the list, and atomically swaps in the head. More generally, atomic operations insulate you from danger by eliminating reliance on coarse-grained locks.
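As a rough in-memory analogy (the real structure lives in WAL-logged Postgres buffers; the names below are hypothetical), the copy-on-write approach looks like this in Rust: writers build a complete new version of the segment list off to the side, and publishing it is a single swap of the head, so readers only ever observe one complete version or the other. A replica sees the same thing when it replays the single WAL record that updates the head block.

```rust
use std::sync::{Arc, RwLock};

/// Metadata for one batch of LSM segments (hypothetical fields).
#[derive(Clone, Debug)]
struct SegmentBatch {
    segment_ids: Vec<u64>,
}

/// The list of visible segment batches. In pg_search this is an unrolled
/// linked list of Postgres buffers; here it is just an immutable Vec.
type SegmentList = Arc<Vec<SegmentBatch>>;

struct SegmentIndex {
    // The "head": the only mutable cell. Readers briefly take the read
    // lock to grab an Arc to the current list, then read with no locks held.
    head: RwLock<SegmentList>,
}

impl SegmentIndex {
    fn new() -> Self {
        Self { head: RwLock::new(Arc::new(Vec::new())) }
    }

    /// Readers always see a complete, immutable version of the list.
    fn snapshot(&self) -> SegmentList {
        Arc::clone(&self.head.read().unwrap())
    }

    /// Compaction: replace several old segments with one new segment.
    /// The new list is built off to the side (copy-on-write); only the
    /// final head swap is visible to readers, and it happens in one step.
    fn replace_segments(&self, old_ids: &[u64], new_batch: SegmentBatch) {
        let current = self.snapshot();
        let mut next: Vec<SegmentBatch> = current
            .iter()
            .filter(|b| !b.segment_ids.iter().any(|id| old_ids.contains(id)))
            .cloned()
            .collect();
        next.push(new_batch);
        // The "atomic swap": a single assignment of the head.
        *self.head.write().unwrap() = Arc::new(next);
    }
}

fn main() {
    let index = SegmentIndex::new();
    index.replace_segments(&[], SegmentBatch { segment_ids: vec![1, 2, 3] });
    // A compaction merges segments 1-3 into segment 4.
    index.replace_segments(&[1, 2, 3], SegmentBatch { segment_ids: vec![4] });
    println!("{:?}", index.snapshot());
}
```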
A Problem: Vacuums Undermine Logical Consistency

Adapting algorithms to work atomically at the block level is table stakes for physical replication: if you don't do it, your data structures are broken, and you won't be able to use them consistently. But even when individual WAL operations apply atomically to your data structures, VACUUM can interfere with concurrent transactions that span multiple WAL entries and compromise logical consistency.

To illustrate this problem, let's imagine the primary has a table with some number of rows. To allow concurrent transactions to write safely without blocking each other, Postgres uses a mechanism called Multi-Version Concurrency Control (MVCC), which creates multiple versions of a modified row (or tuple) instead of updating the tuple in place. When a row is updated or deleted, the previous tuple is not immediately removed but instead flagged as “dead”. Those “dead” tuples remain on disk until a periodic maintenance operation called VACUUM is run.

Like any other operation, VACUUMs are recorded in the WAL, shipped to the standby, and replayed there. Since the “deadness” of a tuple is determined locally on each server but the VACUUM operation is replayed globally, faults can occur when a tuple is VACUUMed from the standby prematurely. The standby might be mid-read, iterating (across multiple WAL entries) over a tuple that it considers valid, while the primary might concurrently decide to VACUUM that tuple out of existence. The standby, lacking any locking coordination or awareness of concurrent operations, replays the VACUUM while the prior transaction is still in progress. This can lead to query failure if a long-running query attempts to access a VACUUMed tuple.

Why LSM Trees are Particularly Vulnerable to this Problem

If your Postgres is configured with read replicas and experiences a high volume of writes, you may have already seen this problem, even when using B-tree indexes. If a VACUUM is running on the primary at the same time that a query hits a read replica, it's possible for Postgres to abort the read (the familiar “canceling statement due to conflict with recovery” error). However, these errors are likely infrequent and tolerable in a typical Postgres setup where VACUUMs run once every few hours.

The same cannot be said for an LSM tree, where compaction is a core, ongoing part of the system. In a high-write throughput system, compaction can happen many times per minute, even per second. This increases the number of opportunities for conflicts to occur. Similar to a VACUUM, a compaction rewrites data on the primary server, and needs to know when ongoing queries will no longer need that data so that it can safely delete old segments.

The Logical Solution: Hot Standby Feedback

This is where an optional Postgres setting called hot_standby_feedback comes in. When enabled, hot_standby_feedback allows the standby to tell the primary what data is safe to clean up from the replica's point of view. This information significantly reduces the likelihood of a tuple being prematurely VACUUMed, and allows pg_search to determine when it is safe to delete segments.

To understand what information hot_standby_feedback actually relays, we must first understand how tuple versioning works in Postgres. Every tuple in Postgres has two key metadata attributes: xmin and xmax. xmin stores the Transaction ID (XID) of the transaction that created or inserted that specific tuple version, while xmax stores the XID of the transaction that either updated or deleted the tuple version, effectively marking it as obsolete. When a tuple is deleted, the xmax value is updated with the XID of the deleting transaction. Since XIDs are assigned sequentially, later transactions receive larger XIDs, so another way to think about xmin is as a proxy for the tuple's “creation time” and xmax for its “last updated or deleted time”.

When hot_standby_feedback is enabled, the replica will periodically communicate the smallest xmin (oldest “creation time”) that any of its active queries is currently pinned to. This xmin identifies the oldest tuple still in use on the standby. Armed with this information, the primary can make smarter decisions about when to permit cleanup operations (i.e. VACUUMs). If it sees that a standby query is still operating on a tuple that would otherwise be considered “dead,” it can defer cleanup until that query has finished.
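To make the bookkeeping concrete, here is a simplified Rust model of that feedback loop. It is not how Postgres is implemented internally (commit status, subtransactions, and wraparound are all ignored, and the function names are invented), but it captures the arithmetic: the standby reports the smallest xmin its queries still need, and the primary folds that into its cleanup horizon.

```rust
/// A transaction ID; in Postgres these are assigned sequentially.
type Xid = u64;

/// One tuple version, with the XID that created it and (optionally)
/// the XID that deleted or superseded it.
#[derive(Debug)]
struct TupleVersion {
    xmin: Xid,
    xmax: Option<Xid>,
    data: String,
}

/// Simplified rule: a version may be cleaned up only if it was deleted
/// by a transaction older than the cleanup horizon, i.e. no snapshot
/// still running (on the primary or, with hot_standby_feedback, on the
/// standby) could still need it.
fn is_vacuumable(tuple: &TupleVersion, cleanup_horizon: Xid) -> bool {
    matches!(tuple.xmax, Some(xmax) if xmax < cleanup_horizon)
}

/// What the standby would periodically send back to the primary:
/// the smallest xmin that any of its active queries still depends on.
fn standby_feedback_xmin(active_query_xmins: &[Xid]) -> Option<Xid> {
    active_query_xmins.iter().copied().min()
}

fn main() {
    let tuples = vec![
        TupleVersion { xmin: 90, xmax: Some(100), data: "old row version".into() },
        TupleVersion { xmin: 100, xmax: None, data: "current row version".into() },
    ];

    // Without feedback, the primary only considers its own oldest snapshot.
    let primary_oldest_xmin: Xid = 120;

    // With hot_standby_feedback, a long-running standby query pins xmin = 95.
    let standby_xmin = standby_feedback_xmin(&[95, 130]).unwrap();

    // The primary's cleanup horizon is the older of the two.
    let horizon = primary_oldest_xmin.min(standby_xmin);

    for t in &tuples {
        // With horizon = 95, the version deleted at xmax = 100 is kept,
        // because the standby's query (xmin = 95) might still need it.
        // Without the feedback the horizon would be 120 and the old
        // version would be removed, breaking that query on WAL replay.
        println!("{:?} vacuumable: {}", t.data, is_vacuumable(t, horizon));
    }
}
```

In a real deployment, this behaviour is switched on by setting hot_standby_feedback = on in the standby's configuration.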
Final Thoughts

Even with the help of hot_standby_feedback, standby servers are fundamentally at the mercy of the WAL to provide instructions that are safe to execute in the order and at the moment they are received. The tension between local incentives and global requirements is just one of the things that makes full replication safety difficult to achieve in a distributed Postgres system.

To achieve physical consistency, pg_search implements an atomically logged LSM tree; to achieve logical consistency, we rely on hot_standby_feedback. This challenge was worth tackling because it enables the fastest possible search performance without sacrificing consistency. To see it in action, check out our documentation or our open source project!