How we used pre-allocation, O_DIRECT, and an SSD-aware journal to keep our local storage engine's writes crash-consistent without fsync.
Most storage engines pay fsync somewhere on the durable write path. We built a narrowly scoped single-node KV storage engine that does not call fsync for PUT or DELETE. The design relies on fixed-size preallocated files, pre-zeroed extents, O_DIRECT writes, and a journal whose commits are aligned to the SSD’s atomic-write unit.
This is not a general argument against fsync. It works because our durability contract is narrower than POSIX file semantics, our deployments are SSD-only, and the engine owns allocation, journaling, and recovery. In a 4 KB random-write benchmark on AWS i8g.2xlarge local NVMe, the engine reached 190,985 obj/s versus 116,041 obj/s for ext4 + O_DIRECT + fsync, roughly a 1.6x throughput advantage.
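To make the file-setup side of that concrete, here is a minimal sketch in C of pre-allocating and pre-zeroing a fixed-size file before opening it for O_DIRECT writes. The sizes and the function name are illustrative assumptions, not our engine's actual layout; the point is that all block allocation and extent conversion happens once, up front, so the hot path never changes filesystem metadata.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Illustrative sizes and names; the engine's real layout is not shown here. */
#define EXTENT_SIZE (1UL << 30)  /* one fixed-size 1 GiB data file */
#define CHUNK       (1UL << 20)  /* zero-fill in 1 MiB aligned chunks */
#define ALIGN       4096         /* O_DIRECT offset/length/buffer alignment */

int open_preallocated_extent(const char *path)
{
    /* O_DIRECT bypasses the page cache; every later read and write must
       use ALIGN-aligned offsets, lengths, and buffers. */
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;

    /* Reserve every block up front so the hot write path never triggers
       allocation (or the extent-map and free-space updates it drags in). */
    if (posix_fallocate(fd, 0, EXTENT_SIZE) != 0)
        goto fail;

    /* posix_fallocate leaves extents marked "unwritten"; the first write
       to each would still be a metadata update. Writing zeros once
       converts them to written extents, so steady-state writes touch
       data blocks only. */
    void *zeros;
    if (posix_memalign(&zeros, ALIGN, CHUNK) != 0)
        goto fail;
    memset(zeros, 0, CHUNK);
    for (off_t off = 0; off < (off_t)EXTENT_SIZE; off += CHUNK) {
        if (pwrite(fd, zeros, CHUNK, off) != (ssize_t)CHUNK) {
            free(zeros);
            goto fail;
        }
    }
    free(zeros);

    /* One fsync here, at setup time, makes the allocation itself durable;
       after this the write path never needs another. */
    if (fsync(fd) != 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```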
The cost of fsync
Let’s start with how object stores and databases handle fsync today. MinIO, in both single-node and distributed deployments, eventually writes to the local filesystem; each PUT issues fdatasync or fsync on the data part and on xl.meta, flushing both to the device. RocksDB’s WAL doesn’t sync by default, so applications that need crash-consistent semantics have to opt in. etcd is stricter: every Raft entry is fsynced on the way to disk, and the etcd paper calls this out as essential to Raft safety. Postgres fsyncs the WAL at commit and uses group commit to amortize per-commit latency.

Kafka is the outlier. By default it doesn’t fsync on the write path at all and leans entirely on replication for durability. The trade-off is weakened single-node data safety: data inside the power-loss window can be lost, and the cluster’s replication factor becomes the only line of defense.
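To make the RocksDB opt-in above concrete: durability there is a per-write flag. A minimal sketch using RocksDB's C API; put_durably is our illustrative wrapper, with error handling trimmed.

```c
#include <stdlib.h>
#include <string.h>
#include <rocksdb/c.h>

/* Sketch: RocksDB leaves the WAL unsynced unless the caller sets the
   sync flag, which turns this one write into a synced WAL commit. */
void put_durably(rocksdb_t *db, const char *key, const char *val)
{
    char *err = NULL;
    rocksdb_writeoptions_t *wo = rocksdb_writeoptions_create();
    rocksdb_writeoptions_set_sync(wo, 1);  /* fsync/fdatasync the WAL before returning */
    rocksdb_put(db, wo, key, strlen(key), val, strlen(val), &err);
    rocksdb_writeoptions_destroy(wo);
    if (err != NULL) {
        /* log and propagate in real code; trimmed in this sketch */
        free(err);
    }
}
```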
Getting fsync right is hard. Jepsen has surfaced fsync-related data-loss bugs in distributed systems many times. One recent example is NATS 2.12.1 losing data on the crash-recovery path (Jepsen NATS 2.12.1 analysis).
Correctness aside, fsync is also expensive to call. A single fsync on an SSD typically takes a few hundred microseconds to a couple of milliseconds. Flushing data from the page cache to the device is just one part of that. The unpredictable part is metadata. fsync doesn’t just sync the file’s data; it syncs every piece of metadata the file depends on: the inode, the directory entry, the extent map, all the way down to the filesystem journal commit. An fsync call that looks like it’s only touching a few KB can trigger an order of magnitude more I/O underneath.
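Those numbers are easy to sanity-check on your own hardware. A throwaway probe like the sketch below (ours, not part of the engine) times a 4 KB write plus fsync against whatever file you point it at; on a typical SSD the printed latencies land in the few-hundred-microsecond to low-millisecond range.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Rough fsync-latency probe: time a 4 KB write + fsync, repeated.
   Run it against a file on the device you actually care about. */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "fsync_probe.dat";
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 'x', sizeof buf);

    for (int i = 0; i < 100; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pwrite(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf) {
            perror("pwrite"); return 1;
        }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long us = (t1.tv_sec - t0.tv_sec) * 1000000L
                + (t1.tv_nsec - t0.tv_nsec) / 1000L;
        printf("write+fsync: %ld us\n", us);
    }
    close(fd);
    return 0;
}
```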
Tail latency is even harder to control. Beyond the filesystem journal flush, the actual latency of any given fsync call also depends on concurrent I/O on the same device, the journal’s current commit progress, and the SSD’s GC activity. Any one of those can push latency several times above the median.
Why we built our own engine
The fsync cost above becomes painful because a filesystem-backed object path turns each durable write into a filesystem transaction: file data, inode state, directory entries, extent maps, and journal commits all become part of the critical path. If we wanted crash-consistent writes without paying that cost on every PUT, we had to move the write-ahead boundary out of the filesystem and into a storage engine we control.
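What owning the write-ahead boundary buys is that a commit becomes one aligned device write instead of a filesystem transaction. The sketch below is our illustration of that idea, not the engine's actual format: a commit record padded to an assumed 4 KB atomic-write unit, written with a single O_DIRECT pwrite into a preallocated, pre-zeroed journal, carrying a checksum so recovery can reject torn or stale blocks. The record layout, the hard-coded unit size, and the toy checksum are all assumptions for illustration; a real journal would use CRC32C and discover the atomic-write unit from the device.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed atomic-write unit; a real engine reads this from the device
   (e.g. NVMe identify data) rather than hard-coding it. */
#define ATOMIC_UNIT 4096

/* Illustrative commit record: one journal entry per atomic block. */
struct commit_block {
    uint64_t seqno;        /* monotonically increasing commit number */
    uint32_t payload_len;
    uint32_t crc;          /* over the block with this field zeroed */
    uint8_t  payload[ATOMIC_UNIT - 16];
};

/* Toy checksum for the sketch; a real journal would use CRC32C. */
static uint32_t crc_block(const struct commit_block *b)
{
    struct commit_block tmp = *b;
    tmp.crc = 0;
    const uint8_t *p = (const uint8_t *)&tmp;
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof tmp; i++)
        sum = sum * 31 + p[i];
    return sum;
}

/* Append one commit to a preallocated, pre-zeroed journal opened with
   O_DIRECT. The single aligned pwrite is the durability point: no
   fsync, because no filesystem metadata changes, and the block lands
   whole or not at all within the device's atomic-write unit. */
int journal_commit(int fd, off_t off, uint64_t seqno,
                   const void *payload, uint32_t len)
{
    struct commit_block *blk;
    if (len > sizeof blk->payload)
        return -1;
    if (posix_memalign((void **)&blk, ATOMIC_UNIT, sizeof *blk) != 0)
        return -1;
    memset(blk, 0, sizeof *blk);
    blk->seqno = seqno;
    blk->payload_len = len;
    memcpy(blk->payload, payload, len);
    blk->crc = crc_block(blk);

    ssize_t n = pwrite(fd, blk, sizeof *blk, off);
    free(blk);
    return n == (ssize_t)sizeof *blk ? 0 : -1;
}
```

Recovery scans the journal in order, recomputes each block's checksum with the crc field zeroed, and stops at the first mismatch or stale sequence number, which is exactly the torn-write case the atomic-unit alignment is meant to bound.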