Sorted string tables (SST) from first principles

This blog is about how data is laid out on disk, specifically about the details of Sorted String Tables (SSTs). Let’s cut to the chase.

SSDs and memory

First, it’s important to understand how data gets from disk in to usable memory. Most cloud instances use SSDs, so we’ll focus on that instead of spinning disks.

If you haven’t read the initial frameworks blog post of this series I recommend starting with that. In this context, the main point to takeaway is that not all read amplification is created equal. All databases need to abide by the laws of physics: fetching data that’s already in memory is hundreds of times faster than fetching data from over the network. Data structure design is all about minimizing the amount of data you need to fetch from expensive storage tiers while reducing the memory overhead (this an application of the RUM conjecture).

pages on storage devices

A high performance data system needs to reduce the amount of unnecessary bytes read to serve a query, reading only the necessary (”hot”) data.

An ideal system would read exactly the bytes it needs for the data it wants, but the fundamental unit of I/O isn’t a byte. On SSDs, it’s a page (typically 4KB). This means that whether you request a single byte, a hundreds bytes, or four thousand bytes from your disk you’ll still get the 4KB.

You can verify this on your own machine:

# on MacOS, check the size of the page on your disk > stat -f %k / 4096

To see this in action, I ran an experiment measuring read latency for 1KB vs 4KB reads using Direct I/O (bypassing the OS page cache). Despite requesting 4x less data, the 1KB reads took about the same time as the 4KB reads (about 9.2µs) on my SSD:

... continue reading