What Does a Database for SSDs Look Like?
Over on X, Ben Dicken asked:
What does a relational database designed specifically for local SSDs look like? Postgres, MySQL, SQLite and many others were invented in the 90s and 00s, the era of spinning disks. A local NVMe SSD has ~1000x improvement in both throughput and latency. Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random. If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain?
How might we tackle this question quantitatively for the modern transaction-oriented database?
But first, the bigger picture. It’s not only SSDs that have come along since databases like Postgres were first designed. We also have the cloud, with deployments to excellent datacenter infrastructure, including multiple independent datacenters with great network connectivity between them, available to all. Datacenter networks offer 1000x (or more) increased throughput, along with latency in the microseconds. Servers with hundreds of cores and thousands of gigabytes of RAM are mainstream.
Applications have changed too. Companies are global, businesses are 24/7. Downtime is expensive, and that expense can be measured. The security and compliance environment is much more demanding. Builders want to deploy in seconds, not days.
Approach One: The Five Minute Rule
Perhaps my single favorite systems paper, The 5 Minute Rule… by Jim Gray and Franco Putzolu gives us a very simple way to answer one of the most important questions in systems: how big should caches be? The five minute rule is that, back in 1986, if you expected to read a page again within five minutes you should keep it in RAM. If not, you should keep it on disk. The basic logic is marginal: look at the cached page that’s least likely to be re-used. If it’s cheaper to keep that page in RAM until its next expected re-use than to reload it from storage, the cache should be bigger. If reloading it from storage is cheaper, the cache should be smaller [1]. Let’s update the numbers for 2025, assuming that pages are around 32kB [2] (this becomes important later).
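Stated slightly more formally (my framing of that argument, not a quote from the paper): a page belongs in RAM when holding it there until its next expected re-use costs less than one more read from storage, which gives a break-even re-use interval of

\[
t_{\text{break-even}} \approx \frac{\text{cost of one storage read}}{\text{cost of holding one page in RAM for one second}}.
\]

Pages expected to be re-used within that interval should stay in RAM; the rest can live on storage.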
The EC2 i8g.48xlarge delivers about 1.8 million read IOPS at this page size, at a price of around $0.004576 per second, or \(10^{-9}\) dollars per transfer (assuming we’re allocating about 40% of the instance price to storage). About one dollar per billion reads. It also has enough RAM for about 50 million pages of this size, costing around \(3 \times 10^{-11}\) dollars to store a page for one second.
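Plugging those figures into the break-even formula is simple enough to do as a back-of-envelope script. This is just a sketch using the approximate numbers above; the instance price, the 40% storage allocation, and the RAM cost per page-second are rough assumptions rather than measured values:

```python
# Back-of-envelope five minute rule update for 2025-ish numbers.
# All figures are the approximate ones quoted above for an EC2 i8g.48xlarge;
# treat them as rough assumptions, not measurements.

instance_price_per_second = 0.004576   # $/s for the whole instance
storage_share = 0.40                   # fraction of the price attributed to storage
read_iops = 1.8e6                      # ~32 kB random read IOPS

# Cost of a single 32 kB read from local SSD.
cost_per_read = instance_price_per_second * storage_share / read_iops  # ~1e-9 $

# Cost of holding one 32 kB page in RAM for one second (estimated above).
ram_cost_per_page_second = 3e-11       # $/page/s

# Keep a page in RAM if holding it until its next use is cheaper than
# re-reading it from the SSD: the break-even re-use interval is the ratio.
break_even_seconds = cost_per_read / ram_cost_per_page_second

print(f"cost per read: ${cost_per_read:.2e}")
print(f"break-even interval: {break_even_seconds:.0f} seconds")
```

With these rough numbers the break-even interval comes out around half a minute, rather than five minutes.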