
Building blobd: single-machine object store with sub-millisecond reads and 15 GB/s uploads

For a past content platform, I used S3 to serve user content such as videos and documents, with lots of small range requests for streaming and seeking. Despite serving from same-region datacenters 2 ms from the user, S3 took 30-200 ms to respond to each request. Even slight delays when jumping around quickly felt grating, and in UX every millisecond counts.

S3 also felt suboptimal for small objects, like thumbnails. They incur the same high TTFB as large objects, and per-request overhead begins to dominate in terms of throughput, pricing, and rate limits. For example, a dynamic webpage may have hundreds of thumbnails that need to be shown quickly. That might mean 100 billed GetObject calls, saturating S3 rate limits and internal connections 100x faster for a single page, and the page still feels unresponsive, since users decide and scroll past most thumbnails within a few milliseconds. At small sizes, the time spent handling requests dominates the actual transfer time.
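As a rough sketch of why the fixed per-request cost dominates at small sizes, here is back-of-the-envelope arithmetic. The 50 ms TTFB and 1 GB/s link speed below are illustrative assumptions, not measured S3 numbers:

```rust
// Effective throughput of one request = bytes / (fixed TTFB + transfer time).
// All inputs are hypothetical round numbers for illustration.
fn effective_throughput_bytes_per_sec(object_bytes: u64, ttfb_ms: u64, link_bytes_per_ms: u64) -> u64 {
    // Transfer time at link speed, rounded up to whole milliseconds.
    let transfer_ms = (object_bytes + link_bytes_per_ms - 1) / link_bytes_per_ms;
    let total_ms = ttfb_ms + transfer_ms;
    object_bytes * 1000 / total_ms
}

fn main() {
    // A 16 KiB thumbnail over a 1 GB/s link (1_000_000 bytes/ms), 50 ms TTFB:
    let thumb = effective_throughput_bytes_per_sec(16 * 1024, 50, 1_000_000);
    // A 1 GiB video over the same link and TTFB:
    let video = effective_throughput_bytes_per_sec(1 << 30, 50, 1_000_000);
    println!("thumb: {} B/s, video: {} B/s", thumb, video);
    // The fixed per-request cost crushes small-object throughput by orders of magnitude.
    assert!(thumb < video / 1000);
}
```

The large object amortizes the TTFB across a long transfer; the thumbnail spends almost all of its wall-clock time waiting rather than transferring.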

To improve on this, I set out to build a new object store from scratch, optimized for many low-latency random reads and for small objects. It would also be interesting to experiment with newer ideas:

Leverage newer things like io_uring, async Rust, and atomic writes.

I did not need to enumerate keys, so could I avoid tree-based data structures and gain constant-time lookups?

Given bare metal hardware, could we use block devices and direct I/O, bypassing filesystems and kernel caches?

As with any critical low-level system, keep it as simple as possible to set up, operate, and understand.
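The constant-time lookup idea above can be sketched as a flat in-memory index. This is a minimal illustration with hypothetical names, not blobd's actual index or on-disk layout: when keys never need to be listed in order, a hash map from key to a device extent gives O(1) lookups, where the B-tree or LSM indexes that support listing pay O(log n):

```rust
use std::collections::HashMap;

// A device extent: where an object's bytes live on the block device.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Extent {
    device_offset: u64,
    len: u64,
}

// Hypothetical sketch of a key index with no ordering: a plain hash map
// gives constant-time create/lookup, at the cost of key enumeration.
struct Index {
    entries: HashMap<Box<[u8]>, Extent>,
}

impl Index {
    fn new() -> Self {
        Index { entries: HashMap::new() }
    }
    fn create(&mut self, key: &[u8], extent: Extent) {
        self.entries.insert(key.into(), extent);
    }
    fn lookup(&self, key: &[u8]) -> Option<Extent> {
        self.entries.get(key).copied()
    }
}

fn main() {
    let mut idx = Index::new();
    idx.create(b"videos/cat.mp4", Extent { device_offset: 4096, len: 1 << 20 });
    assert_eq!(
        idx.lookup(b"videos/cat.mp4"),
        Some(Extent { device_offset: 4096, len: 1 << 20 })
    );
    assert_eq!(idx.lookup(b"missing"), None);
}
```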

In terms of design trade-offs, the object store would prioritize reads over writes. Among writes, I prioritized creates over updates, and updates over deletes. For reads, the focus is latency: low, constant latency regardless of object size or read offset and length. These priorities fit the typical user content platform, where reads are more frequent and performance-sensitive than writes, and content grows over time; updates and deletes are rare compared to creation.

From a physical-limits perspective, modern NVMe SSDs can do hundreds of thousands of random reads per second, and local datacenters are just 1-5 ms from the user. How close can the object store get to these raw numbers?
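One consequence of bypassing the filesystem with direct I/O, as raised above, is that offsets and buffer lengths must be aligned to the device's logical block size, so an arbitrary range read has to be widened to aligned boundaries and the payload sliced out afterwards. A minimal sketch of that arithmetic, assuming a 4096-byte block size (names hypothetical):

```rust
// Assumed logical block size; real code would query it from the device.
const BLOCK: u64 = 4096;

/// Widen (offset, len) to block-aligned boundaries for a direct I/O read.
/// Returns (aligned_offset, aligned_len, start_padding): issue the aligned
/// read, then return bytes [start_padding .. start_padding + len] to the caller.
fn align_read(offset: u64, len: u64) -> (u64, u64, u64) {
    // Round the start down to a block boundary.
    let aligned_offset = offset - offset % BLOCK;
    let padding = offset - aligned_offset;
    // Round the end up to a block boundary.
    let end = offset + len;
    let aligned_end = (end + BLOCK - 1) / BLOCK * BLOCK;
    (aligned_offset, aligned_end - aligned_offset, padding)
}

fn main() {
    // Reading 100 bytes at offset 5000 becomes one aligned 4 KiB read
    // starting at 4096; the payload begins 904 bytes into the buffer.
    assert_eq!(align_read(5000, 100), (4096, 4096, 904));
    // An already-aligned read is passed through untouched.
    assert_eq!(align_read(0, 4096), (0, 4096, 0));
}
```

Keeping reads to one aligned I/O per request is what makes the constant, size-independent latency goal plausible against the drive's raw random-read numbers.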
