ANN v3: 200ms p99 query latency over 100B vectors

January 21, 2026 • Nathan VanBenschoten (Chief Architect)

The pursuit of scale is not vanity. When you take existing systems and optimize them from first principles to achieve a step change in scalability, you can create something entirely new.

Nothing has demonstrated that more clearly than the explosion in deep learning over the past decade. The ML community took decades-old ideas and combined them with advancements in hardware, new algorithms, and hyper-specialization to forge something remarkable.

Both inspired by the ML community and in service of it, we recently rebuilt vector search in turbopuffer to support scales of up to 100 billion vectors in a single search index. We call this technology Approximate Nearest Neighbor (ANN) Search v3, and it is available now.

In this post, I’ll dive into the technical details behind how we built for 100 billion vectors. Along the way, we’ll examine turbopuffer’s architecture, travel up the modern memory hierarchy, zoom into a single CPU core, and then back out to the scale of a distributed cluster.

[Interactive charts: Latency, QPS]

Billion-scale ANN search

Let’s look at the numbers to get a sense of the challenge: 100 billion vectors, 1024 dimensions per vector, 2 bytes per dimension (f16). This is vector search over roughly 200 TB (about 186 TiB) of dense vector data. We want to serve a high rate (> 1k QPS) of ANN queries over this entire dataset, each with a latency target of 200ms or less.
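
As a quick sanity check on the sizing, the arithmetic can be written out in a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted above; the variable names are illustrative, and it counts raw vector bytes only (no index structures, metadata, or replication).

```python
# Back-of-the-envelope sizing from the figures quoted in the text.
NUM_VECTORS = 100 * 10**9   # 100 billion vectors
DIMENSIONS = 1024           # dimensions per vector
BYTES_PER_DIM = 2           # f16 is 2 bytes per dimension

bytes_per_vector = DIMENSIONS * BYTES_PER_DIM
total_bytes = NUM_VECTORS * bytes_per_vector

total_tb = total_bytes / 10**12   # decimal terabytes
total_tib = total_bytes / 2**40   # binary tebibytes

print(f"{bytes_per_vector} bytes per vector")
print(f"{total_tb:.1f} TB ({total_tib:.1f} TiB) of raw vector data")
```

Each vector takes 2 KiB, so the raw data comes to about 205 TB, or about 186 TiB when measured in binary units.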

With a healthy dose of mechanical sympathy, let’s consider how our hardware will run this workload and where it will encounter bottlenecks. If one part of the system bottlenecks (disk, network, memory, or CPU), other parts of the system will go underutilized. The key to making the most of the available hardware is to push down bottlenecks and balance resource utilization.

turbopuffer’s architecture is simple and opinionated. This simplicity makes the exercise tractable. turbopuffer’s query tier is a stateless layer on top of object storage, consisting of a caching hierarchy and compute. That’s it.
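
To make the idea of "a stateless layer on top of object storage" concrete, here is a minimal read-through cache sketch. It is not turbopuffer’s implementation: the class name, the two cache tiers (memory and local disk), and the use of plain dictionaries as stand-ins for each tier are all assumptions made for illustration.

```python
class ReadThroughCache:
    """Hypothetical tiered read path: memory -> local disk -> object storage.

    All three tiers are modelled as dicts. Only the object store is
    authoritative; the cache tiers can be dropped at any time.
    """

    def __init__(self, object_store):
        self.object_store = object_store  # source of truth (stand-in for a bucket)
        self.memory = {}                  # fastest, smallest tier
        self.disk = {}                    # stand-in for a local SSD cache

    def get(self, key):
        """Return (value, tier_that_served_it), filling faster tiers on a miss."""
        if key in self.memory:
            return self.memory[key], "memory"
        if key in self.disk:
            value = self.disk[key]
            self.memory[key] = value      # promote to the faster tier
            return value, "disk"
        value = self.object_store[key]    # slowest path: fetch from object storage
        self.disk[key] = value
        self.memory[key] = value
        return value, "object_storage"


store = {"segment-0": b"vector bytes"}    # hypothetical object key and payload
node = ReadThroughCache(store)
print(node.get("segment-0")[1])           # cold read falls through to object storage
print(node.get("segment-0")[1])           # warm read is served from memory

replacement = ReadThroughCache(store)     # a fresh node starts with empty caches
print(replacement.get("segment-0")[1])    # and rebuilds them from object storage
```

The point of the sketch is the statelessness: because the caches hold only copies, a node can be replaced without losing data, at the cost of cold reads until its caches warm up again.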
