Client-side load balancing at a million requests per second

How we built an in-process client-side load balancer for a million requests per second of internal fan-out traffic, what we layered on top (N-ring fade-in, occupancy-based bounded load, and AZ-aware routing with a latency health factor), and how hardening that path cut cost and made the service resilient to the infrastructure underneath it.

Our busiest API ran its high-volume internal traffic through the cluster's shared edge ingress load balancer. For years we could never be sure whether a latency spike came from our own code or from reusing that shared edge router internally.

In a previous post, we described how we built Zalando's Product Read API (PRAPI), serving millions of requests per second with single-digit-millisecond latency across 25 European markets. Every product page, search result, and checkout depends on it. Even a brief degradation has a measurable impact on sales, so its performance and availability requirements are high. The low latency is achieved through consistent-hash routing: Skipper, the cluster's edge load balancer, routes requests for the same product ID to the same pod(s), so the underlying application can make the most of its pod-local caches. That is why the routing infrastructure for this API matters.
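To make the idea of consistent-hash routing concrete, here is a minimal sketch in Python. It is a toy ring with virtual nodes, not Skipper's implementation; the class name `HashRing`, the pod names, the product IDs, and the choice of 100 virtual nodes per pod are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash. Python's built-in hash() is salted per process,
    # so it cannot be used for routing decisions shared across clients.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, illustrative only)."""

    def __init__(self, pods, vnodes=100):
        # Each pod is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pod_for(self, product_id: str) -> str:
        # Pick the first virtual node clockwise from the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._points, _hash(product_id)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical pods and product IDs.
pods = ["pod-a", "pod-b", "pod-c"]
ring = HashRing(pods)
ids = [f"product-{n}" for n in range(1000)]
before = {pid: ring.pod_for(pid) for pid in ids}

# Scale out by one pod and count how many product IDs change owner.
bigger = HashRing(pods + ["pod-d"])
moved = sum(1 for pid in ids if bigger.pod_for(pid) != before[pid])
print(f"{moved} of {len(ids)} product IDs moved to the new pod")
```

The property that matters for pod-local caches is visible in the last few lines: when a pod is added, a product ID either stays where it was or moves to the new pod, so most cached entries remain useful.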

At launch, Skipper handled both edge routing and the internal traffic between our batching and single-get components. It was always my intention that client-side load balancing (CSLB) would replace the latter, and I had hoped it would be a fast-follow. But Skipper was fast, adding only a couple of hundred microseconds to each request, and the team was understandably reluctant to introduce a significant change to a working system. Over the years, as incidents accumulated where the root cause was never quite clear (Skipper, or PRAPI?), the structural problem became harder to ignore. For a single batch-of-100 request, PRAPI had a 100x exposure to Skipper. When Skipper sneezed, PRAPI got the flu.

Some of those incidents, it would later turn out, were neither Skipper nor PRAPI. But we had no way to see that until we owned the routing decision, and the detailed logs that came with it.

Skipper and the Fan-Out Problem

Skipper is Zalando's open-source Kubernetes ingress controller and HTTP router. It handles edge load balancing brilliantly: consistent-hash routing, bounded load protection, fade-in for new pods. We contributed key features to Skipper ourselves, including minimising cache loss during scaling and preventing overload from hyped products. We still use Skipper for all single-product GET requests today.

The problem was our batch endpoint. PRAPI's product-sets component unpacks a single batch request into up to 100 parallel downstream calls to individual products pods. Each of those 100 calls transits Skipper. Skipper adds only a couple of hundred microseconds per hop, but a batch waits on up to a hundred of those hops at once, so its latency tracks the slowest of the hundred, not the typical one. And Skipper is shared infrastructure: our traffic runs through the same fleet as the rest of the cluster, under a global configuration we inherit rather than set.
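The "slowest of the hundred" effect can be put into numbers with a short calculation. The sketch below assumes the hops are independent and uses made-up per-hop probabilities of a slow response; neither the independence assumption nor the probabilities come from PRAPI measurements, and the function name is hypothetical.

```python
def p_batch_hits_slow_hop(p_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent hops is slow.

    Assumes hops are independent, which real traffic through a shared
    router only approximates.
    """
    return 1.0 - (1.0 - p_slow) ** fan_out


# Illustrative per-hop probabilities of a slow response (not measured values).
for p_slow in (0.0001, 0.001, 0.01):
    single = p_batch_hits_slow_hop(p_slow, 1)
    batch = p_batch_hits_slow_hop(p_slow, 100)
    print(f"per-hop {single:.2%} -> batch of 100 {batch:.2%}")
```

Under these assumptions, a hop that is slow 1% of the time makes a full batch of 100 slow roughly 63% of the time. A per-hop tail that looks negligible in isolation dominates the latency of the batch.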

Figure: Product-sets fan-out through the ingress load balancer

During incidents, we could never be certain whether latency spikes originated in Skipper or in our own code. It sat in the hot path of every request, we did not run it, and we could not cleanly separate its behaviour from ours. Even when Skipper was fast, that shared fate was the problem.
