Speculative KV coding: losslessly compressing KV cache by up to ~4×

The size of LLM context grows by the day. KV caching is what makes running those long contexts affordable: it trades compute for memory so the model doesn’t re-prefill work it has already done. But as agentic workflows push contexts ever longer, storing and moving the cache starts to dominate everything. To get to the next order of magnitude of LLM capability, we need it to be smaller.

You can make it smaller lossily. TurboQuant is a (somewhat controversial Performance exploration, accusations of academic misconduct.) recent example, dropping the bit-width of K and V and absorbing the resulting quality loss. The cost of that route is that the loss isn’t something you specify in advance: you find out what degraded by running evals and hoping they catch whatever you killed. Lossless compression sidesteps the question entirely, by reconstructing the cache exactly.

We explored lossless compression of LLM weights in a recent post. The numbers for KV cache look much the same: empirically, the bytewise entropy of a bf16 cache is about 11 bits per scalar, around 30% smaller than the raw representation. Worse, as we showed then That post was about weight compression, but KV cache is reasonably similar. See here for a nice paper., this slack collapses as the bitwidth comes down. FP4 weights are within 5-7% of saturating their format, and the same goes for caches stored at lower precision. Given that low bitwidths have such obvious benefits for performance, we ought to treat them as the baseline.

In this post we introduce Speculative KV coding, a method for losslessly compressing the KV cache of a large target model by up to ~ 4 × 4\times 4× This 4 × 4\times 4× is on top of lossy fp8 compression of the cache - the gross benefit is ~ 8 × 8\times 8×. using a cheaper predictor model — a faster model whose forward pass on the same prompt predicts what the target’s cache will be. By analogy with speculative decoding (Leviathan et al., 2022), the predictor runs in parallel on both encode and decode sides; an arithmetic coder then encodes the true cache at a bitrate set by how well the predictor fits the target.

A KV cache isn’t really random §

Quick refresher on entropy coding I’ve written about the mechanics of arithmetic coding before — see the rANS and tANS posts. For this post all you need is the bitrate formula above; the coder is a black box that achieves it.. You have a stream of symbols drawn from some distribution p p p. Shannon’s source coding theorem says the best any lossless coder can do, on average, is H ( p ) H(p) H(p) bits per symbol. Real coders work with a model distribution q q q instead of the true p p p, and end up paying H ( p , q ) = H ( p ) + K L ( p ∥ q ) H(p, q) = H(p) + \mathrm{KL}(p \,\|\, q) H(p,q)=H(p)+KL(p∥q) bits per symbol. The KL term is the slack: the price you pay for your model not being the truth.

The KV cache isn’t really a list of samples from a random source. The whole cache is the deterministic output of a forward pass through known weights on a known prompt. There’s exactly one tensor it can be. So the “true” distribution p p p is a delta on that tensor, and a delta has zero entropy. Every bit the coder spends is pure KL: H ( p , q ) = − ln ⁡ q ( K V t r u e ) H(p, q) = -\ln q(KV_\mathrm{true}) H(p,q)=−lnq(KVtrue). The bitrate is a direct measurement of how much weight your model q q q gives the correct KV cache.

What we need, then, is a calibrated model of the specific forward pass that produced one, in a form an arithmetic coder can consume.

What should q q q look like? §

Suppose for the moment that we had access to something that, given the prompt, produced a reasonable per-scalar prediction μ \mu μ of the KV cache and a calibrated sense σ 2 \sigma^2 σ2 of how wrong that prediction tends to be. The natural way to turn that into a real distribution is a Gaussian centred on μ \mu μ with variance σ 2 \sigma^2 σ2, so that q ( x ) = N ( x ; μ , σ 2 ) q(x) = \mathcal{N}(x;\, \mu, \sigma^2) q(x)=N(x;μ,σ2), in which case the cost (in nats per scalar) of encoding the true value splits cleanly into two terms:

... continue reading