Speculative decoding is one of the cleanest performance wins in inference optimisation: it’s lossless, it hits decode latency when not much else does, and in its standard formulation it’s simple and elegant.
It works by looking forwards: speculative decoding takes a position on what tokens will come next. For dense transformers the bet is riskless: accepted tokens pay off, rejected tokens cost nothing, a clean arbitrage on spare compute while decode is memory bound.
A recent burst of research activity has pushed the envelope on how far forwards we can take that bet: Eagle 3.1, DFlash and SSD, for example.
This post looks at two architectural shifts that have changed the underlying economics of speculation: what mixture-of-experts routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free.
Then it works through what they mean for when, and how far ahead, we should speculate.
The expert tax
FFN layers in older, dense transformers (like the venerable Llama series, which I wrote about before) have a simple roofline with batch size: arithmetic intensity climbs linearly with batch size as weights get reused across the batch, then flattens onto the compute ceiling.
The win for speculative decoding is clear. If you’re on the slope of the roofline you’re memory bound, and speculated tokens increase the amount of compute you’re doing without increasing the memory transfer. So both accepted & rejected tokens are free until they push you over the knee.
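To make the "free until the knee" argument concrete, here is a minimal roofline sketch for a dense FFN. It assumes a two-matrix FFN, counts only weight traffic (activations are ignored), and uses made-up hardware numbers; the function names and parameters are illustrative, not taken from any library.

```python
def dense_ffn_intensity(batch, d_model, d_ff, bytes_per_param=2):
    """FLOPs per byte of weight traffic for a two-matrix FFN (assumed shape)."""
    params = 2 * d_model * d_ff               # up projection + down projection
    flops = 2 * batch * params                # one multiply and one add per weight per token
    weight_bytes = params * bytes_per_param   # weights are read once, whatever the batch size
    return flops / weight_bytes


def knee_batch(peak_flops, mem_bandwidth, bytes_per_param=2):
    """Token count at which this weight-only model reaches the compute ceiling."""
    # intensity = 2 * batch / bytes_per_param; the knee is where it equals peak / bandwidth
    return peak_flops / mem_bandwidth * bytes_per_param / 2


def step_time(tokens, d_model, d_ff, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Modelled time for one FFN pass: the slower of compute and weight loading."""
    params = 2 * d_model * d_ff
    compute_time = 2 * tokens * params / peak_flops
    memory_time = params * bytes_per_param / mem_bandwidth
    return max(compute_time, memory_time)


# Hypothetical accelerator: 1e15 FLOP/s peak, 2e12 bytes/s of memory bandwidth.
PEAK, BANDWIDTH = 1e15, 2e12

# 8 sequences decoding one token each, versus the same 8 sequences
# each verifying 4 drafted tokens plus 1 (40 tokens through the FFN).
plain = step_time(8, 4096, 14336, PEAK, BANDWIDTH)
speculative = step_time(8 * 5, 4096, 14336, PEAK, BANDWIDTH)
```

Under these assumed numbers the knee sits at 500 tokens, so `plain` and `speculative` come out equal: the extra 32 tokens ride along on weight loads that were being paid for anyway. Push the token count past the knee and `step_time` starts growing with it.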
Modern models almost invariably (with some interesting exceptions) use mixture-of-experts (MoE) layers in place of simple dense FFNs. Each token passes first through a ‘routing’ layer, which orders the relevant experts by affinity. The token's hidden state is sent to the top $k$ experts, then the results are recombined.
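The route-then-recombine step can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the router logits are passed in directly (in a real model they would come from a learned layer over the hidden state), the weights are renormalised with a softmax over the chosen experts only (one common choice, not the only one), and each "expert" is an arbitrary callable.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-affinity experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:k]
    # Assumption: renormalise over the chosen experts only.
    weights = softmax([router_logits[e] for e in chosen])
    return list(zip(chosen, weights))


def moe_forward(hidden, router_logits, experts, k):
    """Send the hidden state to the top-k experts and sum their weighted outputs."""
    out = [0.0] * len(hidden)
    for expert_index, weight in route_top_k(router_logits, k):
        expert_out = experts[expert_index](hidden)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


# Toy experts that just scale their input (stand-ins for real FFNs).
toy_experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
output = moe_forward([1.0, -1.0], [0.1, 2.0, -1.0, 1.5], toy_experts, k=2)
```

With these logits the token goes to experts 1 and 3, and only those two experts' weights are touched. Which weights get read depends on the logits, which is what the roofline has to account for.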
This routing means that the arithmetic intensity of the MoE layer can depend on the actual content of the hidden state inputs, not just the shape. In practice, one training objective (for training and large-scale inference reasons) is to keep the experts balanced: that is, if $B$ tokens come in, each of the $E$ experts should process a fraction $B/E$ of the total.
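A small counting sketch shows how two batches of identical shape can load the experts very differently. The helper names are hypothetical, and the example uses one expert per token so that the balanced case lands exactly on $B/E$; with $k$ experts per token the per-expert count scales by $k$.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Tokens handled by each expert, given every token's chosen expert indices."""
    counts = Counter(e for chosen in assignments for e in chosen)
    return [counts.get(e, 0) for e in range(num_experts)]


def load_summary(assignments, num_experts):
    """How many experts are touched at all, and how evenly the work is spread."""
    load = expert_load(assignments, num_experts)
    return {
        "active_experts": sum(1 for n in load if n > 0),
        "mean_load": sum(load) / num_experts,
        "max_load": max(load),
    }


B, E = 8, 4

# Perfectly balanced routing: tokens are dealt round-robin across experts.
balanced = [[t % E] for t in range(B)]

# Degenerate routing: every token picks expert 0.
skewed = [[0] for _ in range(B)]

balanced_load = expert_load(balanced, E)   # [2, 2, 2, 2], i.e. B/E tokens each
skewed_load = expert_load(skewed, E)       # [8, 0, 0, 0]
```

Both batches hold 8 tokens, but the balanced one touches all four experts at 2 tokens each, while the skewed one reuses a single expert's weights 8 times. Same shape, different weight traffic and different reuse per expert.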