The EAGLE series — including EAGLE 1, EAGLE 2, and EAGLE 3 — has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems.
Today, the EAGLE team, vLLM team, and TorchSpec team are excited to jointly introduce EAGLE 3.1 — a major step forward in speculative decoding robustness, efficiency, and deployability.
EAGLE 3.1 Innovations
While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.
The EAGLE team traced this fragility to a phenomenon we call attention drift — as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens.
We identified two underlying issues. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps due to the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.
Figure 1: EAGLE 3 vs. EAGLE 3.1 architecture comparison. EAGLE 3.1 adds FC normalization after each target hidden state and feeds post-norm hidden states into the next decoding step.
To address this issue, EAGLE 3.1 introduces two key architectural improvements:
FC normalization after each target hidden state and before the FC layer
Feeding post-norm hidden states into the next decoding step
... continue reading