At Together AI, the AI Native Cloud, we’re obsessed with performance. Making large language models faster, cheaper, and more efficient is not a one-trick problem — it requires optimizing along multiple axes. That is the philosophy behind Together Turbo, our suite of inference innovations that draw from research in algorithms, architectures, and modeling recipes. We’re excited to introduce the AdapTive-LeArning Speculator System (ATLAS), the first speculator of its kind: it delivers performance improvements automatically, without any manual tuning.
ATLAS offers a new way of doing speculative decoding — one that dynamically improves at runtime — and it fits seamlessly alongside our other Turbo techniques like the proprietary Together Turbo Speculator or Custom Speculators. But why create an adaptive-learning speculator system?
Standard speculators are trained for general workloads. Custom speculators are trained on your specific data, but only for a specific snapshot in time. However, as the workload evolves (codebase grows, traffic patterns shift, request distributions change), even highly customized speculators can fall behind. In contrast, ATLAS evolves automatically with usage, learning from both historical patterns and live traffic to continuously align with the target model’s behaviors in real time. This means the more you use our inference service, the better ATLAS will perform!
Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully adapted scenario — 2.65x faster than standard decoding, outperforming even specialized hardware like Groq (Figure 1).
Figure 1: Decoding speed on NVIDIA HGX B200 with our Turbo speculator and the adaptive-learning speculator system for DeepSeek-V3.1 (top) and Kimi-K2-0905 (bottom) with Arena Hard traffic.1
1. Speculative Decoding
Speculative decoding is one of the most powerful levers for accelerating inference.2 Instead of having the target model generate every token step by step, a faster speculator (also known as the draft model) proposes multiple tokens ahead, and the target model verifies them in parallel in a single forward pass. The verification process ensures that the quality of the output matches the distribution of non-speculative decoding, while achieving speedups by accepting many tokens at a time.
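To make the draft-then-verify loop concrete, here is a minimal sketch of a single speculative decoding step in Python. The `draft_model` and `target_model` callables are hypothetical stand-ins for the real models (they simply return next-token probability distributions), and the acceptance rule is the standard rejection-sampling scheme, which is what guarantees the output matches the target model's distribution.

```python
import numpy as np

def speculative_step(target_model, draft_model, prefix, gamma=4, rng=None):
    """One speculative decoding step: draft `gamma` tokens, then verify them
    with the target model in a single parallel forward pass.

    Hypothetical interfaces (stand-ins for real models):
      draft_model(tokens)  -> 1D array: next-token probabilities after `tokens`
      target_model(tokens) -> 2D array [len(tokens), vocab]: row j holds the
                              target's next-token probabilities given tokens[:j+1]
    """
    rng = rng or np.random.default_rng()

    # 1) The cheap draft model proposes gamma tokens autoregressively.
    tokens, drafted, q_probs = list(prefix), [], []
    for _ in range(gamma):
        q = draft_model(tokens)
        t = int(rng.choice(len(q), p=q))
        drafted.append(t)
        q_probs.append(q)
        tokens.append(t)

    # 2) The target model scores prefix + drafted tokens in ONE forward pass.
    p_all = target_model(tokens)
    base = len(prefix) - 1  # row index of the target distribution for drafted[0]

    # 3) Accept drafted token t with probability min(1, p(t)/q(t)); on the first
    #    rejection, resample from the residual distribution max(p - q, 0) and stop.
    accepted = []
    for i, t in enumerate(drafted):
        p, q = p_all[base + i], q_probs[i]
        if rng.random() < min(1.0, p[t] / max(q[t], 1e-20)):
            accepted.append(t)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted

    # All gamma drafted tokens accepted: sample one bonus token from the
    # target's distribution after the full drafted block.
    accepted.append(int(rng.choice(len(p_all[-1]), p=p_all[-1])))
    return accepted
```

Every run of accepted drafted tokens costs one target forward pass instead of one pass per token, which is where the speedup comes from.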
The overall speed is influenced by the acceptance rate $α$ (i.e., how often the target model agrees with the drafted tokens from the speculator) and the relative latency $c$ of the draft versus the target. Typically, larger speculators with more parameters yield higher acceptance rates due to their higher capacity but are slower to generate draft tokens. Progress therefore comes from both sides: aligning draft and target models to increase $α$ (training objectives, data, and algorithms) and designing draft models/kernels that keep $c$ low while maintaining $α$ (sparsity, quantization, lightweight & kernel-efficient architectures). The sweet spot is where a high $α$ meets a low $c$, minimizing end-to-end latency.
Speculative Decoding: Performance Analysis (interactive calculator: adjust the acceptance rate $\alpha$ and draft speed $c$ to observe the resulting speedup)

$$\text{Speedup Factor} = \frac{1 - \alpha^{(\gamma+1)}}{(1 - \alpha)(\gamma c + 1)}$$
Figure 2: Interactive speedup calculator for speculative decoding. Different optimization techniques influence acceptance rate or draft latency, creating a complex optimization space to maximize speedup. The optimal lookahead (γ) itself varies with model configuration – strong speculators (high α, low c) continue gaining from longer lookaheads (γ = 5+), while weaker configurations (low α, high c) plateau early (γ = 3-4).
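As a quick sanity check on the figure, the speedup formula above is easy to evaluate directly. The sketch below (plain Python, with parameter values chosen only for illustration) compares a strong speculator configuration against a weak one across several lookahead values.

```python
def speculative_speedup(alpha: float, c: float, gamma: int) -> float:
    """Expected speedup factor: (1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma*c + 1))."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# A strong speculator (high acceptance, cheap drafts) keeps gaining as lookahead grows.
for gamma in (3, 5, 8):
    print(f"alpha=0.85, c=0.10, gamma={gamma}: {speculative_speedup(0.85, 0.10, gamma):.2f}x")

# A weaker configuration (low acceptance, expensive drafts) plateaus around gamma = 3-4
# and then regresses, since extra drafted tokens are mostly rejected but still cost time.
for gamma in (3, 5, 8):
    print(f"alpha=0.60, c=0.30, gamma={gamma}: {speculative_speedup(0.60, 0.30, gamma):.2f}x")
```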