The fastest reasoning LLM, powered by diffusion
Today, we're introducing Mercury 2 — the world's fastest reasoning language model, built to make production AI feel instant.
Why speed matters more now
Production AI isn't one prompt and one answer anymore. It's loops: agents, retrieval pipelines, and extraction jobs running in the background at volume. In loops, latency doesn't show up once. It compounds across every step, every user, every retry.
Yet current LLMs still share the same bottleneck: autoregressive, sequential decoding. One token at a time, left to right.
A new foundation: Diffusion for real-time reasoning
Mercury 2 doesn't decode sequentially. It generates responses through parallel refinement, producing multiple tokens simultaneously and converging over a small number of steps. Less typewriter, more editor revising a full draft at once. The result: >5x faster generation with a fundamentally different speed curve.
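To make the contrast concrete, here is a toy sketch of the parallel-refinement idea, not Mercury 2's actual sampler: a draft starts fully masked, and every position is updated in the same pass, with a subset converging each step. The commit probability, step cap, and use of a fixed target sequence are all illustrative assumptions standing in for what a real model would do.

```python
import random

def parallel_refine(target, max_steps=8, p_commit=0.7, seed=0):
    """Toy diffusion-style decoding: refine EVERY position in the same
    pass, committing a subset each step, instead of emitting one token
    per step left to right.

    `target` stands in for the sequence a real model would converge to;
    `p_commit` and `max_steps` are made-up illustrative numbers.
    """
    rng = random.Random(seed)
    draft = [None] * len(target)              # None == still masked
    for step in range(1, max_steps + 1):
        for i in range(len(draft)):           # all positions, one pass
            if draft[i] is None and rng.random() < p_commit:
                draft[i] = target[i]          # this position converged
        if all(tok is not None for tok in draft):
            return draft, step
    # step cap reached: commit any stragglers in a final pass
    draft = [target[i] if tok is None else tok
             for i, tok in enumerate(draft)]
    return draft, max_steps

sentence = "less typewriter more editor revising a full draft".split()
out, steps_used = parallel_refine(sentence)
# Autoregressive decoding would take len(sentence) == 8 sequential
# steps; here all 8 positions are refined together over a handful of
# passes rather than one pass per token.
```

The point of the sketch is the shape of the loop: the number of model calls tracks the number of refinement steps, not the sequence length, which is why generation time follows a different curve as outputs grow.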
That speed advantage also changes the reasoning trade-off. Today, higher intelligence means more test-time compute — longer chains, more samples, more retries — bought at the direct expense of latency and cost. Diffusion-based reasoning gets you reasoning-grade quality inside real-time latency budgets.
Mercury 2 at a glance
Mercury 2 shifts the quality-speed curve for production deployments: