
Consistency diffusion language models: Up to 14x faster, no quality loss


We introduce consistency diffusion language models (CDLM), a method that accelerates diffusion language model inference by combining consistency-based multi-token finalization with block-wise KV caching, achieving up to 14.5x latency speedups on math and coding tasks.

Diffusion Language Models (DLMs) are emerging as a promising alternative to autoregressive (AR) LMs. Instead of generating one token at a time, DLMs iteratively refine a partially masked sequence over multiple sampling steps, gradually transforming a fully masked sequence into clean text. This refinement process creates a compelling opportunity: it enables parallel generation, allowing the model to finalize multiple tokens per iteration and potentially achieve higher throughput than AR decoding. At the same time, it can exploit bidirectional context to unlock new capabilities such as text infilling and refinement.
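To make the refinement loop concrete, here is a minimal sketch of confidence-based parallel decoding in the style of masked-diffusion samplers generally, not CDLM's specific procedure; `model`, `mask_id`, and the per-step finalization schedule are illustrative stand-ins.

```python
import torch

def diffusion_decode(model, prompt_ids, gen_len, num_steps, mask_id):
    """Illustrative masked-diffusion sampler (not CDLM's exact procedure).

    Starts from a fully masked completion and, at each refinement step,
    finalizes the highest-confidence predictions among the positions that
    are still masked, so several tokens can be committed in parallel.
    """
    filler = torch.full((gen_len,), mask_id, dtype=prompt_ids.dtype, device=prompt_ids.device)
    seq = torch.cat([prompt_ids, filler])
    gen = slice(len(prompt_ids), len(seq))

    for step in range(num_steps):
        masked = seq[gen] == mask_id
        if not masked.any():
            break
        # A bidirectional model re-scores every position given the full sequence.
        logits = model(seq.unsqueeze(0)).squeeze(0)[gen]   # (gen_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(~masked, -1.0)             # only still-masked positions compete
        # Commit roughly an equal share of the remaining masked tokens each step.
        k = max(1, int(masked.sum()) // (num_steps - step))
        top = conf.topk(k).indices
        seq[gen].scatter_(0, top, pred[top])
    return seq
```

With `num_steps` equal to `gen_len` this degenerates to one-token-at-a-time decoding; the appeal of DLMs is that `num_steps` can be much smaller, committing several tokens per pass.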

Visualization of inference in CDLM, naive DLMs, and autoregressive (AR) models.

However, in practice, standard DLMs suffer from two major inefficiencies. [1]

KV caching incompatibility under full bidirectional attention.

Standard DLMs commonly use bidirectional (non-causal) attention, so the keys and values of every position depend on tokens that are still being refined. Attention over the full context must therefore be recomputed at every denoising step, which makes inference expensive and prevents standard KV caching (see the sketch after the second point).

High refinement step counts to maintain quality.

High-quality generation typically requires many denoising/refinement steps, often comparable to the generation length. Naively reducing the number of steps tends to degrade quality sharply.
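The toy comparison below illustrates the caching problem from the first point. It uses a single attention layer with random weights, purely for illustration: when one late position is updated, earlier positions' outputs stay fixed under causal masking but shift under full bidirectional attention, so nothing from earlier steps can be safely reused.

```python
import torch

torch.manual_seed(0)
n, d = 6, 16
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(x, causal):
    # Single-head attention with random weights, purely for illustration.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / d ** 0.5
    if causal:
        scores = scores.masked_fill(torch.ones(n, n).triu(1).bool(), float("-inf"))
    return scores.softmax(-1) @ v

x = torch.randn(n, d)
x2 = x.clone()
x2[-1] = torch.randn(d)  # a late, still-masked position gets refined

for causal in (True, False):
    drift = (attend(x, causal)[:-1] - attend(x2, causal)[:-1]).abs().max().item()
    print(f"causal={causal}: max change in earlier positions' outputs = {drift:.2e}")
# Causal: earlier outputs are unchanged, so their keys/values can be cached and reused.
# Bidirectional: earlier outputs shift too, so the full context must be recomputed each step.
```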

CDLM targets both bottlenecks through a post-training recipe that makes fewer-step inference reliable while enabling exact block-wise KV caching.
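The training recipe and exact sampler are described in the paper; as a rough sketch of how block-wise KV caching and few-step finalization could fit together at inference time, the code below assumes a hypothetical `model(ids, past_kv=...)` interface that attends bidirectionally within the current block and over the cached prefix.

```python
import torch

def blockwise_decode(model, prompt_ids, gen_len, block_size, steps_per_block, mask_id):
    """Sketch of block-wise decoding with a KV cache (hypothetical `model` interface).

    Assumes `model(ids, past_kv=...)` returns (logits, kv), attending
    bidirectionally within `ids` and over the cached prefix in `past_kv`.
    Finalized blocks are never revisited, so their cached keys/values stay exact.
    """
    _, past_kv = model(prompt_ids.unsqueeze(0), past_kv=None)  # cache the prompt once
    blocks = []

    for start in range(0, gen_len, block_size):
        size = min(block_size, gen_len - start)
        block = torch.full((1, size), mask_id,
                           dtype=prompt_ids.dtype, device=prompt_ids.device)

        # A small, fixed number of refinement steps per block (far fewer than `size`);
        # consistency-style post-training is what keeps these few-step updates reliable.
        for step in range(steps_per_block):
            masked = block == mask_id
            if not masked.any():
                break
            logits, _ = model(block, past_kv=past_kv)
            conf, pred = logits.softmax(-1).max(-1)
            conf = conf.masked_fill(~masked, -1.0)
            k = max(1, int(masked.sum()) // (steps_per_block - step))
            top = conf.topk(k, dim=-1).indices
            block.scatter_(1, top, pred.gather(1, top))

        # One more pass appends the finalized block's keys/values to the cache.
        _, past_kv = model(block, past_kv=past_kv)
        blocks.append(block)

    return torch.cat([prompt_ids.unsqueeze(0)] + blocks, dim=1)
```

Because finalized blocks are never revisited, their cached keys and values remain exact rather than approximate, which is what allows caching without a quality penalty; the consistency-style post-training is what lets `steps_per_block` stay far below the block length.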
