
Nvidia’s TiDAR experiment could speed up AI token generation using hybrid diffusion decoder — new research boasts big throughput gains, but limitations remain


As the AI race between companies, nations, and ideologies continues apace, Nvidia has released a paper describing TiDAR, a decoding method that merges two historically separate approaches to accelerating language model inference. Language models produce text one token at a time, where a token is a small chunk of text, such as a word fragment or punctuation mark.

Each token normally requires a full forward pass through the model, and that cost dominates the speed and expense of running today’s AI systems. If a model can safely produce several tokens per step without losing quality, it could mean faster response times, fewer GPU hours, and lower operating costs per request, all of which could add up to substantial savings for operators running large AI deployments on the latest Nvidia hardware.
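To make that cost concrete, here is a minimal sketch of a plain autoregressive decoding loop in Python. The forward and sampling functions are toy stand-ins, not the paper’s code or any library’s API, but the structure shows why each new token costs one full pass over the model.

import random

VOCAB_SIZE = 100

def forward(tokens):
    # Toy stand-in for a full forward pass: a real model would load all of
    # its weights (and the KV cache) from memory for every call.
    return [random.random() for _ in range(VOCAB_SIZE)]

def sample(logits):
    # Greedy pick; real systems usually sample from the distribution.
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(prompt_ids, max_new_tokens):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(tokens)       # one full pass through the model...
        tokens.append(sample(logits))  # ...produces exactly one new token
    return tokens

print(generate([1, 2, 3], max_new_tokens=5))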

The TiDAR study focuses on batch-one decoding and reports between four and six times higher token throughput than the Qwen2.5 and Qwen3 baselines used for comparison. The researchers evaluate 1.5 billion and 8 billion parameter models and show that speed gains can be achieved without measurable degradation on coding and math benchmarks. Although the work sits firmly in the research stage, it demonstrates why a GPU processing a single sequence can often compute more than one token’s worth of work per step without paying extra latency.

The paper joins a wave of research that attempts to exploit the imbalance between memory movement and computation during autoregressive decoding. On an H100, next-token generation is typically limited by the cost of loading model weights and KV cache from High Bandwidth Memory (HBM). Nvidia highlights this through a latency profile of Qwen3-32B: When the number of predicted token positions grows, total pass time barely shifts until the GPU becomes compute-bound.

Those unused regions of the token dimension effectively become “free slots”. TiDAR is built around the question of how much useful work a model can do inside those slots while preserving the shape of well-behaved next-token predictors.
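A rough back-of-the-envelope estimate illustrates the point. The numbers below are round assumptions for an H100-class GPU and an 8-billion-parameter model in FP16, not measurements from the paper: because the per-step time is dominated by streaming the weights from HBM, adding a handful of extra token positions barely changes it.

# Illustrative roofline-style estimate for batch-one decoding.
# All figures are round assumptions, not measurements from the paper.
params = 8e9                 # ~8B-parameter model
weight_bytes = params * 2    # FP16: two bytes per parameter
hbm_bandwidth = 3.35e12      # ~3.35 TB/s, roughly H100 HBM3
peak_flops = 1e15            # ~1 PFLOP/s dense FP16/BF16, roughly H100

memory_time = weight_bytes / hbm_bandwidth  # streaming the weights once per step

for positions in (1, 4, 8, 16):
    compute_time = 2 * params * positions / peak_flops  # ~2 FLOPs per parameter per position
    step_time = max(memory_time, compute_time)
    print(f"{positions:2d} positions: step ~{step_time * 1e3:.2f} ms "
          f"(memory {memory_time * 1e3:.2f} ms, compute {compute_time * 1e3:.2f} ms)")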

Designed to satisfy two distributions at once


TiDAR trains a single transformer to compute both an autoregressive next-token distribution and a diffusion-style marginal distribution in parallel. This is not how diffusion language models typically work. Prior systems such as Dream, LLaDA, and Block Diffusion rely entirely on parallel denoising of masked blocks. The benefit is speed, but accuracy drops as block lengths grow because the model no longer maintains a strict chain factorization. TiDAR attempts to recover that structure without giving up diffusion’s parallelism.
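One way to picture the dual objective is a single forward pass that feeds two losses: an autoregressive next-token loss on causal positions and a masked-prediction loss on the diffusion block. The PyTorch sketch below is only an illustration of that idea, with equal loss weighting and pre-aligned logits as assumptions, not the paper’s exact formulation.

import torch
import torch.nn.functional as F

def dual_objective_loss(ar_logits, diff_logits, targets, masked):
    # Sketch of a combined objective computed from one forward pass.
    # Assumptions (not from the paper): logits are already aligned so that
    # position i predicts targets[i], and the two losses get equal weight.
    #   ar_logits:   [seq, vocab] from the autoregressive head
    #   diff_logits: [seq, vocab] from the diffusion head
    #   targets:     [seq] ground-truth token ids
    #   masked:      [seq] bool, True where the input token was replaced by a mask
    ar_loss = F.cross_entropy(ar_logits[~masked], targets[~masked])    # next-token prediction
    diff_loss = F.cross_entropy(diff_logits[masked], targets[masked])  # parallel denoising
    return ar_loss + diff_loss

# Shape-only usage example with random tensors and the last block masked.
seq, vocab, block = 12, 32, 4
targets = torch.randint(vocab, (seq,))
masked = torch.arange(seq) >= seq - block
ar_logits = torch.randn(seq, vocab)
diff_logits = torch.randn(seq, vocab)
print(dual_objective_loss(ar_logits, diff_logits, targets, masked))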

This is achieved with a structured attention mask dividing the input into three regions. The accepted prefix behaves like any causal sequence and provides keys and values that the model caches between steps. A block of previously drafted tokens uses bidirectional attention, letting the model verify them under the autoregressive distribution. A second block filled with mask tokens awaits the diffusion predictor, which proposes several new draft candidates in parallel.
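The sketch below shows how such a three-region mask could be assembled as a boolean matrix, with the regions as the article describes them. The exact connectivity between the drafted and masked blocks in the paper may differ, so treat this as an illustration rather than a reference implementation.

import numpy as np

def build_block_mask(prefix_len, draft_len, mask_len):
    # True means "query row may attend to key column".
    total = prefix_len + draft_len + mask_len
    mask = np.zeros((total, total), dtype=bool)
    p_end = prefix_len
    d_end = prefix_len + draft_len

    # 1) Accepted prefix: ordinary causal (lower-triangular) attention,
    #    so its keys and values can be cached between steps.
    mask[:p_end, :p_end] = np.tril(np.ones((p_end, p_end), dtype=bool))

    # 2) Previously drafted block: sees the full prefix and attends
    #    bidirectionally within itself so the drafts can be verified.
    mask[p_end:d_end, :p_end] = True
    mask[p_end:d_end, p_end:d_end] = True

    # 3) Mask-token block: sees everything before it and attends
    #    bidirectionally within itself for parallel diffusion drafting.
    mask[d_end:, :d_end] = True
    mask[d_end:, d_end:] = True
    return mask

print(build_block_mask(prefix_len=4, draft_len=2, mask_len=2).astype(int))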

Decoding then becomes a two-stage loop. First, the diffusion head fills the masked region. On the next pass, the model checks those drafts using its autoregressive head. Accepted tokens extend the prefix. Rejected ones are handled in the same step, because the model has learned to anticipate every acceptance path from the previous round. In that same pass, the diffusion head drafts the next block. The key to the scheme is that the prefix’s causal structure ensures the KV cache remains valid, solving one of the primary deployment problems faced by earlier diffusion-based decoders.
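In outline, the loop looks something like the toy sketch below, where a single toy_forward call stands in for the one TiDAR pass that verifies the previous drafts and proposes the next block. Acceptance here is randomized purely for illustration, and the handling of rejected positions is simplified.

import random

BLOCK = 4  # number of drafted positions per step (an illustrative choice)

def toy_forward(prefix, drafts):
    # Stand-in for the single TiDAR pass: the real model would verify the
    # previous drafts with its autoregressive head and propose a fresh block
    # with its diffusion head in the same forward pass. Here both outputs
    # are faked so the control flow can run on its own.
    accept_flags = [random.random() < 0.7 for _ in drafts]
    new_drafts = [random.randint(0, 99) for _ in range(BLOCK)]
    return accept_flags, new_drafts

def decode_step(prefix, drafts):
    accept_flags, new_drafts = toy_forward(prefix, drafts)
    # Commit drafts left to right until the first rejection, so the accepted
    # text still follows the autoregressive factorization and the prefix’s
    # KV cache stays valid.
    accepted = []
    for token, ok in zip(drafts, accept_flags):
        if not ok:
            break
        accepted.append(token)
    return prefix + accepted, new_drafts

prefix, drafts = [1, 2, 3], [10, 11, 12, 13]
for step in range(3):
    prefix, drafts = decode_step(prefix, drafts)
    print(f"step {step}: prefix now {len(prefix)} tokens")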
