
Defeating Nondeterminism in LLM Inference


Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models.

For example, you might observe that asking ChatGPT the same question multiple times provides different results. This by itself is not surprising, since getting a result from a language model involves “sampling”, a process that converts the language model’s output into a probability distribution and probabilistically selects a token.
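To make "sampling" concrete, here is a minimal sketch of temperature sampling. The function name and vocabulary size are illustrative, not taken from any particular inference engine:

import torch

# A minimal sketch of temperature sampling: logits are scaled by
# 1/temperature, turned into a probability distribution with softmax,
# and a token is drawn at random from that distribution.
def sample_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    probs = torch.softmax(logits / temperature, dim=-1)    # logits -> distribution
    return torch.multinomial(probs, num_samples=1).item()  # random draw

logits = torch.randn(32000)  # hypothetical 32,000-token vocabulary
print(sample_token(logits))  # may differ from call to call: sampling is random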

What might be more surprising is that even when we adjust the temperature down to 0 (thus making the sampling theoretically deterministic, since the LLM then always chooses the highest-probability token, a strategy called greedy sampling), LLM APIs are still not deterministic in practice (see past discussions here, here, or here). Even when running inference on your own hardware with an OSS inference library like vLLM or SGLang, sampling still isn’t deterministic (see here or here).
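For reference, greedy sampling is just an argmax over the logits, which is trivially deterministic given bitwise-identical logits; this hints that any nondeterminism must enter upstream of the sampler. A minimal sketch (vocabulary size illustrative):

import torch

# Greedy (temperature-0) sampling: always pick the highest-probability token.
# Given bitwise-identical logits, this always returns the same token,
# so in theory the decode is fully deterministic.
logits = torch.randn(32000)             # hypothetical vocabulary
next_token = int(torch.argmax(logits))  # same logits -> same token, every time
print(next_token)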

But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism. For example, a recent arXiv preprint writes:

Floating-point arithmetic in GPUs exhibits non-associativity, meaning $(a + b) + c \neq a + (b + c)$ due to finite precision and rounding errors. This property directly impacts the computation of attention scores and logits in the transformer architecture, where parallel operations across multiple threads can yield different results based on execution order.
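Non-associativity itself is easy to demonstrate. A minimal example in plain Python (IEEE-754 float64 on the CPU, though the same effect applies to GPU floats like bfloat16):

a, b, c = 1e20, -1e20, 1.0
print((a + b) + c)  # 1.0: a and b cancel exactly, then c survives
print(a + (b + c))  # 0.0: c is rounded away inside (b + c), then a cancels b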

You can also find the “concurrency + floating point” hypothesis repeated by others, like here (“There are speed tradeoffs, and in order to make the endpoints fast GPUs are used, which do parallel [nondeterministic] calculations. Any modern GPU neural net calculations will be subject to these."), or here (“Because GPUs are highly parallelized, the ordering of additions or multiplications might be different on each execution, which can cascade into small differences in output.").
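The "ordering of additions" part of the claim is also easy to check in isolation: summing the same values in a different order can change the floating-point result. A small sketch (the exact outcome depends on the random values, so the two sums may occasionally coincide):

import torch

# Sum the same values in two different orders.
x = torch.randn(1_000_000)
s1 = x.sum()
s2 = x[torch.randperm(x.numel())].sum()  # identical values, permuted order
# Floating-point addition is not associative, so these usually differ.
print(s1.item(), s2.item(), bool(s1 == s2))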

While this hypothesis is not entirely wrong, it doesn’t reveal the full picture. For example, even on a GPU, running the same matrix multiplication on the same data repeatedly will always provide bitwise equal results. We’re definitely using floating-point numbers. And our GPU definitely has a lot of concurrency. Why don’t we see nondeterminism in this test?

import torch

# Run the same matmul on the same inputs 1,000 times:
# every run is bitwise identical to the reference result.
A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
ref = torch.mm(A, B)
for _ in range(1000):
    assert (torch.mm(A, B) - ref).abs().max().item() == 0

To understand the true cause of LLM inference nondeterminism, we must look deeper.
