Seven months after its founding, the multibillion-dollar unicorn breaks its public silence with a long article on solving the nondeterminism problem in LLM inference.
Thinking Machines Lab has finally pulled out all the stops!
Just now, co-founder and former OpenAI vice president Lilian Weng revealed:
The first-generation flagship product of Thinking Machines is named "Connection Machine".
Here's what happened: Today, Thinking Machines launched a research blog column called "Connectionism" and published its first blog post titled "Defeating Nondeterminism in LLM Inference".
Thinking Machines introduced:
We believe that science is better when it is shared.
The Connectionism column will evolve with our research, ranging from core numerical calculations to prompt engineering. Here, we share our progress and engage in frequent and open communication with the research community.
The lab also noted that the name "Connectionism" traces back to the early days of AI: in the 1980s, the term referred to a subfield devoted to neural networks and their similarities to biological brains.
Lilian Weng dropped an even bigger bombshell: there is another reason for the column's name. The first-generation flagship model is called Connection Machine, and beyond this blog post, more good stuff is on the way!
Could it be that Thinking Machines is about to release a new model?
Before we look forward to the new LLM, let's first see what tricks Thinking Machines has up its sleeve and which research areas they focus on.
Portal: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
According to Horace He, the main author of the blog post, the piece focuses on a topic that has long been on his mind -
reproducible floating-point numerics in LLM inference.
The problem of nondeterminism in LLM inference
Reproducibility is the cornerstone of scientific progress. However, obtaining reproducible results from large language models is extremely difficult.
For example, you may notice that asking ChatGPT the same question multiple times can yield different results.
This isn't surprising in itself because obtaining results from a language model involves "sampling":
Converting the output of the language model into a probability distribution and probabilistically selecting a token.
What may be even more surprising is that even when we set the temperature to 0 (making the sampling theoretically deterministic), LLM APIs are still not deterministic in practice.
Even when running inference on your own hardware using open-source inference libraries like vLLM or SGLang, sampling is still not deterministic.
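As a minimal illustration of what "sampling" means here (our own sketch, not code from the post; the function name is made up), temperature scaling divides the logits before the softmax, and temperature 0 is conventionally defined as taking the argmax, which is why temperature-0 decoding should be deterministic in theory:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float) -> int:
    """Pick the next token id from a vector of logits."""
    if temperature == 0.0:
        # Greedy decoding: temperature 0 means taking the argmax,
        # which is deterministic given bitwise-identical logits.
        return int(torch.argmax(logits))
    # Otherwise scale the logits, convert them to a probability
    # distribution, and draw a token from it.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.5])
print(sample_next_token(logits, temperature=0.0))  # always 0
print(sample_next_token(logits, temperature=1.0))  # varies run to run
```

The catch the article goes on to develop is the conditional in that comment: argmax decoding is only deterministic if the logits themselves come out bitwise identical on every run.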
But why aren't LLM inference engines deterministic?
A common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first.
This study refers to this as the "concurrency + floating point" hypothesis for LLM inference nondeterminism.
For example, Chinese researchers Jiayi Yuan, Hao Li, Xinheng Ding, et al. recently uploaded an arXiv preprint that states:
Floating-point operations in GPUs exhibit non-associativity, meaning (a + b) + c ≠ a + (b + c) due to limited precision and rounding errors.
This property directly affects the calculation of attention scores and logits in the Transformer architecture, where parallel operations across multiple threads may produce different results depending on the execution order.
Portal: https://arxiv.org/abs/2506.09501
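The non-associativity the preprint refers to is easy to see with a toy example of ours (not taken from the paper):

```python
# Floating-point addition is not associative: the rounding error
# depends on the order in which values are combined.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)  # 0.0 -- a is absorbed when added to the huge b, then b and c cancel
print(a + (b + c))  # 0.1 -- b and c cancel first, so a survives intact
```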
While this hypothesis makes some sense, it doesn't tell the whole story.
For example, even on a GPU, repeatedly running the same matrix multiplication on the same data always produces bitwise-identical results.
We are indeed using floating-point numbers, and GPUs do run a large amount of concurrent computation. So why don't we see nondeterminism in this test?
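A sketch of that check in the same spirit (our own toy version, requiring a CUDA GPU): run the same matrix multiplication repeatedly and compare the results bitwise.

```python
import torch

if torch.cuda.is_available():
    torch.manual_seed(0)
    A = torch.randn(2048, 2048, device="cuda")
    B = torch.randn(2048, 2048, device="cuda")

    ref = A @ B
    # Run-to-run, the same kernel on the same inputs is bitwise
    # reproducible, despite floating point and massive concurrency.
    print(all(torch.equal(A @ B, ref) for _ in range(100)))  # expected: True
```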
To understand the real culprit behind LLM inference nondeterminism, we must dig deeper.
Unfortunately, even defining the determinism of LLM inference is not easy.
Perhaps confusingly, the following statements can all be true at the same time:
Some kernels on the GPU are nondeterministic.
However, all the kernels used in a language model's forward pass are deterministic.
Moreover, the forward pass of an LLM inference server (such as vLLM) can also claim to be deterministic.
Yet from the perspective of a user of the inference service, the results are nondeterministic.
This time, Thinking Machines decided to uncover the real culprit behind LLM inference nondeterminism and explain how to overcome nondeterminism in LLM inference and obtain truly reproducible results.
Key findings:
An LLM forward pass does not require atomic adds; the real source of nondeterminism is variation in batch size (which shifts with server load), not contention between concurrently executing atomic additions (see the sketch after this list).
To make an inference service deterministic, every kernel used in the Transformer forward pass must itself be batch-invariant: its output for a given element must not depend on how many other elements are in the batch.
Fortunately, every pointwise operation can be assumed to be batch-invariant, so the only operations to worry about are the three that involve reductions: RMSNorm, matrix multiplication, and attention.
They are listed in order of increasing implementation difficulty, and each requires some extra consideration to achieve batch invariance at reasonable performance.
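The "batch size variation" point can be demonstrated with an experiment in this spirit (a paraphrased sketch of ours, not the blog's exact code): multiply the same row by the same matrix, once on its own and once inside a larger batch, and compare bitwise.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
B = torch.randn(2048, 2048, device=device)
x = torch.randn(1, 2048, device=device)
batch = torch.cat([x, torch.randn(31, 2048, device=device)], dim=0)

out_alone = x @ B               # the row processed with batch size 1
out_in_batch = (batch @ B)[:1]  # the same row processed inside a batch of 32

# Each call is individually deterministic, but the two results may differ
# bitwise: the kernel can pick a different reduction strategy for different
# batch sizes, i.e. it is not batch-invariant.
print(torch.equal(out_alone, out_in_batch))
print((out_alone - out_in_batch).abs().max())
```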
Batch-invariant RMSNorm: Data parallel RMSNorm
Ideally, we want to avoid communication between cores in the parallelization strategy.
One way to achieve this is to assign one batch element to each core, ensuring that each reduction runs entirely within a single core.
This is the so-called "data parallel" strategy because we are only parallelizing along a dimension that does not require communication.
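A reference-level sketch of what "each reduction stays within one batch element" means for RMSNorm (plain PyTorch standing in for the actual kernel):

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm whose reduction runs only over the hidden dimension.

    In a data-parallel kernel each batch element (row of x) would be
    assigned to one core, so the mean-of-squares reduction for a row
    never depends on which other rows happen to be in the batch.
    """
    # Reduce over the last (hidden) dimension only.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

x = torch.randn(4, 1024)     # (batch, hidden)
w = torch.ones(1024)
print(rms_norm(x, w).shape)  # torch.Size([4, 1024])
```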
Batch-invariant matrix multiplication: Data parallel Matmul
Similar to RMSNorm, the standard parallel strategy for matrix multiplication is a "data parallel" strategy that keeps the entire reduction within a single core.
The most straightforward way to think about it is to split the output tensor into two-dimensional tiles and assign each tile to a different core. Then, each core calculates the dot product belonging to that tile, again performing the entire reduction within a single core.
Unlike RMSNorm, however, constraints around arithmetic intensity and the use of tensor cores mean that an efficient matrix multiplication kernel must operate on two-dimensional tiles rather than on individual output elements.
The key to solving this lies in the fact that you can view matrix multiplication as a pointwise operation followed by a reduction.
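A toy version of the "one output tile per core, full reduction inside the tile" idea (pure Python loops standing in for CUDA thread blocks; the tile sizes are illustrative, not the kernel's actual configuration):

```python
import torch

def tiled_matmul(A: torch.Tensor, B: torch.Tensor,
                 tile_m: int = 128, tile_n: int = 128) -> torch.Tensor:
    """Compute A @ B by splitting the *output* into 2D tiles.

    Each (i, j) tile plays the role of one core's work: it performs the
    entire reduction over the K dimension by itself, so no cross-core
    communication (and no atomic adds) is needed.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    out = torch.empty(M, N, dtype=A.dtype)
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            # Full K-reduction for this output tile, done locally.
            out[i:i + tile_m, j:j + tile_n] = A[i:i + tile_m, :] @ B[:, j:j + tile_n]
    return out

A = torch.randn(256, 512)
B = torch.randn(512, 384)
print(torch.allclose(tiled_matmul(A, B), A @ B, atol=1e-5))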
The simplest way to make matrix multiplication batch-invariant is to compile a single kernel configuration and use it for every shape.
Although this costs some performance, the loss is usually not catastrophic for large language model inference:
it gives up only about 20% of performance relative to cuBLAS.
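Conceptually, the fix is to stop letting the kernel configuration depend on the input shape. The sketch below is our own illustration with hypothetical configuration fields, not cuBLAS or the post's actual dispatch logic:

```python
# Shape-dependent dispatch (batch-variant): a small batch may select a
# different tile/split configuration than a large one, changing the
# reduction order and therefore the bits of the result.
def pick_config_heuristic(m: int) -> dict:
    return {"tile_m": 16, "split_k": 4} if m < 64 else {"tile_m": 128, "split_k": 1}

# Batch-invariant alternative: compile one configuration and use it for
# every shape. Per the post, this gives up roughly 20% of performance
# versus cuBLAS, but the reduction order no longer depends on how many
# other requests happened to share the batch.
FIXED_CONFIG = {"tile_m": 128, "split_k": 1}

def pick_config_invariant(m: int) -> dict:
    return FIXED_CONFIG
```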
Batch-invariant attention mechanism
After achieving batch invariance for matrix multiplication, the attention mechanism introduces two additional challenges - fittingly, because it contains two matrix multiplications.
Unlike RMSNorm and matrix multiplication, which reduce only over the feature dimension, attention reduces over both the feature dimension and the sequence dimension.
Because of this, the attention kernel must also contend with inference optimizations, such as chunked prefill and prefix caching, that change how a sequence is processed.
FlashAttention with a KV cache breaks batch invariance. The root cause is that the "cached KV" and the "current KV" are processed separately.
Different numbers of KV blocks → different mask/full block combinations → different reduction paths.
The problem can be solved by updating the KV-cache page table consistently before the kernel launches, so that the KV layout is the same at any point in time.
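Why the reduction path matters can be made concrete with a small online-softmax sketch of ours (not FlashAttention itself): the same attention output computed over differently sized KV blocks is mathematically identical but need not match bitwise.

```python
import torch

torch.manual_seed(0)
q = torch.randn(64)        # one query vector (head dim 64)
K = torch.randn(1000, 64)  # keys for 1000 KV positions
V = torch.randn(1000, 64)  # values for 1000 KV positions

def attention_in_chunks(split: int) -> torch.Tensor:
    """Softmax attention computed over KV chunks and merged with the
    usual running-max / rescaling trick (an online-softmax sketch)."""
    out = torch.zeros(64)
    denom = torch.tensor(0.0)
    running_max = torch.tensor(float("-inf"))
    for s in range(0, 1000, split):
        scores = (K[s:s + split] @ q) / 8.0  # 1/sqrt(d) with d = 64
        chunk_max = scores.max()
        new_max = torch.maximum(running_max, chunk_max)
        correction = torch.exp(running_max - new_max)
        weights = torch.exp(scores - new_max)
        out = out * correction + weights @ V[s:s + split]
        denom = denom * correction + weights.sum()
        running_max = new_max
    return out / denom

# Same math, different chunking of the KV reduction: the results need not
# be bitwise identical, because floating-point addition is not associative.
a = attention_in_chunks(1000)  # e.g. all KV handled in one block
b = attention_in_chunks(128)   # e.g. cached KV processed in pages
print(torch.equal(a, b), (a - b).abs().max())
```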
The attention shapes seen in large language model inference usually do require a split reduction kernel, commonly known as Split-KV or FlashDecoding.
Unfortunately, a Split-KV strategy with a fixed number of splits (i.e., FlashDecoding) also breaks batch invariance.
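The distinction, as we read the post, is between fixing the *number* of splits, where the split boundaries move as the KV length grows, and fixing the split *size*, where boundaries sit at fixed absolute positions and the reduction grouping for a given token never changes. A toy sketch with hypothetical parameter names:

```python
def split_boundaries_fixed_count(kv_len: int, num_splits: int = 4) -> list:
    """FlashDecoding-style: the split size depends on kv_len, so the
    reduction grouping for a given token changes as the sequence grows."""
    size = -(-kv_len // num_splits)  # ceiling division
    return list(range(0, kv_len, size))

def split_boundaries_fixed_size(kv_len: int, split_size: int = 256) -> list:
    """Batch-invariant alternative: boundaries sit at fixed absolute
    positions, so a token is always reduced inside the same block
    regardless of how long the sequence has become."""
    return list(range(0, kv_len, split_size))

print(split_boundaries_fixed_count(1000))  # [0, 250, 500, 750]
print(split_boundaries_fixed_count(1024))  # [0, 256, 512, 768] -- boundaries moved
print(split_boundaries_fixed_size(1000))   # [0, 256, 512, 768]
print(split_boundaries_fixed_size(1024))   # [0, 256, 512, 768] -- boundaries stable
```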