Milestone moment: a 100B-parameter diffusion language model has reached 892 tokens/second, a sign that an alternative path for AI has been successfully opened up.
The diffusion language model (dLLM), a research direction once dismissed as a "niche track," has finally made a qualitative leap.
Last Monday, LLaDA2.1 was quietly released on Hugging Face, just two months after the previous version, LLaDA2.0. The release includes two models: LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B).
As the benchmark in this field, every iteration of LLaDA shapes the development of the entire direction. This time, LLaDA2.1 has almost single-handedly completed the diffusion language model's "coming-of-age ceremony." A peak speed of 892 tokens/second turns the theoretical efficiency advantage into reality for the first time; a mechanism that corrects errors while generating breaks the curse of "fast but inaccurate"; add to that a switchable dual mode and the first successful reinforcement-learning post-training... The signal is clear: this once-niche academic route has grown into a genuinely useful tool, and a more efficient one at that.
To date, autoregressive models that generate the next token one at a time remain the mainstream. In long-text generation, however, high computational cost and slow inference are only the obvious problems. The real and often overlooked issue is that the model can only guess forward in one direction, never seeing the subsequent context, and once it makes a mistake it cannot go back to correct it; errors accumulate like a snowball. These difficulties are the elephant in the room, permanently blocking the way to large-scale application.
LLaDA2.1's answer is straightforward: rather than making minor improvements within the old framework, change the underlying logic. Let the model generate in parallel, the way one fills in a cloze test, and refine repeatedly, turning "no regrets after writing" into "correcting while writing."
How this mechanism works in detail is laid out in the technical report jointly authored by Ant Group, Zhejiang University, Westlake University, and Southern University of Science and Technology.
- Paper URL: https://github.com/inclusionAI/LLaDA2.X/blob/main/llada2_1_tech_report.pdf
- Hugging Face: https://huggingface.co/collections/inclusionAI/llada21
- ModelScope: https://modelscope.cn/collections/inclusionAI/LLaDA21
- GitHub: https://github.com/inclusionAI/LLaDA2.X
- Tech Report: https://huggingface.co/papers/2602.08676
Another Path Beyond Autoregression
To understand the breakthrough of LLaDA2.1, we must start with the "underlying logic conflict" of current AI models.
In the world of mainstream large AI models (such as GPT and Claude), the autoregressive architecture is the absolute dominant force.
It follows a strict paradigm of generating one token at a time: each output step becomes a fixed condition for the next. The generation path is like a one-way railway track; once written, it cannot be retraced. For example, if the model writes "One cannot walk into the same river twice" and later realizes it should have written "step into" rather than "walk into," it can only carry the mistake forward.
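To make the one-way nature concrete, here is a minimal greedy autoregressive decoding loop in PyTorch (`model` and `eos_id` are hypothetical placeholders with a Hugging-Face-style interface, not LLaDA's actual API): every sampled token is appended to the context and is never revisited.

```python
import torch

def autoregressive_decode(model, input_ids, max_new_tokens=64, eos_id=2):
    """Greedy left-to-right decoding: each token becomes a frozen condition."""
    ids = input_ids.clone()                       # shape (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids).logits                # one forward pass per new token
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)   # appended, never edited again
        if next_id.item() == eos_id:
            break
    return ids
```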
This method has natural advantages in stability and controllability, but the cost is also obvious. Because inference is essentially sequential, the model struggles to decode in parallel at scale; generation latency grows with context length and output size, gradually becoming a key constraint on inference efficiency and deployment cost. More importantly, the paradigm is structurally "slow but stable," leaving little room for a major increase in speed and throughput.
Against this backdrop, diffusion language models have come to be seen as an alternative route with breakthrough potential. They no longer stick to left-to-right generation but attempt to generate multiple tokens simultaneously across the whole sequence.
However, high parallelism often comes with a high error rate. Early diffusion models usually adopted a fixed "mask-to-token" (M2T) path. The mechanism is fast, but it has a clear drawback: once the model commits a token it is not confident about, it has no way to correct it in later steps, so the commitment threshold has to stay conservative, which ultimately drags down overall inference speed and still leaves output quality fragile.
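A schematic of this fixed mask-to-token schedule (a simplified sketch with a hypothetical `model` that scores all positions in one forward pass, and a made-up `MASK_ID`): at each step only predictions above a confidence threshold are committed, and a committed token can never be revised afterwards.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token

def fixed_m2t_decode(model, ids, threshold=0.9, max_steps=32):
    """Parallel mask-to-token decoding with no editing: commits are final."""
    for _ in range(max_steps):
        masked = ids == MASK_ID
        if not masked.any():
            break
        probs = model(ids).logits.softmax(-1)
        conf, pred = probs.max(-1)                     # per-position confidence
        commit = masked & (conf >= threshold)          # only confident masked slots
        if not commit.any():                           # always make some progress
            best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
            commit = torch.zeros_like(masked)
            commit.view(-1)[best] = True
        ids = torch.where(commit, pred, ids)           # committed tokens are frozen
    return ids
```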
This structural tension between speed and quality has kept diffusion language models in the research stage for a long time, and hard to deploy in practical systems.
Against this background, LLaDA2.0, previously proposed by the Ant team, demonstrated the feasibility of a diffusion language model with tens of billions of parameters under large-scale parallel decoding. But the paper also frankly noted that achieving a controllable, stable balance between speed and generation quality remained an unsolved problem.
LLaDA2.1 is a direct response to this core contradiction. Rather than simply adding parameters or chasing leaderboard rankings, the team made systematic changes to the decoding mechanism, training paradigm, and engineering stack, letting the diffusion language model truly cross the threshold from "able to run" to "useful."
The Path of dLLM is Cleared
Let's start with the result: on complex programming tasks, the 100B-parameter (hundred-billion) version of LLaDA2.1 reached an astonishing peak speed of 892 tokens/second.
What really makes this result worth attention is the premise: this is a model at the 100B scale.
For many researchers, scaling up the dLLM is a well-recognized hard problem. Mainstream approaches include training from scratch, migrating capabilities from autoregressive models, and optimizing performance and efficiency in the post-training stage. The first two routes are limited by data scale, training efficiency, and compute cost, so model sizes have generally stayed between a few billion and about 30 billion parameters. The post-training direction has made initial breakthroughs in code, planning, and inference acceleration, but it is still early days, and how to scale in a coordinated way to the hundred-billion-parameter level remains an open question.
The 100B scale of LLaDA2.1 therefore already breaks through this route's long-standing size ceiling, and it is against this backdrop that the 892 tokens/second figure matters most: it was achieved not on an easily accelerated small model, but in the heaviest, most difficult scale range for diffusion models.
More importantly, the speed does not come from simplified tasks or short-text generation. It appears on a complex programming benchmark, HumanEval+, where the model must handle long contexts while keeping its logic consistent and its syntax correct, a scenario in which inference efficiency is usually the first thing to be sacrificed.
Behind this lies a set of systematic adjustments made by the Ant team to address the long-standing bottlenecks of diffusion language models.
"Draft - Edit" like a Human Expert
First, LLaDA2.1 introduces the Error-Correcting Editable (ECE) mechanism: the model drafts the entire answer with lightning-fast sampling in milliseconds, then goes back to check and correct it.
Take the "One cannot walk into the same river twice" example above: when the model notices that "walk into" is wrong, it immediately changes it to "step into," something autoregressive models simply cannot do. LLaDA2.1 abandons the rigid "write to the end" mode and splits generation into two steps:
- Step 1: Rapid Drafting. The model generates a "draft" in parallel at an extremely high speed, allowing a certain degree of uncertainty in this stage.
- Step 2: Intelligent Editing. The model immediately switches into "editing" mode, globally re-evaluating and self-correcting the draft: if an error is found, it goes back and fixes it; if a better expression is found, it swaps it in.
This paradigm covers two types of operations: direct decoding from mask to token, and editing from one already-written token to another. The strategy lets the model refine its own output during generation, effectively addressing the local inconsistencies common in parallel decoding. To cultivate this editing ability, the team exposed the model to both masked positions and random noise during continuous pre-training (CPT) and supervised fine-tuning (SFT), encouraging it not only to generate new content but also to identify and correct existing errors.
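One plausible way to picture that training recipe (the report's exact corruption scheme may differ; `MASK_ID`, `VOCAB`, and the ratios below are illustrative) is that each clean training sequence is corrupted in two ways at once: some positions are replaced by the mask token and must be generated, while a smaller fraction are swapped for random wrong tokens the model must learn to detect and overwrite.

```python
import torch

MASK_ID, VOCAB = 0, 32000  # hypothetical special-token id and vocabulary size

def corrupt_for_editing(ids, mask_ratio=0.5, noise_ratio=0.1):
    """Build an editing-style training input: masks to fill plus wrong tokens to fix."""
    r = torch.rand_like(ids, dtype=torch.float)
    inputs = ids.clone()
    inputs[r < mask_ratio] = MASK_ID                              # positions to generate
    noisy = (r >= mask_ratio) & (r < mask_ratio + noise_ratio)
    inputs[noisy] = torch.randint(1, VOCAB, (int(noisy.sum()),))  # positions to correct
    return inputs, ids                                            # target is the clean sequence
```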
The key point is that this architecture turns the rigid trade-off between latency and generation quality into a continuous space that users can configure flexibly. Because the model can retroactively correct its own output, the confidence threshold in the initial mask-to-token (M2T) stage can be lowered substantially without collapsing generation quality.
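Putting the two phases together, a decoding loop in the spirit of ECE (again a simplified sketch, reusing the `fixed_m2t_decode` helper above rather than the released implementation) would first commit a low-threshold parallel draft, then run a few editing passes that rewrite any committed token the model now strongly disagrees with.

```python
import torch

def ece_decode(model, ids, draft_threshold=0.3, edit_threshold=0.9, edit_steps=4):
    """Draft fast with a low M2T threshold, then edit token-to-token."""
    ids = fixed_m2t_decode(model, ids, threshold=draft_threshold)  # 1. rapid draft
    for _ in range(edit_steps):                                    # 2. intelligent editing
        probs = model(ids).logits.softmax(-1)
        conf, pred = probs.max(-1)
        edits = (pred != ids) & (conf >= edit_threshold)           # confident disagreements
        if not edits.any():
            break
        ids = torch.where(edits, pred, ids)                        # rewrite in place
    return ids
```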
Single Model with Dual Modes, Returning the Choice to Users
LLaDA2.1 also makes a bolder design choice: a single model supports two modes, a quality mode (Q) and an extreme-speed mode (S).
- Speed Mode (S): aggressively lowers the confidence threshold for the initial draft, produces it quickly, and relies on the subsequent editing pass to safeguard quality. Suited to code generation, rapid iteration, and brainstorming.
- Quality Mode (Q): adopts a conservative strategy, raising the bar for the initial draft so that less needs to be corrected afterwards. Suited to formal documents, academic writing, and high-precision tasks.
Previously, LLaDA-MoE and LLaDA 2.0 needed secondary development to ship separate accelerated variants, such as acceleration based on path distillation. These variants did speed things up relative to the base model, but usually at a serious cost in accuracy, and maintaining multiple versions made both the user's choice and model management harder. The single-model, dual-mode design avoids all of this: users switch between quality and speed with a single configuration change, according to their actual needs.
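Conceptually, the two modes are just two presets over the same ECE decoder rather than two checkpoints. A hypothetical configuration mapping (the real inference flags are likely named differently) could look like this:

```python
from dataclasses import dataclass

@dataclass
class DecodeConfig:
    draft_threshold: float   # M2T confidence needed to commit a draft token
    edit_steps: int          # number of editing passes after drafting

# Hypothetical presets: same weights, different speed/quality trade-off.
PRESETS = {
    "speed":   DecodeConfig(draft_threshold=0.3, edit_steps=2),  # S mode
    "quality": DecodeConfig(draft_threshold=0.8, edit_steps=4),  # Q mode
}

cfg = PRESETS["speed"]  # one switch, one model
```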
Making the Model Understand Instructions
If the error-correcting editing mechanism makes the model usable, then reinforcement learning makes the model smarter, more reliable, and more user-friendly.
To further improve the model's capabilities, the team added a reinforcement learning stage to the training process. Although recent work (such as SPG, TraceRL, and ESPO) has demonstrated the potential of reinforcement learning for diffusion language models, applying policy-gradient methods to block-autoregressive models remains challenging because the sequence-level log-likelihood is difficult to compute exactly.
To address this, the Ant team proposed and adopted ELBO-based Block-level Policy Optimization (EBPO), designed and adapted specifically for the editable decoding structure.
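The report's exact formulation is not reproduced here, but the general idea of an ELBO-based surrogate can be sketched as follows (notation below is illustrative, not the paper's): because the exact sequence log-likelihood of a masked diffusion model is intractable, the policy-gradient objective replaces $\log \pi_\theta(y \mid x)$ with the standard masked-diffusion evidence lower bound, evaluated block by block,

$$
\log \pi_\theta(y \mid x) \;\ge\; \sum_{b=1}^{B} \; \mathbb{E}_{t,\; y^b_t}\!\left[ \frac{1}{t} \sum_{i \in \mathcal{M}(y^b_t)} \log p_\theta\!\left(y^b_i \,\middle|\, x,\, y^{<b},\, y^b_t\right) \right],
$$

where $B$ is the number of blocks, $y^b$ the tokens of block $b$, $y^b_t$ a version of that block with a random fraction $t$ of positions masked, and $\mathcal{M}(\cdot)$ the set of masked positions. This lower bound then stands in for the log-probability inside the policy-optimization objective.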
More importantly, the team applied reinforcement learning to a hundred-billion-parameter diffusion model for the first time, making the model better at understanding instructions and aligning with human intent rather than just chasing speed.
What's the Actual Effect of LLaDA2.1, Which Can "Correct While Writing"?
These technical innovations ultimately translate into tangible capability gains. In LLaDA2.1's experimental evaluation, the evolution from architectural logic to execution efficiency is on full display.
Table 1 and Table 2 compare LLaDA2.1-Flash and LLaDA2.1-Mini with other models, reporting both task scores and TPF (tokens generated per forward pass). The results show that in S mode, LLaDA2.1's task scores drop slightly relative to LLaDA2.0 while its TPF rises substantially; in Q mode, LLaDA2.1 outperforms LLaDA2.0 in both the Mini and Flash versions.
Table 3 focuses on LLaDA2.1's speed in S mode. Speed differs markedly across task domains: throughput is highest on code-related tasks and comparatively low on instruction-following tasks. Concretely, after quantization LLaDA2.1-Flash reached a peak of 891.74 TPS on the HumanEval+ benchmark, and LLaDA2.1-Mini peaked at 1586.93 TPS, a clear advantage in inference efficiency.
As Table 4 shows, under the same S-mode setting, introducing Multi-Block Editing (MBE) consistently improves both the Flash and Mini versions across multiple benchmarks, at only a small cost in throughput.
Figure 3 compares throughput (tokens per second) for LLaDA2.1 against LLaDA2.0, Ling, and Qwen-3 across the five task domains covered in Table 3. Overall, LLaDA2.1's S mode shows an extremely prominent speed advantage: markedly faster inference at very little cost in output quality.
Will There Be a Paradigm Shift in AI Architecture?
The significance of LLaDA2.1 may lie not in breaking any particular benchmark record, but in putting an old, long-neglected question back on the table.
For the past few years, autoregressive models have been practically the only viable path for large language models. They are reliable, mature, and good enough, so the industry has mostly kept investing along this path without really stopping to ask whether the underlying form of language models has other options.
LLaDA2.1 does not attempt