
Milestone moment: The first 100B diffusion language model is here. The technical report reveals the details behind it.

机器之心 | 2025-12-12 15:51
Have diffusion language models reached hundreds of billions of parameters? The LLaDA2.0 technical report is out.

Unexpectedly, diffusion large language models (dLLMs), still a niche research direction at the beginning of the year, have now been scaled to hundreds of billions of parameters.

Some time ago, we spotted two new models on HuggingFace: LLaDA2.0-mini and LLaDA2.0-flash. They come from a joint team spanning Ant Group, Renmin University, Zhejiang University, and Westlake University, and both adopt the MoE architecture. The former has 16 billion total parameters, while the latter reaches a staggering 100 billion, an unprecedented scale for diffusion large language models.

Even more encouraging, the model gets stronger as it gets bigger: across 47 benchmarks spanning knowledge, reasoning, coding, mathematics, agents, and alignment, LLaDA2.0-flash scored an average of 73.18, on par with the strong autoregressive (AR) model Qwen3-30B-A3B-Instruct-2507 (73.60), and showed clear advantages on complex tasks such as coding (e.g., HumanEval, MBPP) and agentic tool use (BFCL).

For a long time, the autoregressive generation paradigm has dominated the field of large models. Generating tokens one by one from left to right was once highly promising, but its inherent drawbacks have gradually become apparent: long-text generation is computationally expensive, inference is slow, and bidirectional dependencies between tokens are hard to capture. Once an error appears in earlier output it cannot be directly corrected, and it contaminates everything that follows, leading to error accumulation.

The successful scaling of dLLMs shows that an alternative path is feasible. More notably, the rapid evolution of this class of models does not come from pushing a single route to ever larger scale, but from researchers exploring several directions in parallel.

Just this September, the team behind the LLaDA series verified that a dLLM can be trained from scratch under the MoE architecture and released the 7B LLaDA-MoE, offering a new way to realize the diffusion paradigm. Only three months later, the team broke through on another route, smoothly migrating a mature AR model into the diffusion framework, and scaled the model directly to 100 billion parameters.

Demonstration of LLaDA2.0's generation. The model generates tokens in parallel at multiple positions and can revise content it has already generated.

What are the key technical choices behind this? Which methods work in dLLM? In the recently released technical report, the team behind LLaDA2.0 disclosed many details.

  • Report title: LLaDA2.0: Scaling Up Diffusion Language Models to 100B
  • Report link: https://github.com/inclusionAI/LLaDA2.0/blob/main/tech_report.pdf
  • HuggingFace link: https://hf.co/collections/inclusionAI/llada-20

Scaling dLLM to 100B: A Widely Acknowledged Challenge

Recent research has found that in data-constrained settings, dLLMs keep improving with additional training and eventually outperform comparable autoregressive models, and the longer the training, the clearer the advantage. With more abundant or higher-quality data this crossover comes later; with larger models it comes earlier.

This evidence has made training dLLMs increasingly attractive. However, scaling dLLMs up is a widely acknowledged hard problem, and researchers have tried many approaches over the past year or two.

First, training from scratch. Earlier models such as LLaDA and LLaDA-MoE are successful attempts in this direction, showing that a dLLM trained from scratch can come close to an AR model of the same size, and that adding MoE makes the dLLM more efficient and capable. However, limited by available data, infrastructure maturity, compute cost, and training time, dLLMs trained from scratch have remained small (≤8B) and still lag behind the most advanced AR models in overall performance.

Second, starting from a pre-trained AR model so that the dLLM inherits its knowledge and capabilities, reducing training cost and narrowing the performance gap. Representative works in this direction include DiffusionLLaMA, Dream-7B, RND1, and Block DLM, which use techniques such as mask annealing and block diffusion to transfer the pre-trained language abilities of AR models into the diffusion structure. However, these attempts have not exceeded the 30B scale, and given the low training efficiency of block diffusion itself, the approach is hard to extend directly to large-scale training on massive corpora.

Finally, efforts in the post-training stage. For fine-tuning, existing work has shown that after SFT, dLLMs can match top AR models on tasks such as code generation and complex planning. For reinforcement learning, because the log-likelihood of a dLLM is hard to compute, researchers have had to design new algorithms, and have even trained the first dLLMs with long-chain reasoning capabilities. For inference acceleration, dynamic pruning and hybrid AR-diffusion paradigms have, for the first time, pushed dLLM inference speed past that of AR models of the same scale. Overall, however, post-training research is still in its infancy, and how these techniques can be coordinated and scaled to models with hundreds of billions of parameters remains an open question.

The emergence of the LLaDA2.0 model provides a solution to these problems.

LLaDA2.0: A Better Recipe for Stably Training Diffusion Models at the Hundred-Billion Scale

Unlike previous models such as LLaDA-MoE, LLaDA2.0 did not train a dLLM from scratch. Instead, it "smoothly" transformed an existing AR model into a diffusion model and conducted large-scale training and alignment on that basis.

To achieve this transformation, LLaDA2.0 proposes a systematic solution, from reworking the training paradigm and tightening the coupling between pre-training and post-training to adapting and optimizing the training and inference infrastructure, offering an implementation path distinct from previous methods.

Overall, LLaDA2.0 achieves the AR-to-dLLM transformation efficiently by building a staged, scalable training pipeline.

As shown in Figure 2 below, first, Continuous Pre-training (CPT) reconstructs an AR base model into a Masked Diffusion Language Model (MDLM), teaching it bidirectional denoising and allowing a smooth transition to the diffusion paradigm while preserving the geometric structure of the original AR model's representations.

Next, Block Diffusion Pre-training is introduced on top of the trained MDLM. The model no longer focuses on individual tokens but learns to denoise contiguous text segments (blocks). Moving from tokens to blocks significantly improves long-range generation consistency and brings higher computational efficiency.

Finally, once it has both token-level and block-level generation capabilities, the model acquires stronger intent understanding and instruction following through post-training (including SFT and DPO), better meeting the requirements of downstream tasks. After this stage, the generation capabilities gained during diffusion pre-training translate efficiently into performance on real tasks.

Training flow chart of LLaDA2.0.

Next, we will analyze these key steps one by one.

Continuous Pre-training

Because the causal modeling of AR models differs fundamentally from the bidirectional denoising mechanism of dLLMs, the transformation from the former to the latter cannot be achieved simply by swapping the training objective. LLaDA2.0 therefore adopted a Warmup–Stable–Decay (WSD) continuous pre-training strategy.

During the Warmup stage, the team treats an AR base model such as Ling-mini-2.0 (16B) as a Block Diffusion Language Model (BDLM) with block size 1 and gradually increases the block size along the schedule 1 → 4 → 32 → 64 → 4096. Each block-size change is trained on a moderate amount of data to keep the transition smooth. When the block size reaches the maximum of 4096, the BDLM becomes a standard Masked Diffusion Language Model (MDLM), completing the structural migration from causal generation to global bidirectional denoising.
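To make this block-size schedule concrete, here is a minimal sketch (our own illustration, not the team's code) of a block-diffusion attention mask: with block size 1 it reduces to the causal mask of an AR model, and with block size equal to the sequence length it becomes fully bidirectional, matching the MDLM endpoint of the warmup.

```python
import torch

def block_diffusion_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask for block diffusion (True = may attend).

    Tokens attend bidirectionally within their own block and causally to all
    earlier blocks. block_size=1 recovers the AR causal mask; block_size=seq_len
    recovers a fully bidirectional (MDLM-style) mask.
    """
    block_id = torch.arange(seq_len) // block_size        # block index of each position
    # position i may attend to position j iff j's block does not come after i's block
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)  # (seq_len, seq_len)

# Warmup schedule from the report: 1 -> 4 -> 32 -> 64 -> 4096 (shown here on a toy length)
for b in (1, 4, 8):
    print(f"block_size={b}:\n{block_diffusion_attention_mask(8, b).int()}")
```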

Next is the Stable stage. After the block size is fixed at 4096 and the model is transformed into the global bidirectional denoising paradigm, the MDLM is trained on a large-scale corpus to master diffusion-based generation and bidirectional context modeling capabilities.

After completing the MDLM training, the Decay stage begins. The team gradually reduces the block size from 4096 to a size more suitable for inference (e.g., 32), thus converting back to an efficient BDLM. In this way, the global context knowledge learned by the model during the MDLM stage is distilled back into a more compact block-level structure, enabling the model to have both the bidirectional semantic capabilities of diffusion and the inference efficiency of block-level generation.

In addition, because multiple documents are concatenated into long sequences during training, spurious long-range dependencies can form between semantically unrelated texts. To address this, the team introduced a Document-level Attention Mask, which avoids such cross-document interference, prevents semantic pollution, and keeps bidirectional modeling stable.
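A document-level mask of this kind can be sketched as follows (our own illustration, assuming each position carries the ID of the document it came from); it is simply intersected with the block-diffusion mask above so that attention never crosses document boundaries.

```python
import torch

def document_attention_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """True where the query and key positions belong to the same document."""
    return doc_ids.unsqueeze(1) == doc_ids.unsqueeze(0)

# Example: three documents packed into one training sequence of length 8.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
doc_mask = document_attention_mask(doc_ids)
# combined = block_diffusion_attention_mask(8, 4) & doc_mask  # final training mask
print(doc_mask.int())
```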

To further enhance the generalization and robustness of the BDLM, the team adopted a Top-k checkpoint fusion strategy: after pre-training, the k best-performing checkpoints are selected based on validation metrics such as perplexity, and their parameters (weights, biases, etc.) are arithmetically averaged to obtain a more robust BDLM initialization.
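Checkpoint fusion of this kind amounts to a simple parameter average; the sketch below is illustrative only (the file names and equal weighting are assumptions, and the team's actual selection criteria may differ).

```python
import torch

def fuse_checkpoints(paths: list[str]) -> dict[str, torch.Tensor]:
    """Arithmetically average the parameters of the k selected checkpoints.

    `paths` points to the top-k checkpoints (e.g., chosen by validation
    perplexity), each saved as a state_dict of floating-point tensors.
    """
    fused: dict[str, torch.Tensor] = {}
    for path in paths:
        state = torch.load(path, map_location="cpu")
        for name, param in state.items():
            fused[name] = fused.get(name, 0) + param.float() / len(paths)
    return fused

# fused = fuse_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])  # hypothetical paths
# model.load_state_dict(fused)  # a more robust BDLM initialization
```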

Through this entire process, LLaDA2.0 gives the industry a stable recipe, usable as a reference, for training diffusion models at the hundred-billion-parameter scale.

Post-training

After completing the continuous pre-training from the AR to the dLLM paradigm, LLaDA2.0 also conducted systematic post-training, mainly including the following three core steps.

First, SFT (Supervised Fine-Tuning): after pre-training, SFT aligns the model with user instructions, with several key improvements. Sequence lengths are aligned to block boundaries for compatibility with the block-level attention structure; a mask-ratio bandwidth avoids the ineffective training and gradient instability caused by samples with almost no noise or excessive noise; Complementary Masking guarantees that every token in a sequence is learned at least once within a training batch, significantly improving sample utilization and convergence speed; and the data covers reasoning, general, and industrial domains to keep the model's capabilities balanced.
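The two masking tricks can be illustrated as follows; this is a sketch under our own assumptions (the report does not spell out the bandwidth bounds here): the mask ratio is drawn from a bounded interval rather than all of [0, 1], and a second, complementary mask guarantees that every response token is masked, and therefore trained on, in one of the two views.

```python
import torch

def sample_complementary_masks(
    response_len: int,
    ratio_min: float = 0.2,   # assumed bandwidth bounds, not taken from the report
    ratio_max: float = 0.8,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample a mask whose ratio lies in the bandwidth, plus its complement.

    The first mask covers roughly `ratio` of the response tokens; the second
    covers exactly the remaining ones, so every token is learned at least once.
    """
    ratio = torch.empty(1).uniform_(ratio_min, ratio_max).item()
    mask_a = torch.rand(response_len) < ratio
    mask_b = ~mask_a                          # complementary view
    return mask_a, mask_b

mask_a, mask_b = sample_complementary_masks(response_len=16)
assert (mask_a | mask_b).all()                # no token escapes training
```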

Second, CAP (Confidence-Aware Parallel Training): an additional confidence loss introduces an entropy-minimization objective on correctly predicted tokens, raising the model's prediction confidence and enabling faster parallel decoding, which strikes a good balance between generation quality and inference speed.
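One plausible form of such a confidence term is sketched below (our reading of the description, not the exact loss from the report; `lambda_conf` is an assumed weight): on top of the usual masked-token cross-entropy, the entropy of the predictive distribution is minimized only at positions the model already predicts correctly, sharpening those distributions so that more tokens can be accepted per parallel decoding step.

```python
import torch
import torch.nn.functional as F

def cap_loss(logits: torch.Tensor, targets: torch.Tensor, masked: torch.Tensor,
             lambda_conf: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus an entropy penalty on correctly predicted masked tokens.

    logits: (seq_len, vocab); targets: (seq_len,); masked: (seq_len,) bool;
    lambda_conf is an assumed weighting coefficient.
    """
    masked_logits = logits[masked]
    masked_targets = targets[masked]
    ce = F.cross_entropy(masked_logits, masked_targets)

    probs = masked_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    correct = probs.argmax(dim=-1) == masked_targets        # already-correct positions
    conf = entropy[correct].mean() if correct.any() else logits.new_zeros(())

    return ce + lambda_conf * conf
```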

Third, DPO (Direct Preference Optimization): this aligns the model with human preferences. The team built a preference dataset spanning general knowledge, mathematics, instruction following, and other domains, totaling 1.5 million preference pairs. Because the exact log-likelihood of a diffusion model is hard to compute, the Evidence Lower Bound (ELBO) of the reconstruction loss is used in its place, yielding a DPO formulation suited to diffusion models.
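Conceptually, this keeps the standard DPO objective but swaps each exact log-likelihood for an ELBO estimate. A minimal sketch follows, with `elbo(...)` as a hypothetical helper that returns a Monte-Carlo ELBO estimate of a response's log-probability under a given model.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(elbo_pi_w, elbo_pi_l, elbo_ref_w, elbo_ref_l, beta: float = 0.1):
    """DPO loss with ELBO estimates standing in for exact log-likelihoods.

    *_w / *_l are ELBO values of the chosen ("winner") and rejected ("loser")
    responses under the policy (pi) and the frozen reference model (ref);
    beta is the usual DPO temperature.
    """
    margin = (elbo_pi_w - elbo_ref_w) - (elbo_pi_l - elbo_ref_l)
    return -F.logsigmoid(torch.as_tensor(beta * margin)).mean()

# Usage sketch (elbo() is a hypothetical estimator of the diffusion ELBO):
# loss = diffusion_dpo_loss(elbo(policy, x, y_w), elbo(policy, x, y_l),
#                           elbo(ref, x, y_w),    elbo(ref, x, y_l))
```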

Together, these three post-training techniques form a complete optimization loop across capability shaping, inference-efficiency improvement, and human-preference alignment, taking LLaDA2.0 from a general diffusion-based generator to a high-performance, practical large model.

Training and Inference Infrastructure

To further address issues such as training stability, large-scale scalability, and inference efficiency, LLaDA2.0 conducted targeted engineering optimizations and mechanism designs during the pre-training, post-training, and inference stages respectively.

During the pre-training stage, the team used Megatron-LM as the training backend and combined multiple parallelism strategies, including data parallelism (DP), pipeline parallelism (PP), tensor parallelism (TP), context parallelism (CP), and expert parallelism (EP), so that models with hundreds of billions of parameters maintain high throughput and strong scalability under long sequences and complex attention structures.
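As a rough illustration of how these degrees compose (generic Megatron-style arithmetic, not the team's actual configuration): the tensor-, pipeline-, and context-parallel sizes multiply to shard a single model replica, the remaining GPUs form data-parallel replicas, and expert parallelism further partitions the MoE expert weights across that data-parallel dimension.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Data-parallel degree left over once TP, PP, and CP are fixed.

    In Megatron-style multi-dimensional parallelism, one model replica occupies
    tp * pp * cp GPUs; the rest of the cluster holds data-parallel replicas,
    and EP splits MoE expert weights inside the data-parallel dimension.
    """
    assert world_size % (tp * pp * cp) == 0, "parallel degrees must divide the GPU count"
    return world_size // (tp * pp * cp)

# Hypothetical example: 1024 GPUs with TP=4, PP=8, CP=2 leave DP=16 replicas.
print(data_parallel_size(1024, tp=4, pp=8, cp=2))
```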

In addition, by introducing a cuDNN-based attention implementation, the team significantly accelerated arbitrary-block diffusion training. When training LLaDA2.0-mini, compared with the non-fused attention implementation in TransformerEngine, this approach delivered an end-to-end speedup of over 1.3× and cut attention-layer memory by more than 90%. The team also resolved numerical instability early in diffusion training by adding independent Gaussian noise to the output of the masked-token embedding.
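The last trick can be sketched in a couple of lines (an illustration of the idea only; the noise scale `sigma` is an assumption, not a value from the report):

```python
import torch

def noisy_mask_embeddings(mask_embedding: torch.Tensor, n_masked: int,
                          sigma: float = 0.01) -> torch.Tensor:
    """Replicate the [MASK] embedding with independent Gaussian noise per position.

    Perturbing each masked position's embedding independently helps keep the
    early phase of diffusion training numerically stable; sigma is an assumed scale.
    """
    noise = sigma * torch.randn(n_masked, mask_embedding.shape[-1])
    return mask_embedding.unsqueeze(0) + noise   # (n_masked, hidden_dim)
```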