
Mengdi Wang's team launches TraceRL: towards "unified RL" for diffusion language models

Academic Headlines | 2025-09-15 17:45
A Princeton University team proposed the TraceRL framework, enabling diffusion-based language models to outperform larger autoregressive models on mathematical reasoning.

Given the limitations of autoregressive large language models (LLMs) in computational efficiency and throughput, diffusion language models (DLMs) are receiving increasing attention.

However, the field currently lacks a unified, effective reinforcement learning (RL) framework applicable to the various DLM architectures (such as full-attention DLMs and block-attention DLMs). Meanwhile, existing research has also overlooked the importance of aligning inference trajectories with training objectives.

In a recent study, the research team led by Professor Mengdi Wang at Princeton University proposed a “trajectory-aware RL” framework called TraceRL. It can be applied to both full-attention and block-attention models and enables rapid optimization.

Paper link: https://arxiv.org/abs/2509.06949

Notably, a 4B DLM trained with TraceRL outperformed a 7B autoregressive model on multiple complex mathematical reasoning tasks.

Through curriculum learning, they also introduced the first long chain-of-thought diffusion language model (long-CoT DLM), which achieved an 18.1% relative accuracy improvement on MATH500 compared to Qwen2.5-7B-Instruct.

Figure | Left: RL training dynamics of different methods, where TraceRL achieves the best optimization. Right: comparison on complex mathematical reasoning tasks evaluated with KV cache, alongside LiveCodeBench-V2 benchmark results.

Meanwhile, they also proposed a diffusion-based value model to reduce variance and improve training stability, and explored broader potential applications of TraceRL, such as enlarging the model's block size and accelerating inference.
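To make the variance-reduction idea concrete, here is a minimal, generic sketch of how a value baseline is typically subtracted from rewards before a policy update. It is illustrative only and is not the paper's diffusion value model; the function name, the toy rewards, and the value predictions are all hypothetical.

```python
import torch

def advantages_with_baseline(rewards, values=None):
    """Generic illustration: policy-gradient updates are weighted by
    (reward - baseline). When the baseline (a value model's prediction)
    tracks the expected return, these advantages have lower variance
    than raw rewards, which stabilizes training."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    if values is None:
        # No critic: fall back to centering rewards within the batch/group.
        return rewards - rewards.mean()
    values = torch.as_tensor(values, dtype=torch.float32)
    # Critic available: subtract the per-sample value prediction.
    return rewards - values

# Hypothetical numbers: verifiable 0/1 task rewards and value predictions.
rewards = [1.0, 0.0, 1.0, 1.0]
values = [0.8, 0.2, 0.7, 0.9]
print(advantages_with_baseline(rewards))          # group-centered advantages
print(advantages_with_baseline(rewards, values))  # value-baseline advantages
```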

In addition, they open-sourced dLLM-RL, a fully integrated framework for building, training, and deploying DLMs across different architectures. It includes implementations of various post-training methods and KV-cache acceleration techniques, supporting both reproducible research and practical applications.

Code link: https://github.com/Gen-Verse/dLLM-RL

Urgent Need to Solve the “Mismatch” Problem of DLMs

The research team emphasized that there is a significant mismatch between the objectives adopted by DLMs during the post-training phase and the trajectories followed during actual inference (text generation). Standard training methods, such as full random masking, can achieve parallel decoding but ignore the inherent context-dependent sequential logic of language. This disconnect between training and inference behaviors leads to inefficient model optimization.

To illustrate this difference, they first demonstrated through experiments that the semi-autoregressive fine-tuning method, which trains the model to generate subsequent content based on previous context, significantly outperforms the full random masking method in terms of optimization performance, even under the same computational load. This indicates that aligning training objectives with inference patterns is crucial.
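The contrast between the two objectives can be sketched in a few lines of Python. The snippet below is a simplified illustration rather than the paper's training code: `full_random_masking` may mask any position, while `semi_autoregressive_masking` keeps earlier blocks clean, masks only inside the current block, and hides later blocks, which is closer to how the model actually generates text block by block. The function names, the `MASK` token, and `block_size` are hypothetical.

```python
import random

MASK = "<mask>"

def full_random_masking(tokens, mask_ratio=0.5):
    """Full random masking: any position may be masked, so the model learns to
    fill tokens from arbitrary bidirectional context, ignoring generation order."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_ratio:
            corrupted.append(MASK)
            targets[i] = tok              # supervised position
        else:
            corrupted.append(tok)
    return corrupted, targets

def semi_autoregressive_masking(tokens, block_size=4, mask_ratio=0.5):
    """Block-wise masking: keep the prefix before the chosen block clean, mask
    inside the block, and hide all later blocks, mirroring block-by-block decoding."""
    n_blocks = (len(tokens) + block_size - 1) // block_size
    b = random.randrange(n_blocks)        # the block being trained
    start, end = b * block_size, min((b + 1) * block_size, len(tokens))
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if i < start:
            corrupted.append(tok)         # clean prefix context
        elif i < end and random.random() < mask_ratio:
            corrupted.append(MASK)
            targets[i] = tok              # supervised position inside the block
        elif i < end:
            corrupted.append(tok)
        else:
            corrupted.append(MASK)        # future blocks stay fully hidden
    return corrupted, targets

tokens = ["The", "sum", "of", "2", "and", "3", "is", "5"]
print(full_random_masking(tokens))
print(semi_autoregressive_masking(tokens, block_size=4))
```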

To further verify the importance of alignment, they collected the model's “preferred inference trajectories”, i.e., the actual step sequences followed by the model during content generation. The experimental results showed that fine-tuning using these real inference trajectories achieved better performance than other baseline methods with lower or comparable computational costs.

Finally, although fine-tuning on preferred trajectories works well, collecting those trajectories requires substantial additional effort. RL, by contrast, generates such inference trajectories naturally during its “rollouts” (sampling from the model), making it a more practical and effective post-training strategy that can use these trajectories to optimize the model.
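The following toy sketch shows what “recording a trajectory during a rollout” can look like: at each denoising step the most confident still-masked positions are committed, and the step order is logged so it can later serve as a training target. This is an assumption-laden illustration, not the paper's sampler; `rollout_with_trace`, the confidence-based selection rule, and the fake model are all hypothetical.

```python
import torch

def rollout_with_trace(logits_fn, length, steps, mask_id=-1):
    """Toy masked-diffusion rollout that records its own trajectory: at each
    step, commit the k most confident still-masked positions and log which
    positions were filled in which order."""
    x = torch.full((length,), mask_id)          # -1 marks still-masked slots
    trace = []                                  # [(step, positions, tokens), ...]
    per_step = max(1, length // steps)
    for t in range(steps):
        logits = logits_fn(x)                   # (length, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)
        conf[x != mask_id] = -1.0               # ignore already-committed slots
        k = min(per_step, int((x == mask_id).sum()))
        if k == 0:
            break
        pos = conf.topk(k).indices
        x[pos] = pred[pos]                      # commit the most confident tokens
        trace.append((t, pos.tolist(), pred[pos].tolist()))
    return x, trace

# Stand-in "model": random logits over a 10-token vocabulary.
fake_model = lambda x: torch.randn(x.shape[0], 10)
sample, trace = rollout_with_trace(fake_model, length=8, steps=4)
print(trace)   # e.g. [(0, [3, 6], [...]), (1, [0, 7], [...]), ...]
```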

TraceRL: “Small Diffusion Language Model” > “Large Autoregressive Model”

In this work, TraceRL focuses on the intermediate trajectories generated by DLMs and can be applied across architectures.

Figure | Overview of TraceRL. This example uses parameter settings of s = 2, L = 6, and B = 3. Trajectory-aware RL is achieved by aggregating every s adjacent steps. The numbers in the boxes correspond to the execution order of the policy inference process.
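Building on the toy trajectory format from the earlier sketch, the snippet below illustrates the aggregation idea described in the caption: every s adjacent decoding steps are grouped into one unit, and the positions and tokens committed in those steps form one trajectory-aware training target. This is a schematic reading of the figure under stated assumptions, not the actual TraceRL objective; the dictionary fields are hypothetical.

```python
def aggregate_trace(trace, s=2):
    """Group every s adjacent decoding steps of a recorded trajectory into one
    training unit; each unit collects the positions and tokens committed during
    those steps, and the policy is then updated on these aggregated units."""
    units = []
    for i in range(0, len(trace), s):
        chunk = trace[i:i + s]
        units.append({
            "steps": [step for step, _, _ in chunk],
            "positions": [p for _, pos, _ in chunk for p in pos],
            "tokens": [t for _, _, toks in chunk for t in toks],
        })
    return units

# Example: a toy trajectory over L = 6 positions, one position per step,
# aggregated with s = 2 as in the figure (the block structure B is not modeled here).
example_trace = [(0, [0], [11]), (1, [1], [12]),
                 (2, [2], [13]), (3, [3], [14]),
                 (4, [4], [15]), (5, [5], [16])]
print(aggregate_trace(example_trace, s=2))
```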

In terms of data, the research team drew on several sources:

(1) Used the MATH training set to curate 8,000 hard problems;

(2) Chose GSM8K, MATH500, and AIME2024 as test benchmarks for mathematical reasoning;

(3) For the coding RL scenario, used 6,000 verified problems provided by the PrimeIntellect platform;

(4) For coding evaluation, selected LiveCodeBench-V2 and LiveBench as benchmarks.

Table | Main benchmark test results for different mathematics and programming tasks. “Static” refers to static sampling, and “Dynamic” refers to dynamic sampling. Here, the dynamic sampling method with a threshold of 0.9 is used to evaluate the long CoT model TraDo-8B-Instruct.

For model training, both full-attention and block-attention models were covered, and results under static sampling and dynamic sampling are reported for each evaluation. The procedure was as follows:

Step 1: Train the model separately with TraceRL;

Step 2: Jointly train the long CoT model;

Step 3: Conduct comparative experiments between TraceRL and other RL methods;

Step 4: Validate TraceRL in full-attention models and coding tasks;

Step 5: Conduct block size expansion experiments.

Based on the experimental results, they demonstrated the effectiveness and strong performance of TraceRL. The complete results are as follows:

First, they developed two models, TraDo-4B-Instruct and TraDo-8B-Instruct, by applying TraceRL to the SDAR base models. Across evaluations on five reasoning datasets covering mathematics and programming, these models not only compared favorably with strong diffusion and autoregressive language models but also showed clear advantages in generation ability.

Figure | Training curves of the 4B and 8B models under TraceRL on mathematical tasks. The red curve shows dynamic-sampling accuracy, which samples faster; the blue curve shows static-sampling accuracy, which reaches higher accuracy. The 4B model is trained with a value model, while the 8B model is trained directly with the policy objective.

TraDo-4B-Instruct demonstrated SOTA-level performance on reasoning tasks, confirming the effectiveness of TraceRL. Under both dynamic sampling (faster) and static sampling (more accurate), the model's performance improved significantly. Notably, on all mathematical tasks, TraDo-4B-Instruct even outperformed strong autoregressive baselines such as Qwen2.5-7B-Instruct.

Although they adopted a dynamic sampling strategy during RL training, both dynamic and static accuracy rose steadily, indicating that the model still has headroom for further improvement. This RL training significantly improved the model's mathematical reasoning ability:

On MATH500, TraDo-4B-Instruct gained 5.4% in static accuracy and 4.2% in dynamic accuracy, outperforming Qwen2.5-7B-Instruct after optimization. TraDo-8B-Instruct gained 4.2% in static accuracy and 4.8% in dynamic accuracy.

Figure | Ablation of RL methods on block diffusion models for mathematical RL tasks. The red and yellow curves correspond to TraceRL with and without the value model, respectively. The blue curve uses a random-masking objective restricted within each block, similar to semi-autoregressive training, and the green curve additionally adds complementary masks within each block.

They further conducted a comparative study between TraceRL and existing RL methods, focusing on block diffusion models. Although current RL methods are mainly developed for full-attention models, the team adapted them directly to the block structure. For the random-masking method, they restricted sampling within each block to make it resemble the semi-autoregressive method. For coupled RL, they introduced a complementary masking objective within each block to obtain more stable and efficient training. The experimental results on mathematical tasks showed that TraceRL performed best regardless of the optimization strategy used.
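As a rough illustration of the complementary-mask variant mentioned above (the green curve in the ablation figure), the sketch below samples one random mask inside a block and pairs it with its complement, so that every token in the block is supervised exactly once across the two passes. This is an interpretation of the article's description, not the authors' code; the function name and mask ratio are assumptions.

```python
import random

def complementary_block_masks(block_positions, mask_ratio=0.5):
    """Sample one random mask covering roughly mask_ratio of the block, then take
    its complement within the block, so the two masks together cover every
    position exactly once."""
    k = max(1, round(len(block_positions) * mask_ratio))
    first = set(random.sample(block_positions, k))
    second = set(block_positions) - first      # the complementary mask
    return first, second

block = list(range(8, 12))                     # token positions of one block
m1, m2 = complementary_block_masks(block)
print(sorted(m1), sorted(m2))                  # disjoint, union = the whole block
```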

Figure | Ablation of RL training on the full-attention model Dream-7B-Coder-Instruct for coding tasks, together with a comparison between training with and without the value model, which shows that introducing the value model effectively reduces fluctuations during training (the value-model comparison was conducted on the 4B model's mathematical tasks).

In addition, to verify the broad applicability of TraceRL, they also ran coding RL experiments on a full-attention model. Starting from Dream-7B-Coder-Instruct, RL training was carried out after a cold-start fine-tuning phase on distilled data. To accelerate training, the shrinkage parameter was set to s = 8. The experiments showed that TraceRL converged faster and achieved better performance.

Promising Future

Based on the above experimental results, the research team verified the effectiveness of TraceRL in different RL tasks. Meanwhile, they also demonstrated the advantages of TraceRL in accelerating inference and expanding block size, which provides promising directions for future research.

In particular, combining the accelerated inference of diffusion models with their potentially strong reasoning ability is an exciting research direction. Although current long-CoT LLMs perform well on complex tasks, their inference is slow. Such a combination could efficiently execute complex reasoning tasks at scale, opening up new application possibilities.

The research team also noted that their proposed diffusion value model can incorporate process rewards and provide a stronger supervision signal than a single verifiable reward. In the future, they will further explore optimizing TraceRL with process rewards.

This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao). Compiled by Xiaoyu. Republished by 36Kr with authorization.