The team at Tsinghua University has developed TDRM: a smoothed Reward Model based on Temporal Difference Learning.
The Reward Model (RM) plays a central role in both Large Language Model (LLM)-based Reinforcement Learning (RL) and validation during the inference process, and has already demonstrated outstanding performance in tasks such as mathematical problem-solving, code generation, and instruction following.
However, the existing Reward Model has a crucial deficit - the lack of temporal consistency, which causes problems such as "ineffective policy updates" and "unstable Reinforcement Learning training".
Specifically, the reward for a step in an LLM's inference trace is often unrelated to the rewards of neighboring steps. This leads to inconsistent and easily misleading signals during training and makes it difficult to provide effective guidance at the inference stage. These problems are particularly prominent in scenarios with long Chains-of-Thought (CoT) - the model receives no reward until it has completed a long series of inference steps and can therefore hardly judge "which step is useful and which is redundant".
To solve this problem, a team from Tsinghua University, in collaboration with the California Institute of Technology, proposed the TDRM framework, which minimizes the Temporal Difference (TD) error during training to learn a smoother and more reliable Reward Model.
It is worth noting that all of the code, data, and language model checkpoints have been released as open source on GitHub.
Publication link: https://arxiv.org/abs/2509.15110
GitHub address: https://github.com/THUDM/TDRM
The research results show that the Process Reward Model (PRM) trained by TD can achieve performance improvements of up to 6.6% and 23.7% in Best-of-N and tree search scenarios respectively.
In addition, the TD-trained Process Reward Model, combined with Reinforcement Learning with Verifiable Rewards (RLVR), enables more data-efficient Reinforcement Learning - it achieves comparable performance with only 2,500 examples where the baseline method requires 50,100 - and yields better language model policies across 8 model variants, including Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, and GLM-Z1-9B-0414.
Development of a smoother and more reliable Reward Model
In contrast to previous methods that use Temporal Difference to create offline datasets for intermediate reward signals, TDRM uses Temporal Difference Learning to create a reliable Reward Model for Reinforcement Learning training, which generates a smoother reward space and denser reward signals.
As described in the publication, the TDRM framework includes the following three core modules:
- Process Reward Model: The Process Reward Model is trained by n-step Temporal Difference Learning in combination with Reward Shaping.
- Reinforcement Learning: Under the guidance of the trained Process Reward Model, online Reinforcement Learning is carried out to optimize policy updates.
- TDRM Integration: The process reward and the verifiable reward are effectively combined through a linear weighting and applied in actor-critic-style online Reinforcement Learning across policy models of different series and sizes.
Figure | Schematic diagram of the entire TDRM framework
The Temporal Difference method iteratively refines estimates of state values under a policy by exploiting the dependencies between states. Specifically, the n-step Temporal Difference algorithm integrates the rewards and value estimates of the following n states and applies an exponential decay factor to future rewards. This encourages the policy to obtain rewards earlier and balances short-term gains against the long-term impact of its behavior.
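For intuition, the following Python sketch computes an n-step TD target under assumed names; the discount factor, the helper signature, and the bootstrap form are illustrative choices, not the paper's exact implementation.

```python
def n_step_td_target(rewards, values, t, n, gamma=0.95):
    """Illustrative n-step TD target for step t of a trace.

    rewards: per-step rewards observed along the trace
    values:  current value estimates for each step
    gamma:   assumed exponential decay factor applied to future rewards
    """
    horizon = min(n, len(rewards) - t)
    # Discounted sum of the next n rewards ...
    target = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    # ... plus the bootstrapped value of the state n steps ahead, if it exists.
    if t + n < len(values):
        target += gamma ** n * values[t + n]
    return target
```

Because later rewards are discounted, targets computed this way favor traces that earn reward early, which is exactly the balance between short-term gains and long-term impact described above.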
1. Smoothness
Smoothness is an important property for effective Reward Model construction during the inference process, as it reflects the consistency and stability of value updates in intermediate steps and ensures that small changes in the inference trace do not cause disproportionate deviations in value estimates. To evaluate smoothness, the team used two complementary methods to compare the performance of ScalarPRM and TDRM.
- Local Lipschitz constant: quantifies how sensitive the reward is to changes between neighboring states (a minimal estimation sketch follows the figure below). The analysis shows that TDRM has a smaller average Lipschitz constant between consecutive steps (0.2741) than ScalarPRM (0.3331), indicating that its reward transitions are smoother and more temporally consistent;
- TD error: the TD error between consecutive inference steps, combined with the value differences between steps, evaluates the continuity and consistency of the value estimates from two complementary dimensions.
Figure | Comparison of the smoothness of Reward Models
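The Python sketch below shows one way such a local Lipschitz estimate could be computed from a trace of scalar step rewards; treating consecutive steps as unit-distance neighbors is a simplifying assumption for illustration, not necessarily the paper's exact protocol.

```python
def local_lipschitz(step_rewards):
    """Average local Lipschitz constant of a per-step reward trace.

    step_rewards: scalar rewards assigned to consecutive inference steps
    by a process reward model. Treating consecutive steps as unit-distance
    neighbors (an assumption), the local constant reduces to |r_{i+1} - r_i|;
    a smaller average means smoother, more temporally consistent rewards.
    """
    diffs = [abs(b - a) for a, b in zip(step_rewards, step_rewards[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

# A smoother trace yields a smaller constant.
print(local_lipschitz([0.2, 0.25, 0.3, 0.35]))  # ~0.05
print(local_lipschitz([0.2, 0.9, 0.1, 0.8]))    # ~0.73
```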
However, previous studies have shown that the CoT length of an LLM does not always grow stably. Based on the above analysis of reward smoothness, the research team argues that Reward Shaping is the key mechanism for stabilizing this emergent length-scaling behavior.
2. Reward Modeling
As described in the publication, Reward Shaping in the Temporal-Difference-based PRM framework has two goals: on the one hand, it optimizes the Temporal Difference update by providing structured feedback, and on the other hand, it alleviates the volatility of reward signals across different inference lengths. Its components, sketched in code after this list, are:
- Cosine Reward: A cosine-based reward function is implemented to account for the correctness and relative length of each inference step. It assigns different reward ranges to correct and incorrect steps. The reward starts at the maximum value and gradually decreases to the minimum value as the inference length approaches the maximum length.
- Temporal Difference: The calculated Cosine Reward is combined with the Temporal Difference framework to update the Process Reward Model.
- TD-λ: Compared to the n-step Temporal Difference algorithm, TD-λ is a more flexible online algorithm. Due to its online nature, TD-λ enables the Process Reward Model to immediately propagate information to earlier states after it observes a reward.
- Loss function: To optimize the Process Reward Model, the cross-entropy loss function is used. The clipped Temporal Difference targets are used as soft labels for each inference step so that the model can learn from the temporal consistency of rewards.
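The sketch below shows how these pieces could fit together in code; the reward ranges, the clipping interval, the sigmoid parameterization, and all names are assumptions for illustration rather than the paper's exact hyperparameters.

```python
import math
import torch
import torch.nn.functional as F

def cosine_reward(step_idx, max_len, correct,
                  r_correct=(0.5, 1.0), r_wrong=(-1.0, -0.5)):
    """Cosine-shaped reward over the relative length of an inference step.

    Correct and incorrect steps get different (assumed) reward ranges; the
    reward starts at the range maximum and decays toward the minimum as the
    trace length approaches max_len, as described above.
    """
    lo, hi = r_correct if correct else r_wrong
    frac = min(step_idx / max_len, 1.0)
    decay = 0.5 * (1.0 + math.cos(math.pi * frac))  # 1 -> 0 as frac goes 0 -> 1
    return lo + (hi - lo) * decay

def prm_soft_label_loss(step_logits, td_targets):
    """Cross-entropy loss with clipped TD targets as soft labels.

    step_logits: scalar logits from the process reward model, one per step.
    td_targets:  TD(lambda) targets, clipped to [0, 1] (assumed range).
    """
    targets = td_targets.clamp(0.0, 1.0)
    probs = torch.sigmoid(step_logits)
    return F.binary_cross_entropy(probs, targets)
```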
3. Reinforcement Learning
In the field of Reinforcement Learning, the research team designed it as an online algorithm that dynamically uses current state values during training to calculate Temporal Difference targets. In contrast to offline algorithms that rely on pre-calculated state values, this method can adapt to constantly changing traces and use the seen traces to estimate the state values of unseen traces. This adaptability ensures more accurate value predictions and thus enhances the consistency and robustness of the Reward Model.
Figure | Processing flow of the TDRM algorithm
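As a rough illustration of this online behavior, the sketch below recomputes TD targets with the current value estimates every time a trace is scored, instead of reading them from a precomputed table; the one-step form and the `value_model` helper are assumptions.

```python
def online_td_targets(step_rewards, states, value_model, gamma=0.95):
    """Recompute TD targets with the *current* value model (assumed helper).

    Unlike an offline scheme that relies on precomputed state values, the
    targets here track the value model as it is updated during training,
    so newly seen traces immediately inform the estimates for unseen ones.
    """
    values = [value_model(s) for s in states]  # fresh estimates each call
    targets = []
    for t, r in enumerate(step_rewards):
        bootstrap = gamma * values[t + 1] if t + 1 < len(values) else 0.0
        targets.append(r + bootstrap)          # one-step TD target
    return targets
```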
In TDRM, the verifiable reward and the process-based reward are merged through a linear combination to exploit their complementary advantages. This combined reward signal is used to optimize the GRPO objective, thereby improving the overall performance and data efficiency of the learning process.
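A minimal sketch of such a combination is shown below; the mixing coefficient and the function names are assumptions, not values taken from the paper.

```python
def combined_reward(r_verifiable, r_process, alpha=0.5):
    """Linear combination of a verifiable (rule-based) reward and the
    process reward from the TD-trained PRM.

    alpha is an assumed mixing coefficient; the combined scalar is what
    feeds the GRPO-style policy update described above.
    """
    return alpha * r_verifiable + (1.0 - alpha) * r_process
```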
For more technical details, please refer to the publication.
What is the real effect?
To verify the effectiveness of TDRM, the research team tested the performance of TDRM in two scenarios: validation during inference and online Reinforcement Learning during training.
- For validation during inference, different Reward Models are compared in two key scenarios. The Best-of-N sampling method first creates a pool of N potential outputs and then applies the Reward Model to select the single best candidate, balancing diversity and optimality of the outputs (a minimal selection sketch follows this list). Greedy Search generates the output by iteratively selecting the sequence with the highest score.
- For online Reinforcement Learning during training, TDRM is compared with mainstream methods on 5 challenging datasets (MATH-500, Minerva Math, OlympiadBench, AIME24, and AMC23). Based on the SimpleRL setup, Pass@1 with greedy decoding is used to evaluate final task performance.
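For reference, a minimal Best-of-N selection loop might look like the sketch below; the `generate` and `score` helpers are hypothetical placeholders for a sampling routine and a reward-model scorer.

```python
def best_of_n(prompt, generate, score, n=64):
    """Best-of-N sampling with a reward model as the selector.

    generate(prompt): samples one candidate response (hypothetical helper).
    score(prompt, response): scalar score from the reward model, e.g. the
    reward assigned to the final step by a process reward model.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```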
1. Reward Modeling
By observing the Best-of-N sampling results of different models and datasets, the research team provided empirical evidence for the superiority of TDRM.
First, TDRM shows significantly better results than ScalarPRM and ScalarORM on the MATH-500 dataset when the sampling budget is increased from Best-of-128 to Best-of-1024. This indicates that TDRM is more reliable and can consistently identify the best answer even at higher sampling budgets.
Table | Test results for MATH-500; Best-of-128 results on GSM8K
In tree search evaluation, TDRM again shows better performance and provides more accurate validations of inference traces. In addition, the accuracy of TDRM increases with the number of search branches, indicating its effectiveness in navigating complex decision spaces.
Figure | Tree search results
2. Reinforcement Learning
On a limited dataset of only 2,500 MATH Level-3 prompts, TDRM outperformed mainstream baseline methods across the 8 base models and achieved the highest average accuracy, highlighting its reliability in Reinforcement Learning training.
By combining the verifiable reward with the process-based reward, TDRM delivers stable performance and better data efficiency, and it continues to learn even with limited training examples.
Table | Evaluation results on mathematical benchmarks after Reinforcement Learning training of 8 base models in 5 series
The above results show that integrating temporal consistency into the Reward Model not only helps improve the stability of RL training but also offers new possibilities for developing a more scalable RLHF process and achieving higher quality in inference.