
Tsinghua team creates TDRM: A smooth reward model based on temporal difference learning

Academic Headlines · 2025-10-09 16:51
Tsinghua University proposed the TDRM framework to enhance the temporal consistency of reward models and optimize reinforcement learning training.

Reward models (RMs) play a central role both in reinforcement learning (RL) for large language models (LLMs) and in inference-time verification, and have demonstrated strong performance in tasks such as mathematical problem-solving, code generation, and instruction following.

However, existing reward models have a critical flaw: a lack of temporal consistency, which leads to ineffective policy updates and unstable reinforcement learning training.

Specifically, the reward for a step in an LLM's reasoning trajectory is often independent of the rewards of adjacent steps, producing inconsistent and misleading signals during training and offering little effective guidance at inference time. These problems are especially pronounced in long chain-of-thought (CoT) scenarios: the model receives no reward until it has completed a long series of reasoning steps, making it extremely difficult to judge which steps are useful and which are redundant.

To address this pain point, a team from Tsinghua University, in collaboration with the California Institute of Technology, proposed the TDRM framework, which minimizes temporal differences (TD) during training to learn a smoother and more reliable reward model.

It is worth mentioning that all code, data, and language model checkpoints have been open-sourced on GitHub.

Paper link: https://arxiv.org/abs/2509.15110

GitHub address: https://github.com/THUDM/TDRM

The results show that process reward models (PRMs) trained with TD achieve performance gains of up to 6.6% in Best-of-N sampling and 23.7% in tree search.

Furthermore, when combined with reinforcement learning with verifiable rewards (RLVR), TD-trained process reward models enable more data-efficient reinforcement learning, matching a baseline that requires 50.1k training examples while using only 2.5k, and yield higher-quality language model policies on 8 model variants, including Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, and GLM-Z1-9B-0414.

Building a smoother and more reliable reward model

Unlike previous methods that use temporal differences to build offline datasets of intermediate reward signals, TDRM applies temporal-difference learning to construct a reliable reward model for reinforcement learning training, yielding a smoother reward space and denser reward signals.

According to the paper, the TDRM framework consists of the following three core modules:

  • Process reward model: a process reward model is trained with n-step temporal-difference learning combined with reward shaping.
  • Reinforcement learning: online reinforcement learning is carried out under the guidance of the trained process reward model to optimize policy updates.
  • TDRM integration: the process reward and the verifiable reward are linearly combined and applied to Actor-Critic-style online reinforcement learning across policy models of different families and scales.

Figure | Schematic diagram of the overall framework of TDRM

The temporal-difference method iteratively refines policy value estimates by exploiting the interdependence between states. Specifically, the n-step temporal-difference algorithm aggregates the rewards and value estimates of the subsequent n states and discounts future rewards with an exponential decay factor, which both encourages the agent to obtain early rewards promptly and balances short-term gains against long-term behavioral consequences.
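To make the n-step idea concrete, here is a minimal Python sketch of an n-step TD target; the reward values, discount factor, and bootstrapped state value are illustrative placeholders rather than the paper's exact quantities.

```python
from typing import List

def n_step_td_target(rewards: List[float], next_value: float, gamma: float = 0.95) -> float:
    """n-step TD target: discounted sum of the next n rewards plus a
    discounted bootstrap from the value estimate of the state n steps ahead."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r          # earlier rewards are discounted less
    target += (gamma ** len(rewards)) * next_value
    return target

# Example: three intermediate step rewards, bootstrapping from a later state's value.
print(n_step_td_target([0.1, 0.0, 0.2], next_value=0.6, gamma=0.95))
```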

1. Smoothness

Smoothness is an important property of effective reward modeling during inference: it reflects the consistency and stability of value updates across intermediate steps, ensuring that small changes in the reasoning trajectory do not cause disproportionate deviations in value estimates. To evaluate smoothness, the team used two complementary diagnostics to compare ScalarPRM and TDRM.

  • Local Lipschitz constant: quantifies the sensitivity of the reward to changes in adjacent states. The analysis shows that TDRM has a smaller average Lipschitz constant between consecutive steps (0.2741) than ScalarPRM (0.3331), indicating smoother reward transitions and better temporal consistency;
  • TD error: the TD error between consecutive reasoning steps, combined with the value differences between steps, evaluates the continuity and consistency of the estimated value function along two dimensions (a minimal sketch of both diagnostics follows this list).
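As a toy illustration of the two diagnostics, the sketch below computes a simplified local Lipschitz estimate (treating consecutive steps as unit distance apart) and one-step TD errors over a made-up sequence of per-step value estimates; the numbers, discount factor, and the zero-intermediate-reward assumption are illustrative only.

```python
from typing import List

def local_lipschitz(step_values: List[float]) -> float:
    """Average absolute change in the reward/value estimate between
    consecutive reasoning steps (smaller means smoother transitions)."""
    diffs = [abs(b - a) for a, b in zip(step_values, step_values[1:])]
    return sum(diffs) / len(diffs)

def td_errors(step_values: List[float], final_reward: float, gamma: float = 0.95) -> List[float]:
    """One-step TD errors delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    assuming zero intermediate rewards and an outcome reward at the end."""
    errors = [gamma * step_values[t + 1] - step_values[t] for t in range(len(step_values) - 1)]
    errors.append(final_reward - step_values[-1])   # terminal step
    return errors

scores = [0.42, 0.45, 0.44, 0.50, 0.58]             # made-up per-step estimates
print(local_lipschitz(scores))
print(td_errors(scores, final_reward=1.0))
```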

Figure | Comparison of the smoothness of reward models

However, previous studies have shown that CoT length does not always grow stably during LLM inference. Combining this with the smoothness analysis above, the research team argues that reward shaping is the key mechanism for stabilizing this emergent length-scaling behavior.

2. Reward modeling

According to the paper, in the TD-based PRM framework, reward shaping serves two purposes: it refines the temporal-difference updates by providing structured feedback, and it dampens the volatility of reward signals across different reasoning lengths. It comprises the following pieces (a minimal sketch follows the list):

  • Cosine reward: a cosine-based reward function scores each reasoning step according to its correctness and relative length. It assigns different reward ranges to correct and incorrect steps; the reward starts at the maximum value and gradually decays to the minimum as the reasoning length approaches the maximum length.
  • Temporal difference: the computed cosine reward is plugged into the temporal-difference framework to update the process reward model.
  • TD(λ): compared with n-step temporal difference, TD(λ) is a more flexible online algorithm. Owing to its online nature, TD(λ) lets the process reward model propagate information to earlier states immediately after a reward is observed.
  • Loss function: the process reward model is optimized with a cross-entropy loss that uses the clamped temporal-difference target as the soft label for each reasoning step, enabling the model to learn from the temporal consistency of rewards.
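Below is a hedged sketch of how these pieces could fit together: a cosine-shaped per-step reward, a clamped one-step TD target, and a cross-entropy loss that uses that target as a soft label. The reward ranges, discount factor, and clamping interval are assumptions made for illustration, not the paper's hyperparameters.

```python
import math
import torch
import torch.nn.functional as F

def cosine_reward(step_idx: int, max_len: int, correct: bool) -> float:
    """Decay from a maximum to a minimum reward as the step index approaches
    max_len; correct and incorrect steps get different (assumed) ranges."""
    r_min, r_max = (0.5, 1.0) if correct else (-1.0, -0.5)   # assumed ranges
    frac = min(step_idx / max_len, 1.0)
    return r_min + 0.5 * (r_max - r_min) * (1.0 + math.cos(math.pi * frac))

def td_soft_label(reward: float, next_value: float, gamma: float = 0.95) -> float:
    """One-step TD target clamped to [0, 1] so it can act as a soft label."""
    return min(max(reward + gamma * next_value, 0.0), 1.0)

# One training step for a single reasoning step: the PRM emits a logit, and the
# clamped TD target serves as the soft label in a binary cross-entropy loss.
logit = torch.tensor([0.3], requires_grad=True)              # stand-in PRM output
target = torch.tensor([td_soft_label(cosine_reward(3, 10, correct=True), next_value=0.7)])
loss = F.binary_cross_entropy_with_logits(logit, target)
loss.backward()
print(float(loss))
```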

3. Reinforcement learning

On the reinforcement learning side, the team designed an online algorithm that dynamically computes the temporal-difference target from state values estimated on the fly during training. Unlike offline algorithms that rely on pre-computed state values, this approach adapts to changing trajectories and uses observed trajectories to estimate the state values of unobserved ones. This adaptability yields more accurate value predictions, thereby enhancing the consistency and robustness of the reward model.

Figure | Processing process of the TDRM algorithm

In TDRM, the verifiable reward and the process-based reward are linearly combined to exploit their complementary strengths. The combined reward signal is then plugged into the GRPO objective, improving the overall performance and data efficiency of training.
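A minimal sketch of that linear combination is shown below; the mixing weight alpha and the stub scoring functions are assumptions, and the GRPO update itself is omitted.

```python
from typing import Callable

def combined_reward(
    response: str,
    reference_answer: str,
    prm_score: Callable[[str], float],    # process reward model score, assumed in [0, 1]
    verify: Callable[[str, str], bool],   # rule-based checker, e.g. exact match
    alpha: float = 0.5,                   # assumed mixing weight
) -> float:
    r_verifiable = 1.0 if verify(response, reference_answer) else 0.0
    r_process = prm_score(response)
    return alpha * r_verifiable + (1.0 - alpha) * r_process

# Usage with stub scorers; in practice the combined reward feeds the GRPO objective.
print(combined_reward(
    "x = 4", "x = 4",
    prm_score=lambda s: 0.8,
    verify=lambda a, b: a.strip() == b.strip(),
))
```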

For more technical details, please refer to the paper.

What's the real-world effect?

To verify the effectiveness of TDRM, the research team evaluated it in two scenarios: inference-time verification and online reinforcement learning during training.

  • For inference-time verification, different reward models were compared under two key settings (see the sketch after this list). Best-of-N sampling first generates a pool of N candidate outputs and then applies the reward model to pick a single best candidate, balancing the diversity and optimality of the outputs. Greedy search generates the output by iteratively selecting the highest-scoring sequence.
  • For online reinforcement learning during training, TDRM was compared with mainstream methods on 5 challenging datasets (MATH-500, Minerva Math, OlympiadBench, AIME24, and AMC23). Following the SimpleRL setup, Pass@1 with greedy decoding was used to evaluate final task performance.
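As a toy illustration of the Best-of-N setting described above, the sketch below samples N candidates and keeps the one the reward model scores highest; `generate` and `rm_score` are hypothetical stand-ins for a real policy model and the trained reward model, not the paper's implementation.

```python
import random
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],        # samples one candidate solution
    rm_score: Callable[[str, str], float], # reward-model score for (prompt, candidate)
    n: int = 8,
) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: rm_score(prompt, c))

# Toy usage with stub functions standing in for the LLM and the reward model.
print(best_of_n(
    "Solve: 2 + 2 = ?",
    generate=lambda p: random.choice(["3", "4", "5"]),
    rm_score=lambda p, c: 1.0 if c == "4" else 0.0,
))
```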

1. Reward modeling

Best-of-N sampling results across different models and datasets provide empirical evidence for TDRM's advantage.

First, on the MATH-500 dataset, TDRM significantly outperformed ScalarPRM and ScalarORM as the sampling budget increased from Best-of-128 to Best-of-1024, showing that TDRM is more reliable and keeps identifying the best response as the sampling budget grows.

Table | Test results on MATH-500; Best-of-128 results on GSM8K

In the tree-search evaluation, TDRM again demonstrated superior performance, providing more accurate verification of reasoning trajectories. Moreover, TDRM's accuracy increases with the number of search branches, reflecting its effectiveness in navigating complex decision-making spaces.

Figure | Tree search results

2. Reinforcement learning

TDRM outperformed mainstream baselines across 8 models using a limited dataset of only 2,500 MATH Level-3 prompts, achieving the highest average accuracy and highlighting its reliability in reinforcement learning training.

By combining verifiable rewards with process-based rewards, TDRM maintains stable performance and better data efficiency, and can keep learning even with a limited number of training samples.

Table | Evaluation results on mathematical benchmarks after reinforcement learning training on 8 base models in 5 series

These results indicate that integrating temporal consistency into reward models not only improves the stability of RL training, but also opens the way to more scalable RLHF pipelines, higher-quality inference-time search, and broader application of LLMs to complex, compositional-goal scenarios.

This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao), author: Xiaoyu. It is published by 36Kr with authorization.