Tackling the challenges of LLM reasoning, a team from Tsinghua University has proposed ReST-RL, a new unified reinforcement learning paradigm for large language models (LLMs).
Can large language models (LLMs) really reason? The industry is divided on this issue.
This is because current LLMs often "stumble" when faced with complex code, multi-step logic, and abstract tasks, exhibiting problems such as logical leaps, disordered steps, and irrelevant answers.
Relying on human teaching? It's too slow. Relying on rewards? The signals are too weak. Relying on verification? The data is too expensive. Balancing reasoning ability, training efficiency, and universality has become a challenge in the industry.
In response to these challenges, the Knowledge Engineering Group (KEG) of the Department of Computer Science and Technology at Tsinghua University proposed a new unified paradigm for LLM reinforcement learning (RL): ReST-RL. The method combines an improved GRPO algorithm with a carefully designed test-time decoding method assisted by a value model (VM), enhancing the reasoning ability of LLMs while also taking efficiency, stability, and scalability into account.
Paper link: https://arxiv.org/abs/2508.19576
Experimental results show that on well-known programming benchmarks of varying difficulty, such as APPS, BigCodeBench, and HumanEval, ReST-RL outperforms other reinforcement learning baselines (such as the original GRPO and ReST-DPO) as well as decoding and verification baselines (such as PRM-BoN and ORM-MCTS).
This indicates that ReST-RL has great potential for enhancing the reasoning ability of LLM policies and offers a new direction for RL-based training of LLMs.
Existing RL methods struggle to achieve true reasoning
More and more studies have shown that RL can improve the reasoning ability of LLMs, and this direction has become a current research hotspot.
Some methods use online RL, where data sampling and model updates are performed simultaneously. A representative method is Group Relative Policy Optimization (GRPO). Other methods advocate obtaining training data through offline sampling and screening mechanisms. This paradigm is usually called self-training, and its representative method is Reinforced Self-Training (ReST). Despite different training mechanisms, both types of methods can effectively improve the reasoning ability of LLMs.
Reward models (RMs) are receiving increasing attention due to their important role in output verification. Existing studies have shown that an outcome reward model (ORM) that verifies the final output of an LLM can improve reasoning accuracy. Multiple process reward models (PRMs) have also been used to provide feedback for intermediate steps, and their verification effect is better than that of ORM.
However, these methods still have limitations. On the one hand, online RL algorithms represented by GRPO often produce unsatisfactory training results because the reward signals within a sampled group differ too little to yield useful advantages. Although some studies have tried to alleviate this problem by designing step-level rewards or introducing simple dynamic sampling mechanisms, these fixes usually bring higher computational costs and poorer generalization, and they make the RL algorithm more complex. On the other hand, although PRMs verify outputs better than ORMs, training them usually relies on high-quality labeled data. Because such annotation is expensive, it is difficult to scale up PRM training data, which limits their accuracy and reliability.
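To see concretely why flat reward signals stall GRPO-style training, consider the group-relative advantage computed from a batch of sampled solutions. The snippet below is a minimal sketch (not the authors' code) of that normalization: when every solution for a prompt receives the same reward, all advantages collapse to zero and the policy update carries no learning signal.

```python
# Minimal sketch: group-relative advantages as used in GRPO-style training.
# When every sampled solution for a prompt gets the same reward, the
# advantages collapse to zero and the update provides no gradient signal.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a sampled group: A_i = (r_i - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # all zeros -> no signal
print(group_relative_advantages([0.0, 0.2, 0.8, 1.0]))  # informative advantages
```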
Some studies have proposed estimating and collecting process rewards through Monte Carlo simulations. However, these methods are difficult to generalize to more complex reasoning tasks, and their dependence on the result matching mechanism also limits their scope of application.
Overall, existing methods struggle to achieve a comprehensive balance among data collection cost, generalization ability, reinforcement effect, and training efficiency.
ReST-RL: Dual optimization of training and reasoning
ReST-RL provides new possibilities for solving the problems of training reward differences and PRM accuracy. The method consists of two main parts: ReST-GRPO (a reinforced self-training method based on group relative policy optimization) and VM-MCTS (Monte Carlo tree search based on a value model).
Figure | ReST-RL framework
ReST-GRPO uses an optimized ReST procedure to drive GRPO training, strengthening the policy on complex reasoning tasks. The method uses the policy itself to filter and assemble training data, which effectively alleviates GRPO's reward-failure problem and improves the policy's ability to generate reliable reasoning trajectories.
The solutions output by the LLM and their corresponding rewards carry rich information about the policy's strengths and weaknesses in the target task domain, and this information can be used to filter out uninformative training data.
The research team used the standard deviation of rewards to measure their diversity: prompts for which the standard deviation across all sampled solutions fell below a preset threshold σ₀ were removed from the training set. Training then focused on high-reward solution trajectories, and their partial solution states were used to construct new training data.
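As a rough illustration of this filtering step, the sketch below (with hypothetical helper names, not the authors' implementation) drops prompts whose sampled solutions have a reward standard deviation below σ₀ and keeps the highest-reward trajectories, whose partial solution states then seed new training data.

```python
# Minimal sketch of the reward-diversity filter described above.
# `sample_solutions` and `reward_fn` stand in for the policy's sampler and the
# task's reward function; both are hypothetical callables.
import numpy as np

def prefix_states(solution: str):
    """Cumulative line prefixes of a solution, used as partial solution states."""
    lines = solution.splitlines()
    return ["\n".join(lines[:i]) for i in range(1, len(lines))]

def filter_and_build(prompts, sample_solutions, reward_fn, sigma_0=0.1, keep_top_k=2):
    new_data = []
    for prompt in prompts:
        solutions = sample_solutions(prompt)                 # policy rollouts
        rewards = np.array([reward_fn(prompt, s) for s in solutions])
        if rewards.std() < sigma_0:                          # reward signal too flat
            continue                                         # drop this prompt
        for i in np.argsort(rewards)[-keep_top_k:]:          # high-reward trajectories
            for state in prefix_states(solutions[i]):        # partial solution states
                new_data.append((prompt, state))             # new training example
    return new_data
```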
Compared with ordinary GRPO, ReST-GRPO can significantly increase the reward variance during the training process.
Figure | Distribution of the standard deviation of group rewards during policy training.
VM-MCTS is used for decoding at test time. Here, the value model (VM) plays a role similar to a PRM: it not only provides verification signals but also guides the LLM policy toward more promising reasoning paths. The VM's value target evaluates the entire partial state up to and including the last step, rather than a single action or step; it naturally reflects how likely the policy is to reach a high-reward final state from the current partial state, and can therefore assist the policy during decoding.
To collect training data for the VM, the team used MCTS to balance exploration of different reasoning paths against exploitation of high-potential intermediate states. Once enough value targets have been collected, the VM can be trained to predict the values of arbitrary states.
A VM trained in this way can accurately predict the expected reward of a partial state under the current policy. The decoding algorithm then uses these value estimates to decide which paths to explore and expand, improving both the efficiency and the accuracy of the search.
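The sketch below shows the core idea of value-guided decoding in a deliberately simplified, best-first form rather than full MCTS; `propose_steps`, `value_model`, and `is_complete` are hypothetical callables standing in for the policy's step generator, the trained VM, and a completion check.

```python
# Minimal sketch of value-guided decoding (a best-first simplification of
# VM-MCTS, not the authors' implementation). Partial states sit in a priority
# queue ordered by the VM's predicted value; the highest-value state is
# expanded first, so the search budget goes to the most promising paths.
import heapq

def value_guided_search(prompt, propose_steps, value_model, is_complete,
                        budget=64, branch=4):
    frontier = [(-value_model(prompt), prompt)]        # max-heap via negated value
    best_complete = None
    for _ in range(budget):
        if not frontier:
            break
        neg_v, state = heapq.heappop(frontier)
        if is_complete(state):                          # finished solution reached
            if best_complete is None or -neg_v > best_complete[0]:
                best_complete = (-neg_v, state)
            continue
        for next_state in propose_steps(state, k=branch):   # policy proposes next steps
            heapq.heappush(frontier, (-value_model(next_state), next_state))
    return best_complete[1] if best_complete else None
```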
The research team verified the effectiveness of the proposed RL paradigm and its components through extensive experiments on coding problems, showing that ReST-RL not only enhances the reasoning ability of LLM policies but also strikes a good balance among efficiency, cost, and generalization.
The results show that ReST-RL and its components outperform other reinforcement learning baselines (such as the original GRPO and ReST-DPO) as well as decoding and verification baselines (such as PRM-BoN and ORM-MCTS).
A comparison under the same number of training steps shows that ReST-GRPO trains more efficiently than the original GRPO and DAPO.
Under the same decoding and verification budget, VM-MCTS and its VM verify more accurately than a Math-Shepherd-style PRM trained on public data or an ORM.
Figure | Training efficiency and verification accuracy under a fixed budget.
Limitations and future directions
Although various experiments have proven the effectiveness of ReST-RL, this method still has certain limitations.
For example, the method has not yet been validated on tasks beyond code reasoning, such as mathematical or commonsense reasoning. Although the framework is not limited to code tasks, applying it to other scenarios may require redesigning suitable reward mechanisms and experimental hyperparameters.
In addition, the specific impact of some experimental settings on the final results has not been systematically analyzed.
The research team also notes that the accuracy of the value model on out-of-domain tasks has not been sufficiently studied, and that future work will further explore the generalization ability of ReST-RL across a wider range of tasks.
This article is from the WeChat official account "Academic Headlines" (ID: SciTouTiao), compiled by Xiaoyang, and published by 36Kr with permission.