
HKUST Proposes New Algorithm to Revolutionize Large Model Inference Paradigm: Random Strategy Valuation Becomes a "Magic Trick" for Mathematical Reasoning in LLMs

机器之心 | 2025-10-31 16:26
"Simplify rather than complicate" is the key to improving performance.

He Haoran, the first author of the paper, is a Ph.D. student at the Hong Kong University of Science and Technology. His research interests include reinforcement learning and foundation models, with the goal of eliciting superintelligence through experience and rewards. Ye Yuxiao, the co-first author, is a first-year Ph.D. student at the Hong Kong University of Science and Technology. The corresponding author is Pan Ling, an assistant professor in the Department of Electronic and Computer Engineering and the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology.

In the mathematical reasoning tasks of large language models (LLMs), reinforcement learning with verifiable rewards (RLVR) has become an important means of enhancing models' reasoning ability. However, mainstream methods such as PPO and GRPO still rely on policy-gradient learning objectives designed for traditional RL scenarios, which can essentially be characterized as generalized policy iteration, that is, a continuous cycle of policy evaluation and policy improvement. These methods often suffer from unstable training, loss of diversity, and complex parameter tuning.

So, is there a simpler and more fundamental solution for LLM reasoning tasks?

A research team jointly led by the Hong Kong University of Science and Technology, Step, and Kuaishou proposed a surprising answer: simply evaluating the value of a completely random policy is sufficient to find the optimal reasoning path. Building on this insight, they proposed ROVER (Random Policy Valuation for Diverse Reasoning), which overturns the traditional paradigm with a minimalist approach and skips the policy-improvement cycle of conventional RL-based reasoning.

ROVER not only significantly outperforms existing methods on multiple mathematical reasoning benchmarks but also achieves high-quality and high-diversity reasoning generation with "minimalism."

Currently, the paper, code, and model have all been open-sourced.

  • Paper link: https://arxiv.org/abs/2509.24981
  • Code link: https://github.com/tinnerhrhe/ROVER

In high-difficulty tasks such as AIME24, AIME25, and HMMT25, ROVER significantly improves pass@1 (+8.2) and pass@256 (+16.8) compared to traditional methods and reaches new heights in various diversity metrics (+17.6%). Moreover, ROVER does not require additional maintenance of a value network or a reference model for KL calculation, making it more lightweight.

The "Pain Points" of Traditional Reinforcement Learning: Complex Iteration and High Cost

In LLM reasoning optimization, mainstream methods (such as PPO and GRPO) can be characterized as Generalized Policy Iteration: repeatedly alternating between policy evaluation (computing the value of the current policy, e.g., estimating the advantage function) and policy improvement (updating the policy $\pi$ toward higher-value actions). Although these methods can improve performance, they share several core pain points:

  • Poor training stability: the optimization objective is non-stationary, and the model is prone to collapse. Recent work piles on complex techniques such as KL regularization, clipped importance sampling, and entropy monitoring; these "patches" make training precarious, and a single misstep can trigger entropy collapse (a sudden drop in policy diversity that leaves the model stuck in a single reasoning path).
  • High computational cost: PPO must maintain a separate value network to predict state values and repeatedly perform policy iteration, and methods like GRPO must also keep a reference model for KL computation. This "heavyweight" setup drives up the cost of RL optimization.
  • Loss of reasoning diversity: traditional reward-maximizing RL methods push the model to over-optimize single-attempt accuracy at the expense of exploration. The model ends up generating only a few reasoning paths, so pass@k (the ability to cover more feasible solutions across multiple attempts) quickly saturates.

The "Minimalist Revolution" of ROVER: The Q-value of a Random Policy is Sufficient to Guide Optimal Decisions

The research team first pointed out that LLM reasoning tasks can be modeled as a finite-horizon Markov decision process (MDP) with the following key characteristics:

  • Deterministic state transitions;
  • Tree structure (each state has a unique parent node, and subtrees rooted at different actions do not overlap);
  • Binary sparse rewards (correct/incorrect).

This is quite different from the complex settings such as random state transitions, cyclic graph structures, and intermediate rewards commonly found in traditional RL tasks (such as Atari games and robot control).
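To make this modeling concrete, the sketch below expresses a step of LLM reasoning as such a tree-structured MDP. It is only an illustration under the assumptions above; the names (`ReasoningState`, `step`, `verifier`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class ReasoningState:
    """A state is the prompt plus the tokens generated so far.

    Each state is a unique token prefix, so the state space is a tree:
    every state has exactly one parent (its prefix minus the last token),
    and subtrees rooted at different actions never overlap.
    """
    prompt: str
    generated: Tuple[int, ...]  # token ids emitted so far

def step(state: ReasoningState, action: int, max_len: int, eos_id: int,
         verifier: Callable[[ReasoningState], bool]) -> Tuple[ReasoningState, float, bool]:
    """Deterministic transition: appending a token fully determines the next state.

    The reward is binary and sparse: 1.0 only if the finished solution is
    verified correct, 0.0 otherwise; intermediate steps receive no reward.
    """
    next_state = ReasoningState(state.prompt, state.generated + (action,))
    done = action == eos_id or len(next_state.generated) >= max_len
    reward = 1.0 if (done and verifier(next_state)) else 0.0
    return next_state, reward, done
```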

"Are we using overly complex tools to solve a structurally simpler problem?" - This became the starting point of the ROVER research.

In this simple structure, the research team proved a subversive conclusion: The Q-value of a uniformly random policy directly points to the optimal policy.

Formally, let the environment be an MDP with a finite horizon, a tree-structured state space, and binary rewards. Let $\pi_u$ denote the uniformly random policy (each action is selected with probability $1/|\mathcal{A}|$), and let $Q^{\pi_u}$ be its Q-function. Then the greedy policy

$$\pi(s) = \arg\max_{a} Q^{\pi_u}(s, a)$$

is the optimal policy!

The proof is intuitive: in a tree structure, if the subtree reached by an action $a$ contains a correct answer, then $Q^{\pi_u}(s, a) > 0$; otherwise $Q^{\pi_u}(s, a) = 0$. Therefore, greedily selecting the action with the maximum $Q^{\pi_u}$ value is guaranteed to lead to a path containing the correct answer.
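A small numerical check on a toy tree helps build intuition for this claim. The tree layout and the location of the correct leaf below are invented purely for illustration; in the actual setting, the tree is the space of token sequences and the reward comes from a verifier.

```python
# Toy check: on a tree MDP with binary terminal rewards, acting greedily
# w.r.t. the Q-values of a *uniformly random* policy reaches a correct leaf.

# Each internal node maps to its children; leaves carry a binary reward.
children = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1", "b2", "b3"],
}
leaf_reward = {"a1": 0.0, "a2": 0.0, "b1": 0.0, "b2": 1.0, "b3": 0.0}

def v_uniform(node):
    """Value of the uniformly random policy: mean over children (no max)."""
    if node in leaf_reward:
        return leaf_reward[node]
    kids = children[node]
    return sum(q_uniform(node, k) for k in kids) / len(kids)

def q_uniform(node, child):
    """Q-value of choosing `child` at `node` under the uniform policy
    (deterministic transition, reward only at leaves)."""
    return v_uniform(child)

# Greedy action selection w.r.t. the uniform-policy Q-values.
node = "root"
while node not in leaf_reward:
    node = max(children[node], key=lambda c: q_uniform(node, c))
print(node, leaf_reward[node])  # b2 1.0 -- the correct leaf is reached
```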

Therefore, the policy learning process can be simplified as shown in the following figure.

The ROVER Algorithm: Three Simple Steps, No Iteration Required

(1) Q-value estimation:

ROVER computes the Q-values of state-action pairs under a uniformly random policy through a generalized Bellman equation, in which the usual max operator is replaced by a mean operator:

$$Q(s, a) = r(s, a) + \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} Q(s', a')$$

where $r(s, a)$ is the reward, $s'$ is the new state after taking action $a$, and $\mathcal{A}$ is the action space.
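As a sketch of what this mean-operator backup looks like as a regression target (a standalone illustrative function, not the paper's implementation, which reads the Q-values off the LLM itself):

```python
import numpy as np

def rover_bellman_target(reward: float, next_q_values: np.ndarray, done: bool) -> float:
    """Regression target for Q(s, a) under random-policy evaluation.

    Q-learning would bootstrap with max_{a'} Q(s', a'); evaluating the
    uniformly random policy bootstraps with the *mean* instead:
        target = r + (1 / |A|) * sum_{a'} Q(s', a').
    `next_q_values` holds the current Q estimates for every action at s'.
    """
    if done:
        return reward  # terminal step: binary reward, nothing to bootstrap
    return reward + float(np.mean(next_q_values))

# Example: zero intermediate reward, three candidate next actions.
print(rover_bellman_target(0.0, np.array([0.2, 0.8, 0.0]), done=False))  # ~0.333
```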

(2) Policy construction:

Although greedy selection can guarantee optimality, it may lead to a loss of diversity. To address this, ROVER introduces softmax sampling based on the Q-values:

$$\pi(a \mid s) \propto \exp\big(Q(s, a) / \tau\big)$$

where $\tau$ is a temperature coefficient that controls the degree of exploration. This approach not only retains the priority of high-value paths but also explores multiple effective reasoning routes, significantly improving pass@k performance.
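A minimal sketch of this temperature-controlled sampling, assuming the Q-values of the candidate actions are already available (the function and parameter names are illustrative):

```python
import numpy as np

def sample_action(q_values: np.ndarray, tau: float = 1.0, rng=None) -> int:
    """Sample an action from softmax(Q / tau) rather than taking the arg-max.

    Small tau -> near-greedy (exploitation); large tau -> closer to uniform
    (exploration), which preserves diverse reasoning paths and helps pass@k.
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = q_values / tau
    logits = logits - logits.max()   # subtract max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return int(rng.choice(len(q_values), p=probs))

# Example: the high-Q action is preferred but not always chosen.
print(sample_action(np.array([0.1, 0.9, 0.3]), tau=0.5))
```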

(3) Training objective:

In the actual implementation, ROVER internalizes the Q-function within the LLM's own parameters, eliminating the need to train an additional value network. This "self-supervised" parameterization lets the model learn relative improvement rather than absolute values, reducing computational cost and improving stability.

Group reward centering is also used to reduce variance: within each group of sampled rollouts, the group-mean reward is subtracted from each individual reward, i.e., $\tilde{r}_i = r_i - \frac{1}{N}\sum_{j=1}^{N} r_j$. This prevents high-variance rewards from interfering with Q-value learning. At the same time, the centered rewards are broadcast to