
TRM, the Thinking Reward Model, launches: the reasoning quality of large models can finally be quantified

量子位2026-06-24 14:43
The reasoning capabilities of large language models are becoming increasingly powerful, but if the final answer is correct, does that necessarily mean the thinking process is sound?


It's like two students both getting the right answer to the same math problem.

One has clean steps, compact derivations, and a smooth train of thought; the other takes a long detour with irrelevant derivations, skips key steps, but still manages to come up with the correct answer.

Judged by the result alone, both solutions are correct. But if you had to pick the problem-solving process more worth learning from, it is obviously the former.

Large-model reasoning faces a similar problem.

Before giving the final answer, the model often generates a reasoning trace of thousands or even tens of thousands of tokens: it contains exploration, reflection, and correction, as well as going around in circles, skipping steps, and "false proofs" that seem complete but don't stand up to scrutiny.

However, most evaluations and reward signals only focus on whether the final answer is correct, which quietly erases the "difference in thinking processes".

What kind of thinking process counts as good? How can a free-form reasoning chain be evaluated stably? Can this evaluation signal, in turn, help the model learn better reasoning methods?

To address these questions, a research team from the Shanghai AI Laboratory, Shanghai Jiao Tong University, and the Chinese University of Hong Kong proposed TRM (Thinking Reward Model):

Instead of just looking at whether the large model "gets the answer right", it directly scores the model's reasoning process, turning "thinking well" into a measurable, trainable, and optimizable ability.

Specifically, the team proposed a unified framework: it uses the ME² principle to characterize reasoning quality, uses DAG-based pairwise evaluation to recover the reasoning structure, and trains the Thinking Reward Model on this basis, turning "reasoning quality" from a subjective impression into a reusable reward signal.

Why is "whether the answer is correct" no longer sufficient?

In the past, many large-model evaluations focused mainly on whether the final answer was correct. For question-answering and coding problems, this approach is straightforward: a correct answer earns the score, and a wrong answer earns nothing.

However, for reasoning models, only looking at the answer misses a key question: How did the model get this answer?

When both models get the answer to a question right, one model may advance step by step along the main line, while the other may repeatedly restart the same train of thought, conduct a large number of ineffective checks, or even use wrong steps to support the correct conclusion.

Such low-quality reasoning not only increases the generation cost but also makes the model more likely to fail when the problem conditions change.

In reinforcement learning training, this problem is more obvious. If the reward only depends on the final answer, all reasoning chains that get the answer right will receive the same feedback. But beyond the answer, there is a need to further distinguish: Which reasoning chain is clearer, more compact, and more worth learning for the model? This is exactly the problem that TRM focuses on.

The overall framework of TRM: (a) propose the ME² principle, (b) abstract the complex reasoning structure with a DAG, (c) train the Thinking Reward Model and use it for Test-Time Scaling and RL.

ME² principle: What kind of thinking process is considered good?

To evaluate the reasoning quality, we first need to clarify what "good" means.

The paper decomposes reasoning quality along two orthogonal axes. By granularity, there are macro (overall structure) and micro (single-step content) levels; by goal, there are efficiency and effectiveness. Crossing the two axes yields four dimensions:

- Macro-Efficiency: Is the overall structure efficient? A good reasoning chain advances along the necessary branches, avoids repeatedly restarting the same train of thought, and does not run too many ineffective checks.

- Macro-Effectiveness: Is the overall structure effective? The main reasoning line should always revolve around the problem goal, the relationships between branches should be clear, and key arguments should be consistent.

- Micro-Efficiency: Is each step expressed concisely? Each step should have a clear purpose, such as calculation, verification, elimination, or induction, and should carry as little redundant content that does not affect the conclusion as possible.

- Micro-Effectiveness: Is the content of each step correct? Local calculations, symbol usage, and earlier and later conclusions need to be self-consistent, and wrong steps must not be used to support a correct answer.

These four dimensions decompose "which reasoning is better" into signals that can be annotated, compared, and trained, forming the cornerstone of the subsequent evaluation and optimization process.

DAG-based Evaluation: Making free-form reasoning structured

The model's reasoning chain is usually a long string of natural-language text. On the surface it unfolds in order, but the actual reasoning does not necessarily progress linearly. It may first advance along a main line, then open several branches in the middle, eliminate some possibilities, and finally merge the useful branches back together.

What's more troublesome is that a long text contains a large number of local details, which can easily drown out the structural signals that really matter. If the reasoning structure is not explicitly extracted, it is difficult for the evaluation model to judge reasoning chains consistently.

Therefore, the paper abstracts the free-form reasoning chain into a directed acyclic graph (DAG). It first cuts the original text into a series of atomic steps, takes each step as a node, and then adds edges according to the semantic dependencies between steps. In this way, the linear progression, branching, and merging in the reasoning chain become explicit.

The paper divides this process into three steps:

1. Step Partitioning: First, make a rough segmentation by paragraphs, then count the high-frequency starting words across a large number of trajectories and use them as more stable delimiters, yielding consistent and semantically meaningful step boundaries.

2. Reasoning Structuring: Traverse the reasoning steps in chronological order, use a large model to assign each step its semantic parent node, and build the edges incrementally; then merge adjacent nodes that form a purely linear run into super-nodes to obtain a compact DAG that clearly exposes progression, branching, and merging.

3. Pairwise Evaluation: Construct semantic abstractions according to the ME² principle, and then let the evaluation model state its relative preference between two reasoning chains based on these abstractions. The Macro and Micro granularities use different abstraction methods, and together they cover the four dimensions of the ME² principle.
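The structuring step can be illustrated with a minimal sketch. It assumes the parent-assignment pass has already produced, for every step, a list of earlier step indices it depends on; the `parents` format, the function name `merge_linear_chains`, and the exact merge criterion (a node with exactly one parent whose parent has exactly one child) are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict


def merge_linear_chains(parents):
    """Collapse purely linear runs of steps into super-nodes.

    parents[i] lists the indices of the steps that step i depends on
    (hypothetical output of an LLM parent-assignment pass). Parents must
    come earlier in the trace, which keeps the graph acyclic.
    """
    children = defaultdict(list)
    for node, node_parents in enumerate(parents):
        for p in node_parents:
            if p >= node:
                raise ValueError("a parent must precede its child")
            children[p].append(node)

    # group_of[i] is the index of the first step of i's super-node.
    group_of = list(range(len(parents)))
    for node, node_parents in enumerate(parents):
        # Assumed criterion: single parent, and that parent has a single child.
        if len(node_parents) == 1 and len(children[node_parents[0]]) == 1:
            group_of[node] = group_of[node_parents[0]]

    groups = defaultdict(list)
    for node, head in enumerate(group_of):
        groups[head].append(node)
    super_nodes = [groups[head] for head in sorted(groups)]

    # Keep only edges that cross super-node boundaries.
    edges = sorted({
        (group_of[p], group_of[c])
        for c, node_parents in enumerate(parents)
        for p in node_parents
        if group_of[p] != group_of[c]
    })
    return super_nodes, edges


# Steps 0 -> 1, then 1 branches into 2 and 3, which merge in 4, then 4 -> 5.
parents = [[], [0], [1], [1], [2, 3], [4]]
super_nodes, edges = merge_linear_chains(parents)
print(super_nodes)  # [[0, 1], [2], [3], [4, 5]]
print(edges)        # [(0, 2), (0, 3), (2, 4), (3, 4)]
```

In this toy trace, six steps shrink to four super-nodes while the branch and the merge stay visible, which is the kind of compact structure the evaluation model is meant to look at.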

In this way, the evaluation model doesn't have to just stare at a long text but can look along the reasoning structure: whether the main line is clear, whether the branches are necessary, and whether the local steps are concise and correct. The judgment obtained in this way is also more stable than directly looking at the original text.

Thinking Reward Model: Turning reasoning quality into a reusable reward signal

Based on the above evaluation framework, the research team constructed the TRM-Preference dataset. For each problem, the researchers first used multiple open-source reasoning models to generate candidate reasoning chains, and then used a rule-based validator to filter out trajectories with wrong answers, retaining only the samples whose final answers are correct.

In this way, the focus of subsequent comparisons shifts from "whether the answer is correct" to "when the answers are all correct, which reasoning chain is better".

Subsequently, the paper used DeepSeek-V3.2 to conduct pairwise evaluations on the DAGs along the four ME² dimensions. To reduce position bias, each evaluation is repeated with the two chains presented in both orders, and only preference labels that are consistent across the two orders and not tied are retained. The result is 103K training preference pairs plus 1.5K validation preference pairs, which together constitute the TRM-Preference dataset.

TRM is initialized from Llama-3.1-8B-Instruct, with the language-modeling head replaced by a scalar value head. After training on the TRM-Preference dataset, TRM outputs a scalar score for each reasoning chain: the higher the score, the better the chain matches the ME² definition of high-quality reasoning.

On the validation set, TRM achieved an accuracy of 88.6%, significantly better than two representative process reward model (PRM) baselines.
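The order-swapping filter can be sketched as follows. The `judge` callable and its return values (`"first"`, `"second"`, `"tie"`) are hypothetical stand-ins for the evaluation model; only the keep-if-consistent-and-not-tied rule comes from the description above.

```python
def debiased_preference(judge, chain_a, chain_b):
    """Query the judge in both presentation orders and keep stable labels only.

    judge(first, second) is a hypothetical callable returning "first",
    "second", or "tie". Returns "A", "B", or None when the pair is dropped.
    """
    forward = judge(chain_a, chain_b)   # A is shown first
    backward = judge(chain_b, chain_a)  # B is shown first

    # Translate position-based verdicts back to chain identities.
    forward_pick = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_pick = {"first": "B", "second": "A"}.get(backward, "tie")

    if forward_pick == backward_pick and forward_pick != "tie":
        return forward_pick
    return None  # tied or inconsistent across orders: discard the pair


def shorter_is_better(first, second):
    """Toy judge that is consistent regardless of order (illustrative only)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) < len(second) else "second"


def always_first(first, second):
    """Toy judge with pure position bias (illustrative only)."""
    return "first"


print(debiased_preference(shorter_is_better, "abc", "abcdef"))  # A
print(debiased_preference(always_first, "abc", "abcdef"))       # None
```

A judge that only ever favors the first position never survives the filter, which is why the retained labels are less sensitive to presentation order.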
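For readers unfamiliar with how a scalar reward model is scored on preference pairs, here is a minimal sketch. It assumes that accuracy means the fraction of validation pairs in which the preferred chain receives the strictly higher score, and it shows the Bradley-Terry pairwise loss that is commonly used to train scalar reward heads; the article does not confirm either detail for TRM, so both are assumptions.

```python
import math


def pairwise_accuracy(score_pairs):
    """Fraction of pairs where the preferred chain gets the higher score.

    score_pairs holds (score_of_preferred, score_of_other) tuples.
    Ties count as errors in this sketch (an assumption).
    """
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for preferred, other in pairs if preferred > other)
    return correct / len(pairs)


def bradley_terry_loss(score_preferred, score_other):
    """-log(sigmoid(score_preferred - score_other)), computed stably.

    A common reward-model objective, shown as an assumption rather than
    as the loss the paper actually uses.
    """
    x = score_other - score_preferred
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


pairs = [(0.9, 0.2), (0.1, 0.4), (1.5, 1.0), (0.3, 0.3)]
print(pairwise_accuracy(pairs))  # 0.5
```

The loss shrinks as the preferred chain's score pulls ahead of the other chain's score, which pushes the model toward the ordering that the accuracy metric rewards.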

Core finding 1: Answers obtained from high-quality reasoning chains are more reliable

TRM evaluates the quality of the reasoning chain, but this signal can also improve the accuracy of the final answer in turn.

At test time, TRM can be used for Best-of-N selection: the model generates multiple candidate reasoning chains for the same problem, and TRM selects the one with the highest quality score. Experiments show that as N increases, the results selected by TRM yield higher final accuracy.
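Best-of-N selection itself is a small routine. In the sketch below, `score_fn` stands in for the trained TRM, and the toy scorer that prefers shorter chains is purely illustrative.

```python
def best_of_n(candidates, score_fn):
    """Return the (reasoning_chain, final_answer) pair with the top score.

    candidates is a list of (reasoning_chain, final_answer) tuples sampled
    for one problem; score_fn maps a reasoning chain to a scalar and is a
    stand-in for the reward model.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda candidate: score_fn(candidate[0]))


def toy_score(chain):
    """Hypothetical scorer: fewer words means a higher score."""
    return -len(chain.split())


candidates = [
    ("try x = 1, no; try x = 3, no; back to x = 2, so 2x = 4", "4"),
    ("x = 2, so 2x = 4", "4"),
]
chain, answer = best_of_n(candidates, toy_score)
print(answer)  # 4
```

Note that the selector never looks at the final answer; it ranks candidates purely by the score of the reasoning chain.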

Core finding 2: Used as an RL reward, the model answers more accurately

During the training phase, TRM can also provide more fine-grained reward signals for reinforcement learning.

Traditional RLVR (reinforcement learning with verifiable rewards) usually only looks at whether the answer is correct. With TRM added, the model can go on to learn clearer and more efficient reasoning methods on top of getting the answer right.

Specifically, the paper uses the GRPO algorithm to combine the verifiable reward with the thinking reward given by TRM.

The key is a gating mechanism: TRM participates in reward shaping only when the answer is correct, and the reward for wrong trajectories is always 0, which prevents the model from learning bad habits from wrong trajectories.
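The gating idea can be sketched as below. The article does not give the combination formula, so the form `1 + weight * thinking_score`, the `weight` value, and the assumption that the thinking score lies in [0, 1] are all illustrative; only the rule that wrong answers always receive 0 comes from the text. The group normalization follows the usual GRPO recipe of subtracting the group mean and dividing by the group standard deviation.

```python
import statistics


def gated_reward(answer_correct, thinking_score, weight=0.5):
    """Combine the verifiable reward with a thinking reward, gated on correctness.

    Assumed form: 0 for a wrong answer, 1 + weight * thinking_score otherwise,
    with thinking_score assumed to lie in [0, 1]. Not the paper's formula.
    """
    if not answer_correct:
        return 0.0  # wrong trajectories never receive thinking reward
    return 1.0 + weight * thinking_score


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, a choice made here
    return [(r - mean) / (std + eps) for r in rewards]


# One group of rollouts for a single prompt: (answer_correct, thinking_score).
rollouts = [(True, 0.8), (True, 0.2), (False, 0.9), (False, 0.1)]
rewards = [gated_reward(ok, score) for ok, score in rollouts]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

With answer-only rewards the two correct rollouts would share one advantage; here the one with the higher thinking score is ranked above the other, while both wrong rollouts stay tied at the bottom regardless of their thinking scores.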

The experimental results show that this approach brings performance improvements in multiple models and tasks.

RL training results: as an auxiliary reward, TRM brings more stable performance improvements across multiple models and STEM/Math tasks.

Core finding 3: Not only are the answers more accurate, but the reasoning process is also better

However, a performance improvement does not necessarily mean that the reasoning process has become better. To verify this, the paper used DeepSeek-V3.2 to conduct pairwise comparisons, according to the ME² principle, of the reasoning chains generated under different training strategies.

The results show that on three base models, the policies trained with TRM achieve a higher win rate than the various baseline policies.

This shows that TRM makes the reasoning process generated by the model closer to clear, efficient, and reliable reasoning.

Win rate of reasoning quality under different training strategies. The blue dotted line marks a 50% win rate.

As large models move towards complex mathematical and scientific reasoning, agent planning, and long-horizon task execution, the importance of the reasoning process will continue to increase.

Future models need to not only get the answers right but also be better at organizing their thoughts, reducing ineffective branches, and grasping key steps.

The significance of TRM is that it turns "thinking well" from a subjective impression into a measurable, trainable, and optimizable ability.

Paper title: Characterizing, Evaluating, and Optimizing Complex Reasoning

Link: https://arxiv.org/abs/2602.08498

Code: https://github.com