Upending the post-training of large models, Danqi Chen's team has proposed "Reinforcement Learning with Model-rewarded Thinking" (RLMT).
In daily life, when humans tackle open-ended tasks such as writing an email, drafting an outline, or planning a meal, they typically organize their thoughts before they start. This capacity for deliberate reasoning, which psychologist and Nobel laureate in economics Daniel Kahneman called "System 2" thinking, is a core characteristic of human intelligence.
Although Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning ability of Large Language Models (LLMs) by using rule-based rewards in verifiable domains such as mathematics and code, its generalization ability in open-ended tasks remains limited.
In a recent study, a team led by Associate Professor Danqi Chen from Princeton University achieved a breakthrough by transferring the reasoning ability from verifiable domains to general chat scenarios.
Concretely, they proposed the "Reinforcement Learning with Model-rewarded Thinking" (RLMT) framework, which has the LLM generate a long chain of thought (CoT) before responding and then optimizes the model through online RL with a preference-based reward model.
According to the paper, the 8B model trained with RLMT outperformed GPT-4o in chat and creative writing and was comparable to Claude-3.7-Sonnet (Thinking). Meanwhile, using only 7K prompts, the Llama-3.1-8B base model trained with RLMT outperformed the Llama-3.1-8B-Instruct model, which was trained through a complex multi-stage process with over 25M examples.
Paper link: https://arxiv.org/abs/2509.20357
The research team stated that these results should prompt a reevaluation of the post-training pipeline and called for future research to more fully understand and exploit thinking ability.
RLMT: A Training Framework Integrating Two Paradigms
To understand the innovation of the RLMT framework, we need to first identify the two major pain points in current language model training:
On one hand, although Reinforcement Learning from Human Feedback (RLHF) can align with human preferences, it treats the model output as a single entity and lacks explicit reasoning guidance.
On the other hand, while RLVR can elicit long CoTs from the model through rule-based rewards in domains such as mathematics and code, it generalizes poorly to broader reasoning problems and chat benchmarks, where there is no single clear "correct answer".
The RLMT framework keeps RLVR's pattern of first generating a reasoning trajectory and then producing the answer, while adopting RLHF's preference-based reward model, so that the model learns to "think" on open-ended tasks.
Specifically, RLMT has the language model generate a detailed reasoning trajectory before producing the final response, and then optimizes the entire "reasoning + response" output through online reinforcement learning, such as the GRPO algorithm, against a preference reward model.
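To make the recipe concrete, here is a minimal, hypothetical sketch of one RLMT update step under a GRPO-style objective. The `policy` and `reward_model` interfaces and the <think> tags are illustrative assumptions, not the authors' implementation; the structure simply mirrors the description above: sample several "reasoning + response" completions per prompt, score them with the preference reward model, and update with group-normalized advantages.

```python
# Hypothetical sketch of one RLMT update step with a GRPO-style objective.
# `policy.sample` is assumed to return a completion ("<think>...</think> answer")
# with its summed log-probability; `reward_model.score` returns a scalar preference score.
from dataclasses import dataclass
from typing import List
import statistics

@dataclass
class Completion:
    text: str       # "<think> reasoning </think> final response"
    logprob: float  # sum of token log-probs under the current policy

def rlmt_grpo_step(policy, reward_model, prompt: str, group_size: int = 8) -> None:
    # 1. Sample a group of long-CoT completions for the same open-ended prompt.
    group: List[Completion] = [policy.sample(prompt) for _ in range(group_size)]

    # 2. Score each completion with the preference reward model (whether the
    #    hidden reasoning is included in the scored text is left open here).
    rewards = [reward_model.score(prompt, c.text) for c in group]

    # 3. GRPO-style advantages: normalize rewards within the group.
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]

    # 4. Policy-gradient-style update on the whole "reasoning + response" sequence.
    loss = -sum(a * c.logprob for a, c in zip(advantages, group)) / group_size
    policy.update(loss)
```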
Figure | Training a language model with a long chain of thought through reinforcement learning and a reward model, so that it can handle diverse general user prompts. Compared with RLHF, RLMT lets the model think; compared with RLVR, it extends to broader, more open-ended tasks.
Figure | An example reasoning trajectory generated by an LM trained with RLMT for an open-ended query.
To achieve this goal, the team carefully designed three key aspects:
For the training algorithm, the team tested three widely used preference- and policy-optimization algorithms: DPO, PPO, and GRPO. Although the best model was trained with GRPO, RLMT still outperformed traditional RLHF even when using DPO or PPO, and the models under all settings beat the baseline.
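For illustration, here is a minimal, hypothetical sketch of how the same reward model could drive a DPO-style variant of RLMT: sample several "thinking + response" completions and build a (chosen, rejected) pair from the best- and worst-scored ones. The `policy.sample` and `reward_model.score` interfaces are assumptions, not the authors' code.

```python
# Hypothetical sketch: turning reward-model scores into a DPO preference pair.
# `policy.sample(prompt)` is assumed to return a full "thinking + response" string,
# and `reward_model.score(prompt, text)` a scalar preference score.
def build_dpo_pair(policy, reward_model, prompt: str, group_size: int = 4) -> dict:
    completions = [policy.sample(prompt) for _ in range(group_size)]
    ranked = sorted(completions, key=lambda text: reward_model.score(prompt, text))
    # Worst-scored completion becomes "rejected", best-scored becomes "chosen".
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```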
For the reward model, the team selected Skywork-v1-Llama-3.1-8B-v0.2, which performs strongly on both reward benchmarks and downstream applications. Subsequent experiments showed that a powerful reward model is crucial for RLMT: the strength of the reward model sets the performance ceiling, yet RLMT outperformed RLHF under reward models of different strengths.
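As a rough illustration, below is a minimal sketch of scoring a model response with such a reward model through Hugging Face transformers. The checkpoint id is assumed to be the public Skywork release, and whether the hidden reasoning is stripped before scoring is a detail of the setup not pinned down here.

```python
# Minimal sketch of scoring a response with a preference reward model.
# The model id below is an assumed public checkpoint, not necessarily the paper's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"  # assumed HF id
tokenizer = AutoTokenizer.from_pretrained(model_id)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def score(prompt: str, response: str) -> float:
    # Format the exchange with the chat template, then read the single logit
    # as a scalar preference score (higher = preferred).
    conv = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    text = tokenizer.apply_chat_template(conv, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt").to(reward_model.device)
    with torch.no_grad():
        return reward_model(**inputs).logits[0][0].item()
```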
For the prompt set, the team set aside datasets dominated by math problems and jailbreak prompts and instead chose the WildChat-IF subset of Tülu 3: roughly 7.5K real user conversation prompts drawn from the WildChat platform, covering everyday scenarios such as casual chat and creative writing and better reflecting actual usage.
Meanwhile, RLMT supports two training modes. It can be warm-started with supervised fine-tuning (SFT), using Gemini 2.5 Flash or GPT-4.1-mini to generate prompt-response pairs containing reasoning trajectories; or it can be applied directly to base models without any post-training (the "zero" setting), eliciting reasoning behavior through nothing more than a fixed instruction prefix.
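To make the zero setting concrete, here is a minimal sketch of such a fixed instruction prefix. The wording and the <think> tags are illustrative assumptions; the paper's exact prefix is not reproduced here.

```python
# Hypothetical zero-setting prompt builder: no SFT warm start, just a fixed
# instruction prefix that asks the base model to reason before answering.
THINK_PREFIX = (
    "First reason about the request inside <think> ... </think>, "
    "then write the final answer after the closing tag.\n\n"
)

def build_zero_prompt(user_prompt: str) -> str:
    # Pre-open the <think> tag so the base model's continuation starts with reasoning.
    return f"{THINK_PREFIX}User: {user_prompt}\nAssistant: <think>"

print(build_zero_prompt("Plan a three-course dinner for six guests."))
```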
Experimental Verification: Small Models Can Outperform Large Ones, and Zero Training Can Also Work
To verify the effectiveness of RLMT, the team carried out 40 training runs on the base and instruct versions of two model families, Llama-3.1-8B and Qwen-2.5-7B, evaluating on 7 benchmarks covering chat, creative writing, knowledge Q&A, and other tasks, with RLHF models trained without a reasoning process under the same settings as controls.
The results were striking: RLMT models came out ahead across the board. Thinking models trained with RLMT consistently outperformed their non-thinking counterparts by 1.5 to 4 points on average across benchmarks. The advantage was largest on the core chat benchmarks, where the gap over the baseline reached 3 to 8 points, and the RLMT models also generally performed better on creative writing and factual Q&A.
Table | Test results of GRPO models trained on Llama-3.1-8B and Qwen2.5-7B under warm-start and zero-training settings.
More notably, the small model beat much larger ones. Llama-3.1-8B-Instruct-RLMT scored 50.4 on WildBench, surpassing not only models with roughly 10 times more parameters, such as Llama-3.1-70B-Instruct and Qwen2.5-72B-Instruct, but also GPT-4o.
Table | Comparison of Llama-3.1-8B-Instruct-RLMT with strong open-source and closed models, including GPT-4o and Claude-3.7-Sonnet.
Even without the complex SFT stage, RLMT still substantially improved the base model. Taking Llama-3.1-8B as an example, the zero-setting model Llama-3.1-8B-RLMT-Zero reached an average chat score of 15.6, 5.5 points higher than Llama-3.1-8B-Instruct, which was trained with multi-stage fine-tuning on over 25 million samples. Qwen2.5-7B-RLMT-Zero likewise outperformed Qwen2.5-7B-Instruct outright.
Table | Results of warm-start and zero-training DPO/PPO.
Ablation experiments further revealed the key ingredients behind RLMT's success: prompt quality, reward model strength, and the reasoning process are all indispensable. Models trained on real conversation prompts scored 5 to 7 points higher than those trained on simple prompts or on prompt sets dominated by math problems. A strong reward model improved the model's chat ability while preserving its performance on non-chat tasks, whereas a weak reward model caused an overall decline; even so, RLMT still outperformed RLHF in that setting, showing that the value of "letting the model think" does not depend on a specific reward model.
Table | Ablation experiments on the GRPO prompt mixture, SFT data source, and reward model.
Enabling Models to Think More Smartly
Through qualitative and quantitative analysis, the team found that RLMT not only improved the model performance but also fundamentally changed the way it "thinks".
Figure | Left: feature-level win-rate comparison between SFT and GRPO models; Right: example reasoning behavior.
In terms of reasoning style, the SFT model plans like a "linear list": given a task, it splits it into sections and subsections and works through them in order. The RLMT model shows a more complex, more human-like pattern: it first carefully enumerates the task constraints and core sub-topics, then groups scattered ideas by theme, and finally iterates to refine the details. More notably, the RLMT model also "looks back": it revisits and adjusts earlier content during the later stages of planning, for example cross-referencing points raised earlier so that the overall logic is more coherent.
Figure | As training progresses, Llama-3.1-8B-RLMT-Zero takes longer to think and answer.
This change in thinking style also shows up in reasoning length. During training, the reasoning trajectories and final responses generated by the RLMT model grew steadily longer. Taking Llama-3.1-8B-RLMT-Zero as an example, the reasoning portion grew from fewer than 200 tokens early in training to more than 600, with response length rising accordingly, indicating that the model learned to take more time to organize its thoughts instead of answering hastily.
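As a small illustration of how such length statistics might be tracked, here is a hedged sketch that splits a generation into its thinking and response parts and counts tokens. The <think> tags and the tokenizer id are assumptions, not details taken from the paper.

```python
# Hypothetical sketch: measuring thinking vs. response length in tokens.
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # assumed tokenizer

def split_lengths(completion: str) -> tuple[int, int]:
    # Split on the assumed <think>...</think> delimiters; anything after the
    # closing tag is treated as the final response.
    match = re.search(r"<think>(.*?)</think>(.*)", completion, flags=re.DOTALL)
    thinking, response = (match.group(1), match.group(2)) if match else ("", completion)
    return len(tokenizer.encode(thinking)), len(tokenizer.encode(response))

think_tokens, response_tokens = split_lengths(
    "<think>List constraints, group ideas, refine.</think>Dear team, ..."
)
print(think_tokens, response_tokens)
```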
To capture the differences more precisely, the team also used GPT-4.1-mini to extract reasoning features from 1,024 WildBench examples. The results showed that the RLMT model's win rates far exceeded the SFT model's on features such as "weighing different viewpoints", "grouping ideas into topics", and "integrating constraints into the plan", while the "strict step-by-step structure" feature was markedly weaker. This indicates that the model's reasoning shifted from mechanical step-by-step execution to flexible refinement, closer to the way humans solve complex tasks.
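For readers who want a feel for this kind of LLM-as-judge feature tagging, below is a minimal, hypothetical sketch using the OpenAI chat API. The feature list and prompt wording are assumptions; the paper's exact rubric and judging protocol are not reproduced here.

```python
# Hypothetical sketch of judge-based feature tagging for a reasoning trace.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

FEATURES = [
    "weighs different viewpoints",
    "groups ideas into topics",
    "integrates constraints into the plan",
    "strict step-by-step structure",
]

def tag_reasoning_features(reasoning_trace: str) -> str:
    # Ask the judge model to mark each feature as present or absent.
    rubric = "\n".join(f"- {feature}" for feature in FEATURES)
    prompt = (
        "For each feature below, answer yes or no: does the reasoning trace exhibit it?\n"
        f"{rubric}\n\nReasoning trace:\n{reasoning_trace}"
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```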
In the past, post-training of language models often relied on the training method of "massive data + multi-stage fine-tuning". For example, Llama-3.1-8B-Instruct requires a complex process including supervised fine-tuning, rejection sampling, and iterative preference optimization, using over 25 million samples. However, the emergence of RLMT breaks this paradigm. Using only 7K real conversation prompts, the Llama-3.1-8B base model can outperform the instruction model optimized through the above complex process.
The significance of this achievement goes far beyond the technical breakthrough itself. It proves that improving the general ability of language models does not necessarily require a large amount of data accumulation but can be achieved by stimulating the model's "thinking ability". The RLMT framework not only provides a new solution for general chat tasks but also redefines the post-training process of language models. In the future, enabling models to "think" may become a core aspect as important as "pre-training" and "supervised fine-tuning".
Of course, the research also has limitations. The team admitted that they have not yet determined whether the performance improvement is due to the enhancement of the model's original features or the learning of new features, and they have not deeply optimized the reasoning trajectory format and training hyperparameters. However, this also leaves broad space for future research, such as exploring better CoT formats, extending RLMT to fields such as logical reasoning and long text generation, and even integrating "thinking ability" into multimodal models.
From enabling models to "speak" to enabling them to "think", RLMT has taken a crucial step. When language models can not only generate fluent text but also organize their thoughts and weigh pros and cons like humans, perhaps we are one step closer to Artificial General Intelligence (AGI), which truly understands human needs.