
A 14B model beats a 671B model: Microsoft's rStar2-Agent outperforms DeepSeek-R1 in mathematical reasoning.

机器之心 · 2025-09-02 15:33
Was the 671B DeepSeek-R1 actually outperformed by a 14B model in mathematical reasoning?

Large language models (LLMs) have achieved remarkable reasoning capabilities, and a key driver is test-time scaling.

Generally speaking, extending the chain of thought (CoT) can prolong the "thinking time," thereby significantly enhancing performance, especially when optimized using large-scale reinforcement learning with verifiable rewards (RLVR).

However, for difficult problems prone to minor intermediate errors or requiring creative reasoning shifts, longer chains of thought still have fundamental limitations. In such cases, models often rely on internal self-reflection, which frequently fails to detect errors or self-correct when the initial approach is flawed.

Therefore, models should not only be able to think for longer periods but also think "more intelligently." To achieve this, more advanced cognitive abilities can be introduced, enabling models to autonomously utilize appropriate tools, reason, verify, and learn from the feedback signals provided by the tool environment.

Recently, a research team from Microsoft Research explored using agentic reinforcement learning to achieve this goal. That is, the model interacts with tools in a dedicated tool environment and adjusts its reasoning approach based on the received feedback.

The result of their exploration is rStar2-Agent, a powerful agentic reinforcement learning method. Using it, the Microsoft team trained a 14B reasoning model, rStar2-Agent-14B, which achieves state-of-the-art performance comparable to, and in some cases surpassing, that of the 671B DeepSeek-R1!

The research has attracted considerable attention on social media.

Next, let's briefly understand how Microsoft created this model that can compete with larger ones.

Paper title: rStar2-Agent: Agentic Reasoning Technical Report

Paper link: https://arxiv.org/pdf/2508.20722

Code link: https://github.com/microsoft/rStar

Environment and Problem Description

The environment used in this research is the Python programming tool and interpreter.

The Python programming tool can expand the model's action space, enabling it to explore alternatives and verify intermediate steps, thus supplementing internal self-reflection when a longer CoT alone is insufficient.
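For example, rather than trusting an error-prone arithmetic step inside the chain of thought, the model can offload it to the interpreter and compare the result with its own expectation. The snippet below is a hypothetical illustration of such a verification call, not actual rStar2-Agent output:

```python
# Hypothetical tool call emitted mid-reasoning to verify a claimed result:
# "I believe the sum of the squares of the first 100 positive integers is 338350."
claimed = 338350
computed = sum(k * k for k in range(1, 101))
print(computed, computed == claimed)  # interpreter feedback: 338350 True
```

The printed output is returned to the model as environment feedback, letting it confirm the step or revise its approach before continuing.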

However, effectively scaling agentic reinforcement learning in this environment is very challenging.

First, the inherent complexity of the programming tool and Python interpreter introduces environmental noise into the reasoning process. When the model inevitably generates syntactically or logically incorrect code, the resulting environmental feedback (e.g., error messages) can cause it to waste valuable tokens on error correction instead of advancing the reasoning. Unfortunately, current reinforcement learning methods rely mainly on outcome-only rewards, which exacerbates the problem: because trajectories with failed intermediate tool calls can still receive positive rewards as long as the final answer is correct, the model learns to treat such errors as acceptable and produces long, low-quality reasoning trajectories.

Second, large-scale agentic reinforcement learning training has high infrastructure requirements. A single training batch can trigger tens of thousands of concurrent tool calls, making it extremely challenging to build a reliable and responsive code execution environment.

Moreover, deploying agents that interact with the environment amplifies the inefficiencies in standard reinforcement learning system deployment, significantly slowing down the overall training speed.

Three Key Innovations of rStar2-Agent

The rStar2-Agent proposed by Microsoft includes three key innovations.

First, the team built an efficient and reliable infrastructure for large-scale agentic reinforcement learning.

They constructed a high-throughput, independent code environment capable of handling 45K concurrent tool calls, with an average execution feedback return time of only 0.3 seconds.
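While the paper's execution service is a dedicated, isolated, distributed system, its interface can be sketched in a few lines: take a model-generated code string, run it in isolation with a time limit, and return the captured output as the feedback string. The sketch below uses Python's asyncio and subprocesses purely for illustration; it is not the authors' implementation and omits sandboxing and worker pooling.

```python
import asyncio

async def run_tool_call(code: str, timeout: float = 10.0) -> str:
    """Run one model-generated Python snippet and return its output (sketch only)."""
    proc = await asyncio.create_subprocess_exec(
        "python", "-c", code,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout)
        return out.decode(errors="replace")
    except asyncio.TimeoutError:
        proc.kill()
        return "TimeoutError: execution exceeded the time limit"

async def run_batch(snippets):
    # Many tool calls can be awaited concurrently from a single event loop;
    # a production service would shard them across isolated worker machines.
    return await asyncio.gather(*(run_tool_call(s) for s in snippets))
```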

To address the inefficiency of reinforcement learning rollouts, they introduced a load-balanced rollout scheduler that dynamically allocates rollout requests based on the available key-value cache capacity on the GPU, thereby maximizing computational utilization.
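The scheduling idea can be illustrated with a greedy sketch: send each pending rollout to the inference worker that currently has the most free KV-cache capacity, updating the estimates as requests are assigned. The request and worker structures below are hypothetical, and the actual scheduler operates dynamically during generation rather than over a fixed batch.

```python
import heapq

def schedule_rollouts(requests, workers):
    """Greedy KV-cache-aware dispatch (illustrative sketch).

    requests: list of (request_id, estimated_kv_tokens)
    workers:  list of (worker_id, free_kv_tokens)
    Returns {request_id: worker_id}, always picking the worker with the
    most free KV cache so no GPU stalls on memory while others sit idle.
    """
    # Max-heap over free capacity (heapq is a min-heap, so negate).
    heap = [(-free, wid) for wid, free in workers]
    heapq.heapify(heap)

    assignment = {}
    # Place the largest requests first, a common greedy balancing heuristic.
    for req_id, est_kv in sorted(requests, key=lambda r: -r[1]):
        neg_free, wid = heapq.heappop(heap)
        assignment[req_id] = wid
        heapq.heappush(heap, (neg_free + est_kv, wid))  # free capacity shrinks
    return assignment

# Example: three workers with different amounts of free KV cache.
print(schedule_rollouts(
    requests=[("r1", 4000), ("r2", 2000), ("r3", 3000)],
    workers=[("gpu0", 6000), ("gpu1", 9000), ("gpu2", 5000)],
))
```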

Even with limited GPU resources, this infrastructure enables efficient reinforcement learning training. Using 64 MI300X GPUs, the team completed the training of rStar2-Agent-14B in just one week.

Second, to achieve effective agentic reinforcement learning in the code environment, the team proposed GRPO-RoC, which combines Group Relative Policy Optimization (GRPO) with a Resample-on-Correct (RoC) rollout strategy to address the environment-induced noise that arises under sparse, outcome-only rewards.

Specifically, RoC first oversamples a larger rollout group and then downsamples it to the standard batch size. Positive trajectories are filtered to retain only those with the highest quality and the fewest tool-induced errors or formatting issues, while negative trajectories are uniformly downsampled.

This simple yet effective asymmetric sampling method retains various failure modes as informative negative signals while emphasizing higher-quality successful cases for positive supervision.
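A minimal Python sketch of this oversample-then-downsample step is shown below. The trajectory fields, the positive/negative split, and the quality ordering are assumptions made for illustration, not the paper's exact implementation.

```python
import random

def roc_downsample(rollouts, group_size):
    """Resample-on-Correct style filtering (illustrative sketch).

    rollouts: oversampled list of dicts with assumed keys
              "reward" (1.0 if the final answer is correct, else 0.0),
              "tool_errors" and "format_issues" (error counts).
    Keeps the cleanest correct trajectories as positive examples and a
    uniform sample of incorrect ones as negative examples.
    """
    positives = [r for r in rollouts if r["reward"] > 0]
    negatives = [r for r in rollouts if r["reward"] <= 0]

    # One simple choice: aim for an even split within the group.
    n_pos = min(len(positives), group_size // 2)
    n_neg = min(len(negatives), group_size - n_pos)

    # Positives: keep only the highest-quality successes
    # (fewest tool-induced errors and formatting issues).
    positives.sort(key=lambda r: (r["tool_errors"], r["format_issues"]))
    kept_pos = positives[:n_pos]

    # Negatives: uniform downsampling preserves diverse failure modes.
    kept_neg = random.sample(negatives, n_neg)

    return kept_pos + kept_neg
```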

Compared to methods that explicitly penalize tool usage errors in the reward function, GRPO-RoC improves training stability and avoids the risk of reward-hacking.

By learning from cleaner and higher-quality positive trajectories, the model not only increases the utilization of the Python programming tool but also demonstrates advanced cognitive abilities, enabling more efficient and concise reasoning in real code environment interactions.

Third, the team also proposed a training scheme that can elevate a 14B pre-trained base model to the state-of-the-art level in mathematical reasoning with minimal computation.

Different from previous studies (which apply reasoning-heavy supervised fine-tuning (SFT) before reinforcement learning), the team started with a non-reasoning SFT stage used only to instill general instruction following, programming-tool usage, and output formatting, without enhancing reasoning ability. This avoids potential SFT overfitting and keeps initial responses short on average, allowing reinforcement learning to cultivate reasoning ability more effectively while fully leveraging the model's pre-trained capabilities.

Then, the team ran multi-stage reinforcement learning training with GRPO-RoC, gradually increasing task difficulty and the maximum rollout length. Unlike previous reinforcement learning methods that expand the maximum response length to 16K→48K tokens or beyond, the team kept the per-stage length limit in a shorter range (8K→12K). This significantly reduces reinforcement learning cost while encouraging more efficient reasoning strategies.
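Conceptually, the staged schedule can be written down as a short configuration. The number of stages and the data filters below are illustrative assumptions based on the 8K→12K description above, not the paper's exact recipe, and train_one_stage is a hypothetical callback wrapping one GRPO-RoC training phase.

```python
# Illustrative staged-RL schedule (assumed values, not the paper's exact setup).
STAGES = [
    {"max_response_tokens": 8_000,  "data": "mixed-difficulty math problems"},
    {"max_response_tokens": 12_000, "data": "mixed-difficulty math problems"},
    {"max_response_tokens": 12_000, "data": "harder problems only"},
]

def run_multistage_rl(train_one_stage):
    """train_one_stage is a hypothetical callback running one GRPO-RoC phase."""
    for i, stage in enumerate(STAGES, start=1):
        print(f"Stage {i}: max length {stage['max_response_tokens']} tokens, "
              f"data = {stage['data']}")
        train_one_stage(max_len=stage["max_response_tokens"],
                        data_spec=stage["data"])
```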

The model can quickly achieve state-of-the-art mathematical reasoning with only 510 reinforcement learning steps, demonstrating strong capabilities and excellent training efficiency.

Impressive Results

Finally, using the new method, they trained a model named rStar2-Agent-14B. Although it is only 14B in size, it has achieved powerful mathematical reasoning performance that surpasses leading reasoning models such as DeepSeek-R1 and Kimi k1.5.

Notably, on AIME24 its accuracy reached 80.6%, which is 1.0, 0.8, and 3.6 percentage points higher than o3-mini (medium), DeepSeek-R1, and Claude Opus 4.0 (thinking), respectively. On AIME25 and HMMT25 it reached 69.8% and 52.7%, demonstrating stable and consistently strong performance.

Beyond mathematics, although it was trained with agentic reinforcement learning only on mathematical tasks, it generalizes effectively.

It outperforms DeepSeek-V3 on the GPQA-Diamond scientific reasoning benchmark, performs well on the agent tool usage task in BFCL v3, and achieves competitive results on general benchmarks such as IFEval and Arena-Hard.

The team also reports unsuccessful attempts and analyses, and highlights the more advanced cognitive reasoning behaviors that emerge from rStar2-Agent's agentic reinforcement learning, such as reflection tokens driven by environment feedback that lead to more effective reasoning.

For more analyses and ablation studies, please refer to the original paper.

This article is from the WeChat official account 机器之心 (ID: almosthuman2014), author: 机器之心, editor: Panda; republished by 36Kr with authorization.