HomeArticle

Is GRPO outdated?

机器之心2026-06-21 11:01
GLM-5.2 no longer uses GRPO, sparking a heated debate about the future direction of reinforcement learning

On June 13th, Zhipu announced on the X platform that GLM-5.2 would be fully open-sourced, and set the official opening time at 5:21 PM that night - a "special moment".

Many people believe this number wasn't randomly chosen: The moment the US government issued an export control order to Anthropic, cutting off overseas access to Fable 5 and Mythos 5, was exactly 5:21 PM Eastern Time. This repetition of "5:21" has been interpreted by multiple media outlets as a deliberately designed response. By choosing to step forward at this moment, Zhipu is, in effect, telling developers around the world: The concern of "models being taken back at any time" doesn't exist in the open-source world.

This release truly justifies such a stance. GLM-5.2 is a MoE model with 744B parameters and 40B activated, fully open-sourced under the MIT license, and supports a truly usable 1M token context. On the long-range task benchmark FrontierSWE, it scored 74.4%, approaching Claude Opus 4.8's 75.1% and surpassing GPT-5.5's 72.6%. Many developers have reported after actual testing that this is the first open-source model that makes them seriously consider replacing Opus or GPT in their workflows.

Yesterday, the blog post "How GLM-5.2 Beat Fable 5 in Website Design" released by Design Arena became a viral article, attracting wide attention and sparking heated discussions.

However, what really stirs up the tech community more than these scores is a detail almost buried in the corner of a technical blog: GLM-5.2 abandoned GRPO during the long-range reinforcement learning phase.

Image source: X @JoshPurtell @sheriyuo @MikaStars39

This might seem like a small thing, but it's like a needle that pricks a consensus that has been around for over a year. GRPO (Group Relative Policy Optimization) was proposed by DeepSeek in the DeepSeekMath paper in 2024 and verified by DeepSeek-R1. Since then, it has almost become the default choice for training inference models in the open-source community - it can train models with strong inference capabilities without a value network. GLM-5.1 used this approach during its reinforcement learning phase. More than a year later, GLM-5.2 quietly replaced it.

One of the earliest adopters of a proven paradigm is quietly abandoning it.

Reactions in the Tech Community

After the news spread, discussions on X quickly branched into several lines.

Some people called this "the return of the critic". Developer @hallerite's judgment was straightforward: The method of reducing variance through in-group comparison doesn't work after a certain task length. The model needs more fine-grained signals. OpenAI and Anthropic probably started using value networks a long time ago.

There were many similar posts. Some said they compared GRPO and actor-critic in small-scale projects, and actor-critic performed significantly better. Others suspected that leading labs like OpenAI and Anthropic never really relied on GRPO for long-range tasks. This is just a wall that long-range tasks will eventually hit. For example, @ethayarajh pointed out that the PPO route, which was rejected by NeurIPS, is actually closer to the "bitter lesson" often mentioned in the reinforcement learning community - methods that are sufficiently general and can scale with computational resources tend to go further than those with delicate structures but limited applicability.

Xiuyu Li reminded that some teams that have been working on long-range task training have never fully adopted GRPO. PPO or even REINFORCE has always been their main approach.

In the academic community, it's a different story: Variants like GSPO, DAPO, Dr.GRPO, GMPO, and CISPO are still emerging continuously, trying to polish off the inefficiencies and instabilities of GRPO.

The industry is quietly turning back, while the academic community is charging forward. This contrast is quite interesting.

Why Zhipu Replaced GRPO

To understand this switch, we first need to figure out what problem GRPO was originally designed to solve.

Traditional PPO requires a value network (critic) to specifically predict "how much reward can be obtained in the future from the current state", which is used to calculate the advantage value for each action. This network is as large as the policy model, expensive to train, and prone to instability.

GRPO's solution is: Stop training the value network. Instead, have the model generate a group (usually dozens) of responses to the same question, and use the average reward within the group as the baseline. If an answer's reward is higher than the group average, its advantage value is positive. It's like having dozens of students submit their answers to the same question at the same time and then comparing their scores with each other - there's no need for an all-knowing grader, and the best can still be selected from the group.

For short tasks with clear right or wrong answers, such as math problems and unit tests, this method saves memory and is stable. After DeepSeek-R1, it almost became the default choice in the open-source community.

GLM-5.1 used this approach during its reinforcement learning phase, with a fixed group size of 32.

However, GLM-5.2 is targeting a different type of problem: long-range agent tasks. According to the content disclosed in Zhipu's technical blog, the execution trajectory of these tasks is much longer than solving a math problem, involving multiple rounds of tool calls, sub-task decomposition, and multi-round environmental feedback. After compaction, the number and length of sub-trajectories can vary greatly.

This exactly hits GRPO's weakness: It requires comparing a group of outputs for the same question. However, the sub-trajectories compressed from long-range tasks have different lengths. Some are brief, while others are dozens of steps long. It's impossible to form a group of samples for fair comparison. If we continue to force in-group comparison, a large amount of data becomes unusable.

Zhipu's solution is to bring back the value network. GLM-5.2's long-range reinforcement learning has shifted from "group relative optimization" to "critic-based PPO", using token-level advantage values to adapt to sub-trajectories of different lengths - no longer relying on a group of peers to score each other, but retraining a "grader" that can independently evaluate any trajectory.

Image source: Tweet by Chen Deli of DeepSeek

In conjunction with this change, Zhipu used the slime framework to integrate training and large-scale inference rollout, and parallelly distilled more than a dozen expert models into the final model in just about two days. To address the common issue of reward cheating in coding tasks (such as directly pulling the reference answer via curl or searching for hidden test case files using grep), GLM-5.2 introduced a two-stage interception mechanism. First, it uses rules for filtering, and then an LLM referee to identify suspicious tool calls. After interception, it returns a meaningless "false message" to keep the training trajectory going instead of abruptly interrupting it, to avoid training instability.

In short, GLM-5.2 doesn't deny GRPO. Instead, it found that the design premise of GRPO doesn't hold for long-range agent tasks.

Is GRPO Really Outdated?

Simply summarizing this switch as "GRPO doesn't work" might be a lazy conclusion.

GRPO became popular because it solved a very specific problem: performing reinforcement learning on verifiable tasks with clear right or wrong answers using as little memory as possible and in a stable way. It still does this very well. For short tasks like math problems, code unit tests, and format checks, the answers are within the sampled group, and the cost advantage of in-group comparison still holds. That's why variants like GSPO and DAPO are still continuously refining GRPO's performance in MoE training and long thought-chain scenarios, rather than simply declaring it obsolete.

A more telling example is the creator of GRPO itself. The technical report of DeepSeek V4 released in April this year shows that DeepSeek still used GRPO when training domain-specific expert models for math, code, agents, and instruction following. It only switched to a new method called "On-Policy Distillation" when merging multiple experts into a unified model.

What GLM-5.2 actually replaced is the applicability of GRPO for another type of task (multi-round, long-range, agent tasks with sparse and delayed rewards). In such tasks, "how well this step is done" often can only be inferred from the final result dozens of steps later. Moreover, the lengths of the task trajectories vary greatly, making it difficult to find a group of "identical-condition" samples for in-group comparison.

This judgment is not just an industry experience. There are also academic control experiments to support it.

A paper titled "Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments" published at the end of last year specifically conducted tests: In long-range tasks without a premature termination mechanism, methods without a critic consistently performed worse than PPO with a learned value function. Only in short-range tasks like CartPole could the in-group comparison method achieve comparable results.

This conclusion and GLM-5.2's choice this time are the same judgment reached from two completely different directions: industrial practice and academic experiments.

So a more accurate statement might be: The choice of reinforcement learning algorithms is becoming task-dependent, and there is no longer a "default option" that works for all scenarios.

For short-range verifiable tasks, GRPO and its variants are still sufficient and cost-effective. For long-range agent tasks, the value network becomes important again.

The discussions triggered by GLM-5.2 are significant because it has for the first time presented this dividing line in a public technical blog, turning a judgment that was only a rumor in a small circle (leading labs might not rely on GRPO for long-range tasks) into an open-source, reproducible, and verifiable reference sample for the outside world.

Conclusion

In the past two years, GRPO has almost become synonymous with the reinforcement learning phase of open-source large models, a "cheap and effective" default belief. GLM-5.2's choice reminds us that this belief has its limitations - it was born in the world of math problems and unit tests, while today's agents are being pushed towards real tasks that require continuous work for hours or even longer.

For the entire industry, the significance of this switch might exceed that of the 1M context or the benchmark scores themselves. It shows that as open-source models evolve from "test-takers" to "working agents", the algorithm selection in the post-training phase also needs to evolve with the task form, rather than staying within the paradigm set by a single paper.

No one can predict where the next paradigm shift will occur, but one thing is certain: the debate about the future direction of reinforcement learning has just begun.

This article is from the WeChat official account