
Does DeepSeek's GRPO cause model collapse? A look at Qwen3's new paradigm, GSPO.

机器之心 | 2025-08-07 18:09
GSPO's sequence-level importance sampling may replace GRPO and become the new standard.

As is well known, the training of large language models typically consists of two stages. The first stage is "pretraining." Developers train the model using a large-scale text dataset to enable it to predict the next word in a sentence. The second stage is "post-training," which aims to teach the model how to better understand and execute human instructions.

The post-training stage can be viewed as a specialized application of reinforcement learning, and the reinforcement learning (RL) algorithms used to fine-tune large language models (LLMs) have been evolving along a clear path.

Initially, OpenAI pioneered a technique called Reinforcement Learning from Human Feedback (RLHF) to improve ChatGPT. The core of RLHF is to have human annotators score multiple responses generated by the model and select the optimal answer as a reference for training. Although this process is effective, it is also time-consuming, expensive, and reliant on human labor, usually requiring a small but professional data annotation team.

DeepSeek's significant innovation lies in automating this process using RL technology. Instead of relying on manual evaluation one by one, the algorithm allows the model to learn correct behaviors autonomously by receiving "reward signals" during the exploration process, thereby significantly reducing costs, improving efficiency, and ultimately achieving high performance at a lower cost.

OpenAI adopted Proximal Policy Optimization (PPO) in the training of ChatGPT.

The DeepSeek team believes that estimating the baseline from a group of sampled responses, rather than from a separate value model, is more effective. They therefore proposed the Group Relative Policy Optimization (GRPO) algorithm, which is also the core post-training technology behind the DeepSeek-R1 model and a key reason it stood out.

Comparison between GRPO and PPO, excerpted from the DeepSeekMath paper.

When Qwen3 was first introduced a few months ago, its flagship model already performed on par with top models such as DeepSeek-R1, o3-mini, and Gemini 2.5 Pro. The Qwen3 series also spans both Mixture-of-Experts (MoE) and dense models, each offered in multiple variants.

Recently, the Qwen3 series has continued to receive updates. For example, Qwen3-235B-A22B-Instruct-2507-FP8 performed excellently across numerous evaluations covering knowledge, mathematics, programming, human-preference alignment, and agent capabilities, even surpassing top open-source models such as Kimi-K2 and DeepSeek-V3, as well as leading closed-source models such as Claude-Opus4-Non-thinking.

Recently, the Qwen team published a paper on its model's post-training algorithm, seemingly revealing the core technical details behind the success of the Qwen3 model.

Paper title: Group Sequence Policy Optimization

Paper link: https://huggingface.co/papers/2507.18071

Blog link: https://qwenlm.github.io/blog/gspo/

Yesterday, NetMind.AI, a startup founded by alumni of Tsinghua University, published a blog titled "Qwen Team Proposes GSPO for Qwen3, Claims DeepSeek's GRPO is Ill-Posed," providing a detailed introduction and analysis of the GSPO algorithm proposed by the Qwen team for the Qwen3 model.

Blog link: https://blog.netmind.ai/article/Qwen_Team_Proposes_GSPO_for_Qwen3%2C_Claims_DeepSeek's_GRPO_is_Ill-Posed

Recent research by Qwen indicates that training large language models with GRPO suffers from serious stability issues, often leading to irreversible model collapse. The team argues that DeepSeek's GRPO method has several serious problems:

Applying importance sampling at each token level will accumulate high variance in long sequences, leading to unstable training.

This problem is particularly severe in Mixture-of-Experts (MoE) models because token-level routing changes will exacerbate instability.

To alleviate this problem, the training process based on GRPO usually needs to rely on some additional strategies, such as Routing Replay.

Therefore, the Qwen team claims that the token-level importance sampling of GRPO cannot achieve stable training, and its optimization objective is "ill-posed."

To solve these problems and train its latest Qwen3 series of models, the Qwen team proposed a new reinforcement learning algorithm - Group Sequence Policy Optimization (GSPO).

Fundamental problems of GRPO:

Instability of "token-by-token importance sampling"

The Qwen team pointed out that the instability of GRPO stems from its incorrect use of token-level importance sampling weights. In reinforcement learning, Importance Sampling is used to correct the difference between the behavior policy (the policy used to collect training data) and the target policy (the policy currently being optimized).

When the two are inconsistent, Importance Sampling assigns weights to existing data samples to make them more representative of the target policy that one hopes to optimize, thereby improving the stability and effectiveness of training.

In the training of large language models (LLMs), reinforcement learning often reuses responses generated by the old policy to save computational resources, which is a typical "off-policy" training scenario. Importance Sampling is used to mitigate the impact of this policy mismatch and help stabilize the training process.
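As a reminder, the underlying identity is standard importance sampling (not specific to either paper): for a target policy $\pi_\theta$ and the old behavior policy $\pi_{\theta_{\mathrm{old}}}$ that generated the responses,

$$\mathbb{E}_{y \sim \pi_\theta}\big[f(y)\big] \;=\; \mathbb{E}_{y \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_\theta(y)}{\pi_{\theta_{\mathrm{old}}}(y)}\, f(y)\right],$$

so every sample drawn from the old policy is reweighted by the ratio of the two likelihoods. The correction is only reliable when this ratio is averaged over enough samples and does not fluctuate too violently.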

However, GRPO applies the importance sampling weights to each individual token rather than the entire generated sequence. This approach introduces significant variance and causes "error accumulation" and "training instability" when generating long sequences.

Formally, GRPO calculates the importance weights separately at each token generation step:
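In standard notation (query $x$, the $i$-th sampled response $y_i$, and its $t$-th token $y_{i,t}$), this per-token ratio is

$$w_{i,t}(\theta) \;=\; \frac{\pi_\theta\big(y_{i,t} \mid x,\, y_{i,<t}\big)}{\pi_{\theta_{\mathrm{old}}}\big(y_{i,t} \mid x,\, y_{i,<t}\big)},$$

and each token's contribution to the objective is weighted by its own $w_{i,t}(\theta)$.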

The Qwen team pointed out that when such importance weights are applied in the training objective, the fact that each token's ratio is computed independently leads to an accumulation of high variance, which destabilizes the gradients and ultimately causes the model to collapse.

Moreover, this approach injects high-variance noise into the training gradients, an effect that compounds over long sequences, and the clipping mechanism further exacerbates the instability.
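For reference, the GRPO objective (omitting the KL-regularization term used in some variants) applies PPO-style clipping to each of these per-token ratios:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\!\Big(w_{i,t}(\theta)\,\hat{A}_i,\;\mathrm{clip}\big(w_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right],$$

where $\hat{A}_i$ is the group-relative advantage of response $y_i$ (its reward standardized against the group's mean and standard deviation) and $\varepsilon$ is the clipping range. Because the clipping decision is made independently for every token, a noisy $w_{i,t}$ can push individual tokens in and out of the clipped region even when the sequence as a whole has barely drifted.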

Experimental evidence from the Qwen team

The Qwen team verified its theoretical analysis through experimental evidence, as shown in the figure.

In all the experimental settings reported, the newly proposed GSPO algorithm outperformed GRPO in training efficiency. On the CodeForces task, GRPO's final score plateaued below 2000 points, while GSPO continued to improve as training compute increased, demonstrating stronger scalability.

Comparison of training curves between GSPO and GRPO

Qwen's solution:

"Sequence-level importance sampling"

So, how does GSPO solve the above problems?

As its name implies, the core of GSPO is to shift importance sampling from the token level to the sequence level. Its importance ratio is calculated based on the likelihood of the entire sequence:
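In the paper's notation, this sequence-level ratio is the length-normalized likelihood ratio of the whole response,

$$s_i(\theta) \;=\; \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{\frac{1}{|y_i|}} \;=\; \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta\big(y_{i,t} \mid x,\, y_{i,<t}\big)}{\pi_{\theta_{\mathrm{old}}}\big(y_{i,t} \mid x,\, y_{i,<t}\big)}\right),$$

and the GSPO objective clips this single ratio once per response rather than once per token:

$$\mathcal{J}_{\mathrm{GSPO}}(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(s_i(\theta)\,\hat{A}_i,\;\mathrm{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right].$$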

This design of sampling weights naturally alleviates the problem of variance accumulation at each token, thereby significantly improving the stability of the training process.

It should be noted that the factor $1/|y_i|$ in the exponent performs length normalization. Without it, a likelihood change in just a few tokens could cause drastic fluctuations in the sequence-level importance ratio, and responses of different lengths would require different clipping ranges in the objective, further destabilizing training.
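A minimal numerical sketch (illustrative only, not from the paper) of why this normalization matters: if each token's log-ratio fluctuates slightly, the raw product of per-token ratios blows up with sequence length, while the length-normalized (geometric-mean) ratio stays tightly concentrated.

```python
# Illustrative sketch (not from the paper): compare the raw sequence
# likelihood ratio (product of per-token ratios) with the length-normalized
# ratio used at the sequence level, under small i.i.d. per-token log-ratio noise.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05                      # assumed per-token log-ratio std (policy drift)
n_samples = 100_000

for T in (64, 512, 4096):         # response lengths in tokens
    log_ratios = rng.normal(0.0, sigma, size=(n_samples, T))

    raw_ratio = np.exp(log_ratios.sum(axis=1))     # prod_t pi / pi_old
    norm_ratio = np.exp(log_ratios.mean(axis=1))   # (prod_t pi / pi_old)^(1/T)

    print(f"T={T:5d}  raw std={raw_ratio.std():12.3f}  "
          f"normalized std={norm_ratio.std():.5f}")
```

The raw ratio's spread grows rapidly with length, so likelihood shifts on a handful of tokens can swing it by orders of magnitude; the normalized ratio's fluctuations shrink roughly as $1/\sqrt{|y_i|}$, which keeps a single clipping range meaningful across responses of different lengths.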

Advantages verified by experiments:

Simplify the training of MoE models

Special experiments conducted on Mixture-of-Experts (MoE) models further highlight the advantages of GSPO.

Due to the sparse-activation nature of MoE models, GRPO training becomes even more unstable: after one or more gradient updates, the set of experts activated for the same response can change significantly.

When the Qwen team trained the 48-layer Qwen3-30B-A3B-Base model with GRPO, they found that after each reinforcement-learning gradient update, about 10% of the experts activated by the new policy for the same rollout samples differed from those activated by the old policy. In effect, after every gradient update the same data is used to train what behaves like a different model, which is undoubtedly an extremely inefficient way to train.
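As a rough illustration of what such a measurement looks like (a hypothetical sketch, not the Qwen team's actual instrumentation), one can compare the top-k experts selected by the router for the same rollout tokens before and after an update:

```python
# Hypothetical sketch of measuring MoE routing drift: for identical rollout
# tokens, compare the top-k experts chosen under the old and new policies.
import numpy as np

def expert_overlap(old_logits: np.ndarray, new_logits: np.ndarray, k: int = 8) -> float:
    """Average fraction of top-k experts shared per token.

    old_logits / new_logits: router logits of shape [num_tokens, num_experts]
    computed on the same token sequence by the old and updated policy.
    """
    old_topk = np.argsort(-old_logits, axis=-1)[:, :k]
    new_topk = np.argsort(-new_logits, axis=-1)[:, :k]
    shared = [len(set(o) & set(n)) / k for o, n in zip(old_topk, new_topk)]
    return float(np.mean(shared))

# Toy example: perturb router logits to mimic the drift from one gradient step.
rng = np.random.default_rng(0)
old = rng.normal(size=(2048, 128))                  # 2048 tokens, 128 experts
new = old + rng.normal(scale=0.15, size=old.shape)  # assumed post-update drift
print(f"shared experts per token: {expert_overlap(old, new):.1%}")
```

A low overlap means the token-level importance ratios are being computed against partly different expert networks from one update to the next.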

Before introducing GSPO, to alleviate this problem, they even adopted a technique called "Routing Replay," which forces the target policy to activate the same expert networks as the old policy.

In contrast, GSPO can achieve stable convergence without using Routing Replay, thereby eliminating unnecessary training complexity and retaining the full potential of the MoE architecture.

The Routing Replay strategy plays a crucial role in the normal convergence of GRPO training for MoE models

Conclusion:

GSPO may become the new standard

In summary, the GSPO method has two innovations:

It upgrades importance sampling from the token level to the sequence level and normalizes it by sequence length;

It significantly reduces variance and eliminates the need for auxiliary "routing tricks" such as Routing Replay.

The industry has broadly reached a consensus that introducing reinforcement learning in the post-training stage of large language models is crucial for improving their reasoning ability.

A large number of experimental results in the paper further confirm that the "token-by-token importance sampling" method used by GRPO has problems of instability and inefficiency.

Therefore, the "sequence-level importance sampling" proposed by GSPO is likely to become the new standard for post-training reinforcement learning in the future.

Reference links:

https://www.reddit.com/r/MachineLearning/comments/1mj3t3r/d_gspo_qwen3s_sequencelevel_rlhf_method_vs_grpo/

https://blog.netmind.ai/article/Qwen_Team_Proposes_GSPO_for_Qwen3%2C_Claims_DeepSeek's_GRPO_is_Ill-Posed

https://www.ft.com/content/ea803121-196f-4c61-ab70-93b38043836e?utm_source=chatgpt.com

https://zhuanlan.zhihu.com/p/22845155602

This article is from the WeChat public account Machine Heart (机器之心) and is published by 36Kr with permission.