
Lengthy responses cut by up to 80%: DeepSeek's GRPO gets a disruptive upgrade as Microsoft introduces GFPO.

机器之心, 2025-08-14 16:26
Microsoft's new algorithm, GFPO, reduces redundant inference tokens by up to 80%.

Anyone who has used reasoning models such as DeepSeek-R1 has probably run into this situation: faced with a slightly tricky question, the model rambles on as if lost in thought, burning time and compute, and the result still may not be reliable. Now we may have a solution.

In the past two days, Microsoft researcher Dimitris Papailiopoulos revealed a new result on X: Group Filtered Policy Optimization (GFPO), a revolutionary reinforcement learning algorithm.

GFPO trades a bit more compute at training time for savings at test time: while improving accuracy, it can cut the redundant token length that reinforcement learning adds to reasoning by up to 80%!

The data is astonishing, but how is this achieved?

Just now, the GFPO paper was finally uploaded to arXiv, making all of the details public for the first time. Here is how this new approach to efficient reinforcement learning works.

  • Paper title: Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
  • Paper address: https://arxiv.org/abs/2508.09726

To understand GFPO, first understand GRPO

Before introducing GFPO, it is necessary to take a look at the Group Relative Policy Optimization (GRPO) proposed by DeepSeek.

GRPO is a simplified variant of the Proximal Policy Optimization (PPO) algorithm: it no longer needs a value model to estimate the baseline for computing advantages. Instead, it samples multiple responses for each question and uses their average reward as the baseline. The optimization objective is still a PPO-style clipped surrogate objective.

If we let θ represent the model parameters, q represent the question, and o represent the response sampled from the old policy π_θ_old, then the GRPO objective can be written as:
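For reference, the GRPO objective in its standard form, written out from the definitions above (following the DeepSeekMath formulation; the paper's exact notation, and whether the KL-regularization term is kept, may differ slightly):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}(r_{i,t}(\theta),1-\epsilon,1+\epsilon)\,\hat{A}_{i,t}\big)-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right]
$$

where $r_{i,t}(\theta)=\dfrac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}$ is the token-level importance ratio and $\hat{A}_{i,t}=\dfrac{R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$ is the group-relative advantage computed from the response-level rewards $R_j$.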

It should be noted that although the standard GRPO loss normalization is shown here, multiple open-source reinforcement learning libraries, including verl and TRL, default to DAPO token-level loss normalization for GRPO, which is also the setting the Microsoft team used in their experiments.

A key limitation of GRPO is that it relies on a single scalar reward signal, which makes it difficult to jointly optimize multiple desired response attributes, such as conciseness and accuracy. As a result, GRPO can indeed improve accuracy, but it also significantly increases the response length.

GFPO is designed to solve this problem. It can optimize multiple response attributes simultaneously.

Group Filtered Policy Optimization: GFPO

GFPO is a simple and effective method for targeted policy optimization of desired response attributes.

GFPO samples a larger group of candidate responses for each question, expanding the response pool to include more candidates with the desired characteristics, and then filters for those characteristics explicitly when computing the policy gradient. Although it seems natural to encode desired attributes such as conciseness or informativeness directly into a scalar reward, encoding multiple characteristics at once can be difficult, especially when correctness must be guaranteed.

Data filtering is an implicit and flexible form of reward shaping, similar to iterative self-improvement methods that use selective sampling to amplify specific model behaviors. After this explicit filtering step separates out the desired responses, the standard reward is used within the selected group to calculate the relative advantage. Therefore, GFPO can optimize multiple desired attributes (such as length and accuracy) simultaneously without complex reward engineering.

Since the goal here is to reduce the inflation of response length in reinforcement learning, the team mainly studied using GFPO to optimize and shorten the response length while maintaining an accuracy comparable to GRPO.

Given a question q, a large group of responses G = {o_1, ..., o_G} is sampled from the current policy. GFPO does not train equally on all of them. Instead, it applies a selection step based on a user-specified metric: a metric score is computed for each response, the responses are ranked by that score, and the top k are retained to form the subset S ⊆ G that is actually trained on (Algorithm 1). The team defines a binary mask m ∈ {0, 1}^G, where m_i = 1 marks a selected response and m_i = 0 a rejected one.

Here is the formal definition of GFPO:
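A reconstruction based on the definitions in the text (not copied from the paper; the normalization over the retained subset, e.g. 1/k here versus the DAPO token-level normalization noted earlier, may differ in detail from the authors' exact formula):

$$
\mathcal{J}_{\mathrm{GFPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}\!\left[\frac{1}{k}\sum_{i=1}^{G} m_i\,\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}(r_{i,t}(\theta),1-\epsilon,1+\epsilon)\,\hat{A}_{i,t}\big)\right],
\qquad
\hat{A}_{i,t}=\begin{cases}\dfrac{R_i-\mu_S}{\sigma_S}, & o_i\in S\\[4pt] 0, & \text{otherwise.}\end{cases}
$$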

Here, the average (μ_S) and standard deviation (σ_S) of the response-level rewards in S are used to normalize the advantages of the responses in the selected subset S. In this way, the responses that have shown the desired attributes can be meaningfully compared, ensuring that GFPO prioritizes the responses with the highest rewards in the filtered subset. The advantages of responses not in S are zero, effectively excluding them from the policy update.
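A minimal sketch of this selection and advantage-masking step (the filtering of Algorithm 1 plus the normalization just described), assuming per-response scalar rewards and a user-supplied scoring metric; the function and variable names are illustrative, not taken from the paper's code:

```python
import numpy as np

def gfpo_advantages(rewards, scores, k, eps=1e-8):
    """Compute GFPO advantages for one question's group of sampled responses.

    rewards: shape (G,) array of scalar rewards, one per response.
    scores:  shape (G,) array of the filtering metric (higher = more desirable),
             e.g. negative response length, or reward / length for token efficiency.
    k:       number of responses to retain.
    Returns a shape (G,) array: normalized advantages for the retained subset S,
    zeros for rejected responses (which therefore contribute no gradient).
    """
    rewards = np.asarray(rewards, dtype=float)
    scores = np.asarray(scores, dtype=float)
    G = len(rewards)

    # Binary mask m: keep the top-k responses according to the metric.
    retained = np.argsort(-scores)[:k]
    mask = np.zeros(G, dtype=bool)
    mask[retained] = True

    # Normalize rewards using the mean and std of the retained subset S only.
    mu_S = rewards[mask].mean()
    sigma_S = rewards[mask].std()
    advantages = np.zeros(G)
    advantages[mask] = (rewards[mask] - mu_S) / (sigma_S + eps)
    return advantages
```

The per-token clipped surrogate loss itself is unchanged; only the advantages fed into it differ.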

Therefore, the main intervention of GFPO is at the advantage estimation level, making it compatible with any GRPO variant, such as DAPO, Dr. GRPO, or GRPO with Dual-Clip PPO loss.

Although GFPO incurs a higher training-time computational cost by sampling more responses, this cost is offset because the learned policy produces shorter responses than GRPO does at inference time.

Although GFPO is general and can accommodate a variety of scoring metrics, the Microsoft team's experiments focused on metrics aimed at reducing response-length inflation:

  • Response length: Training on short responses directly encourages conciseness.
  • Token efficiency (reward/length): Training on responses with high token efficiency also encourages conciseness, but longer responses are still allowed if they can "justify" themselves.

Other metrics (such as factuality, diversity, or external quality scores) can also be integrated into GFPO to optimize different target attributes.
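As a concrete illustration, the two length-oriented metrics above can be expressed as scoring functions for the earlier sketch; these one-liners are hypothetical, not the paper's code:

```python
def length_score(reward, num_tokens):
    # Shortest-k filtering: prefer shorter responses, so score by negative length.
    return -num_tokens

def token_efficiency_score(reward, num_tokens):
    # Token-efficiency filtering: reward per token, so a longer response is kept
    # only if its reward "justifies" the extra tokens.
    return reward / max(num_tokens, 1)
```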

GFPO with Adaptive Difficulty

The team also proposed a GFPO variant: Adaptive Difficulty GFPO, as shown in Algorithm 2. Its goal is to allocate more training signals to more difficult questions.

At each training step, the difficulty of the question is estimated by calculating the average reward of the responses sampled for each question: a lower average reward means higher difficulty.

To adaptively adjust the number of retained responses (k), the team used a lightweight t-digest data structure to maintain a streaming summary of prompt difficulty. The t-digest efficiently approximates the quartiles of the difficulty (mean reward) of all prompts seen so far, enabling new questions to be classified into relative difficulty buckets.

Based on this classification, the team assigns a target number of retained responses k to each question: 4 for easy questions, 6 for medium-difficulty questions, and 8 for difficult and very difficult questions (selected from 16 samples). This dynamic curriculum can more aggressively filter easy prompts and conduct more exploration on difficult prompts. The number of difficulty buckets and the k value for each bucket are hyperparameters of this method.
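A minimal sketch of this adaptive choice of k, assuming the open-source Python tdigest package as the streaming quantile summary (the paper only specifies "a lightweight t-digest data structure"); the bucket thresholds and k values follow the description above, but the helper names are hypothetical:

```python
from tdigest import TDigest  # third-party streaming quantile estimator

difficulty_digest = TDigest()  # running summary of per-question mean rewards

def adaptive_k(mean_reward):
    """Map a question's mean reward (lower = harder) to a retention size k."""
    difficulty_digest.update(mean_reward)
    q25 = difficulty_digest.percentile(25)
    q50 = difficulty_digest.percentile(50)
    q75 = difficulty_digest.percentile(75)
    if mean_reward >= q75:        # easiest quartile: filter aggressively
        return 4
    elif mean_reward >= q50:      # medium difficulty
        return 6
    elif mean_reward >= q25:      # hard
        return 8
    else:                         # very hard (same k as hard in the text)
        return 8
```

In the full variant, the k shortest responses are then retained for each question, i.e. adaptive_k would be combined with the length_score metric sketched above.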

Adaptive Difficulty GFPO can efficiently utilize training computation by concentrating gradient updates where they are most needed. It helps the model reduce the verbosity of simple examples (where the accuracy is already high) while maintaining the accuracy of more difficult prompts by retaining more reasoning chains.

The team said: "To our knowledge, this is the first algorithm that can dynamically adjust the effective group size according to the question difficulty."

Experimental findings based on GFPO

So, how does GFPO perform? The team ran experiments based on the 14B-parameter Phi-4-reasoning model.

They evaluated three GFPO variants:

  • Shortest k/G: Retain the k shortest responses in G, varying k and the group size G to study their impact on length reduction.
  • Token efficiency: Retain the k responses in G with the highest token efficiency (reward/length), using k = 8 and G = 16 (consistent with the baseline Shortest k/G setting).
  • Adaptive Difficulty: Retain the k shortest responses in G, where k is dynamically chosen from {4, 6, 8} according to a real-time difficulty estimate (4 for easy up to 8 for very difficult questions), with G = 16 (see the configuration sketch below).
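In terms of the earlier sketches, the three variants differ only in the scoring function and in how k is chosen; the mapping below is an illustrative summary, not code from the paper:

```python
# Hypothetical configurations tying the three evaluated variants to the
# sketch functions defined earlier ("Shortest k/G" was also run with other
# k and G values, e.g. 6/8, 4/24, 8/24, in the ablations).
variants = {
    "shortest_k_of_G":     dict(score_fn=length_score,           k=8,          group_size=16),
    "token_efficiency":    dict(score_fn=token_efficiency_score, k=8,          group_size=16),
    "adaptive_difficulty": dict(score_fn=length_score,           k=adaptive_k, group_size=16),
}
```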

For more experimental details, please refer to the original paper. Here, we focus on some of the findings obtained by the team.

Finding 1: "Think less" requires more sampling: Reducing the number of retained responses without increasing the group size (Shortest 6/8 GFPO) does not reduce response length.

Finding 2: The fraction of retained responses (k/G) controls the length pressure: Reducing k or increasing G shortens responses further. The team observed that retaining 25-33% of the responses works best, with smaller retention ratios yielding progressively smaller additional gains. Shortest 4/24 is the GFPO variant that optimizes length most aggressively, minimizing overly long responses.

Finding 3: Token efficiency (reward/length) optimization brings the largest reduction: While maintaining accuracy, the additional length is reduced by 70.9% (AIME 25), 84.6% (AIME 24), 79.7% (GPQA), 82.6% (OmniMATH), and 79.7% (LiveCodeBench). These reductions slightly increase the variance during the training process.

Finding 4: Adaptive Difficulty GFPO outperforms the Shortest-k algorithm with the same computational cost: By adaptively determining the k value according to the question difficulty, it achieves better length reduction in 4 out of 5 benchmark tests compared to the Shortest-k algorithm with the same computational cost.

Finding 5: GFPO alleviates out-of-distribution (OOD) length inflation: GRPO increases response length on out-of-distribution tasks without improving accuracy, whereas GFPO suppresses this inflation and even slightly improves accuracy.

Finding 6: GFPO shortens responses at all difficulty levels. Token-efficiency GFPO achieves the greatest reduction on easy, medium, and hard questions: on easy questions its responses are even shorter than those of the SFT model, while accuracy remains comparable to GRPO. Shortest 8/24 GFPO achieves the greatest reduction on the most difficult questions thanks to its strong filtering.

Finding 7: Adaptive Difficulty GFPO surpasses GRPO in accuracy on medium-difficulty and extremely difficult questions, while shortening overly long responses by 47%-60%. A larger group size improves accuracy on difficult questions: Adaptive Difficulty (k = 8, G = 16) drops slightly on difficult questions, but the Shortest 8/24 variant can find concise and correct responses through more sampling, achieving accuracy comparable to GRPO.

Finding 8: Even at a fixed