
By avoiding both "entropy collapse" and "entropy explosion", this research teaches large models "precise exploration" and delivers a significant jump in reasoning performance.

Quantum Bit (量子位), 2025-10-13 19:30
A new idea for training reasoning models has emerged.

The "entropy dilemma" faced by large language models in RLVR training has been solved!

Since 2024, large models represented by OpenAI o1, DeepSeek-R1, Kimi K1, and Qwen3 have made significant breakthroughs in mathematical, code, and scientific reasoning tasks. These advances are largely due to a method called RLVR (Reinforcement Learning with Verifiable Rewards).

This method provides training signals through checks that can automatically determine correctness, such as mathematical verification and unit testing, replacing the traditional process that relies on human judgment. This enables the model to improve itself efficiently and at scale.
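As a concrete illustration (not taken from the paper), a verifiable reward can be as simple as an automatic check on the model's final answer; the \boxed{...} answer-extraction convention below is an assumption for the sketch:

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final boxed answer matches the reference, else 0.0.
    The reward comes from an automatic check, not a human or a learned reward model."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
print(verifiable_reward(r"... the answer is \boxed{41}.", "42"))     # 0.0
```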

However, RLVR has always faced a key bottleneck in practice: the exploration mechanism is extremely prone to imbalance. Either exploration is too restricted, leading to entropy collapse, or it spirals out of control, causing entropy explosion.

To break through this bottleneck, a research team from the Shanghai AI Laboratory, Fudan University, and other institutions proposed the Selective Entropy Regularization method (SIREN). Through a triple mechanism of defining the exploration scope, focusing on key decisions, and stabilizing the training process, it achieves precise regulation of exploration behavior.

Experiments show that this method not only achieves significant performance improvements on multiple mathematical reasoning benchmarks but, more importantly, makes the model's exploration process more efficient and controllable.

Let's take a detailed look below:

Core Dilemma: The "Catch-22" Trap of Exploration

In RLVR training, researchers expect the model to continuously explore diverse problem - solving paths to avoid prematurely falling into local optima.

A natural idea is to introduce entropy regularization.

This is a classic way to encourage exploration in reinforcement learning. Its core idea is simple: add a term to the optimization objective that encourages the model to maintain a certain degree of "uncertainty" at each generation step, rather than prematurely concentrating all of the probability mass on a few tokens.

Specifically, it calculates the entropy of the output distribution at each step (a measure of how spread out the distribution is) and then adds the average entropy of the entire reasoning trajectory to the training objective, with a coefficient β controlling the exploration intensity.

The two formulas below are, respectively, the per-step entropy and the entropy-regularized optimization objective.
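The original article presents these formulas as images; a standard formulation consistent with the description above (notation assumed here: V is the vocabulary, s_t the generation prefix at step t, R(τ) the verifiable reward of a trajectory τ of length T) is:

```latex
% Per-step entropy of the policy's output distribution
H_t \;=\; -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid s_t)\,\log \pi_\theta(v \mid s_t)

% Entropy-regularized objective: verifiable reward plus the trajectory-averaged
% entropy, weighted by the exploration coefficient \beta
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[\, R(\tau) \,\big]
\;+\; \beta \cdot \frac{1}{T}\sum_{t=1}^{T} H_t
```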

However, this strategy is prone to two extremes in the complex setting of large reasoning models (LRMs):

Restricted Exploration (Entropy Collapse)

When β is too small, the entropy term barely has any effect, and the model quickly degenerates into a near-deterministic policy. The average entropy converges rapidly, i.e., entropy collapse occurs. After a few rounds of training, all answers are highly similar and the model settles into a "comfort zone". This entropy collapse not only stifles the model's diversity but also causes its reasoning ability to hit a ceiling early in training, never fully unleashing its potential.

Uncontrolled Exploration (Entropy Explosion)

Conversely, when β is even slightly larger, the model easily loses control in such a large action space (hundreds of thousands of tokens) and over extremely long reasoning trajectories (thousands of generation steps). By the definition of entropy, the flatter the probability distribution, the higher the entropy. With such a large vocabulary, even moving a little probability mass from high-probability words (such as "therefore") to meaningless tokens (such as "<" or "#@$%") can produce a significant increase in entropy.

Worse still, in autoregressive generation this uncertainty accumulates step by step along the trajectory: a slight disorder in the early steps quickly amplifies into a loss of control over the entire reasoning chain. Eventually, in order to "increase the entropy", the model allocates a little probability to every token at every position, and the generated content fills up with meaningless symbols, broken logic, and collapsed semantics. This is a typical entropy explosion.
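To make the vocabulary-size effect above concrete, here is a small numerical illustration (not from the paper): shifting just 4% of the probability mass from the top token onto the tail of a 100,000-token vocabulary already multiplies the entropy several times over.

```python
import torch

V = 100_000  # vocabulary size, roughly the scale mentioned above

def entropy(p):
    """Shannon entropy in nats."""
    return float(-(p * p.log()).sum())

# A confident distribution: 99% on one token, 1% spread over the remaining tail
concentrated = torch.full((V,), 0.01 / (V - 1))
concentrated[0] = 0.99

# Shift a further 4% of the mass from the top token onto the tail
flattened = torch.full((V,), 0.05 / (V - 1))
flattened[0] = 0.95

print(entropy(concentrated))  # ~0.17 nats
print(entropy(flattened))     # ~0.77 nats: a tiny shift, a large entropy jump
```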

The fundamental reason traditional methods fail is that the incentive of entropy regularization is "indiscriminate": it assumes that every token and every position is equally worth exploring. However, the generation process of an LRM has a distinct structure:

At each generation step, only the few highest-ranked tokens are semantically reasonable, while the probabilities of most other tokens are close to zero and meaningless;

Across the whole generated sequence, only a few keywords that act as logical hubs (such as logical connectives, variable names, and conclusion-guiding words) truly affect the reasoning direction, while the many routine tokens used for syntactic filling should remain highly certain so that the reasoning stays coherent.

Because it ignores this non-uniform distribution of exploration value, traditional entropy regularization not only struggles to guide exploration effectively but also easily destabilizes training, and can even run counter to the original goal of improving reasoning ability.

The following figure shows that before training the model's probability distribution is highly concentrated and only a small number of positions are logically critical and worth exploring; after excessive exploration, the probability mass is spread out and the generated content becomes chaotic.

The Way Out: Giving Exploration a "Precise Navigation System"

To address these shortcomings, the researchers proposed the Selective Entropy Regularization method (SIREN), which achieves fine-grained regulation of the exploration process through structured constraints. SIREN consists of three core mechanisms:

1. Define the Exploration Scope (Top-p Mask)

At each generation step, restrict the entropy calculation strictly to the core set of highest-probability tokens, so that exploration takes place only among semantically reasonable candidates and ineffective exploration is avoided.
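A minimal sketch of what such a top-p (nucleus) restriction could look like is below: it computes the entropy only over the smallest set of tokens whose cumulative probability reaches p. This illustrates the idea and is not the paper's exact implementation.

```python
import torch

def top_p_entropy(logits, p=0.95):
    """Entropy computed only over the top-p nucleus of each distribution,
    so tokens in the near-zero tail contribute nothing to the bonus."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, _ = probs.sort(descending=True, dim=-1)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep the smallest prefix whose cumulative mass reaches p
    nucleus = (cumulative - sorted_probs) < p
    kept = sorted_probs * nucleus
    kept = kept / kept.sum(dim=-1, keepdim=True)  # renormalize over the nucleus
    return -(kept * (kept + 1e-12).log()).sum(dim=-1)

# Example: per-step entropies for 4 generation steps over a 50k-token vocabulary
print(top_p_entropy(torch.randn(4, 50_000), p=0.95))
```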

2. Identify Key Decision Points (Peak-entropy Mask)

Automatically identify the logical keywords (such as reasoning connectives and hypothesis-guiding words) whose entropy is significantly higher than the average level within the generated sequence, and concentrate the exploration incentive on these key positions.
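One plausible way to select such "peak-entropy" positions (an assumption for illustration, not necessarily the paper's exact criterion) is to keep only the steps whose entropy exceeds the trajectory mean:

```python
import torch

def peak_entropy_mask(step_entropies, alpha=1.0):
    """Keep positions whose entropy is at least alpha times the trajectory mean;
    only these positions receive the exploration bonus."""
    threshold = alpha * step_entropies.mean()
    return step_entropies >= threshold

# Example: a toy trajectory of 8 generation steps
h = torch.tensor([0.1, 0.1, 1.5, 0.2, 0.1, 2.0, 0.1, 0.3])
mask = peak_entropy_mask(h)
bonus = (h * mask).sum() / mask.sum()  # entropy bonus averaged over kept positions
print(mask, bonus)
```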

3. Stabilize the Training Process (Self-anchored Regularization)

Change the entropy objective from outright maximization to keeping entropy within a reasonable range: a dynamic anchoring mechanism holds the exploration intensity in a controllable band and avoids training instability.
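The "maintain a reasonable range" idea can be sketched as a penalty that pulls the current entropy toward an anchor value rather than pushing it ever higher; the specific anchoring scheme below is an assumption for illustration.

```python
import torch

def self_anchored_entropy_loss(current_entropy, anchor_entropy, beta=0.01):
    """Instead of maximizing entropy, penalize its squared deviation from an
    anchor, keeping the exploration intensity in a controllable band."""
    return beta * (current_entropy - anchor_entropy.detach()).pow(2).mean()

# Example: anchor taken from entropy measured at an earlier point in training
current = torch.tensor([0.9, 1.4, 0.7], requires_grad=True)
anchor = torch.tensor([1.0, 1.0, 1.0])
loss = self_anchored_entropy_loss(current, anchor)
loss.backward()
print(loss.item(), current.grad)
```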

For the first time within the RLVR framework, this method achieves precise control over the scope, position, and intensity of exploration, providing a reliable recipe for the stable training of large reasoning models.

The following figure shows the process of the SIREN method:

Experimental Verification: Effective Exploration Promotes Performance Improvement

Experimental results show that SIREN achieves significant improvements on different models and datasets.

The following are the experimental results of SIREN on Qwen2.5-Math-7B:

And the experimental results of SIREN on other base models:

The above results show that:

On Qwen2.5-Math-7B, the average maj@k of SIREN reaches 54.6%, exceeding the strongest baseline by 4.8%.

On the most challenging AIME24/25, the improvement reaches 6.6%.

It remains stably effective across model scales from 1.5B to 8B and across different base models.

So, where do these performance improvements come from?

Analysis shows that this is exactly the fundamental change brought about by effective exploration. Compared with traditional entropy regularization methods, SIREN shows a more reasonable and effective exploration mode.

In the following figure, SIREN shows a higher pass@k, and the exploration boundary is significantly expanded:

It can also avoid perplexity collapse and maintain good answer diversity:

The following figure shows a smooth and controllable training process in which exploration first increases and then slowly converges:

Summary

This research aims to solve the policy-exploration problem that large language models face in RLVR training.

Through systematic empirical analysis, the researchers found that traditional exploration mechanisms are extremely prone to imbalance in large-scale action spaces and long-sequence generation, causing the model to fall into the dilemmas of entropy collapse and entropy explosion.

To break through this bottleneck, the team proposed the Selective Entropy Regularization method (SIREN). Through a triple mechanism of defining the exploration scope, focusing on key decisions, and stabilizing the training process, it achieves precise regulation of exploration behavior. Experiments show that the method not only brings significant performance gains on multiple mathematical reasoning benchmarks but, more importantly, makes the model's exploration process more efficient and controllable.

The team notes that, looking ahead, as reinforcement learning becomes the mainstream method for post-training large models, achieving stable, controllable, and efficient exploration will become the core issue for unleashing the potential of large models and breaking through performance bottlenecks. The selective exploration-regulation mechanism proposed in this research provides a feasible solution for fine-grained control of exploration.

The team hopes this work can inspire the training paradigm of next-generation reasoning models and help large models go further in complex tasks such as mathematics, code, and scientific reasoning, as well as in broader application domains.

Paper link: https://arxiv.org/abs/2509.25133

Project homepage: https://github.com/Linn3a/siren

This article is from the WeChat public account "Quantum Bit". The author is the SIREN team. It is published by 36Kr with authorization.