Can RL Training Produce a "Question-Predicting Master"? Solving the Diversity Crisis and Catastrophic Forgetting in Model Fine-tuning
Why do large models tend to become more "homogeneous" after RL training? Among the many proposed remedies, the answer may be simpler than expected: start by changing the KL term.
In recent years, Reinforcement Learning with Verifiable Reward (RLVR) has become an important path to improve the reasoning ability of large language models.
From mathematical problem solving to code generation and SQL tasks, a large body of work has shown that RL can significantly improve model success rates in single-answer scenarios.
However, a key phenomenon remains under-explained: why do many RL-fine-tuned models show higher Pass@1 yet lower Pass@k when multiple attempts are allowed?
This suggests the model becomes better at "hitting the correct answer once" but loses its original wealth of problem-solving paths and candidate solutions. Moreover, this phenomenon is often accompanied by catastrophic forgetting and a decline in cross-domain generalization.
Existing methods usually focus on reward design, sampling strategies, or entropy regulation. However, the research team found that a more fundamental problem has long been overlooked: how should the divergence term in the RL objective be chosen?
To address this issue, a joint research team from Fudan University, Infinite Light-Year, the Shanghai Institute for Scientific Intelligence (SIIS), and Shanghai Chuangzhi College focused on the long-neglected KL divergence term and approached the problem from the perspective of divergence selection. The work has been accepted at ICLR 2026.
Paper title: The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Paper link: https://arxiv.org/abs/2509.07430
Code link: https://github.com/seamoke/DPH-RL
Li Long, a doctoral student at Fudan University and an intern at Infinite Light-Year, and Zhou Yujian, a doctoral student at Fudan University and Shanghai Chuangzhi College, are the co-first authors. Qu Chao, a researcher at Fudan University and an AI scientist at the Shanghai Institute for Scientific Intelligence, is the corresponding author.
The Dilemma of Divergence Selection: The Cost of Reverse-KL and the Absence of Constraints
In most RL post-training methods, the common practice is either to use reverse-KL or to drop the divergence constraint entirely. However, both choices have obvious drawbacks:
Reverse-KL is essentially mode-seeking, encouraging the policy to shrink toward a few high-probability modes;
Removing the divergence term means the model has no explicit mechanism protecting its original knowledge distribution during training.
Both settings drive the model to concentrate on a handful of "familiar answers", leading to a drop in Pass@k, forgetting of existing abilities, and weakened cross-task generalization. Expressed more formally, traditional RLVR can be summarized as:

$$\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta\,D\big(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big)$$

Here, πθ is the current policy and πref is the reference policy (usually the initial model or the SFT model). The crux of the problem is that if the divergence D is chosen poorly, the second term no longer acts as a "protection mechanism" but instead becomes a "diversity compressor".
If the base model is regarded as a "knowledge distribution" that has mastered a large amount of knowledge and diverse solutions, then the goal of RL fine-tuning should be to further improve task performance while retaining existing abilities.
In practice, however, many RL methods keep reinforcing a few high-reward trajectories: the model gradually favors the one or two solutions that earn rewards most easily and abandons other equally valid but less frequent paths.
The research team ran an illustrative experiment: through SFT, the model learned several distinct response styles, and which style it used could be identified from the prefix alone. After standard GRPO training, however, the model retained almost only one style.
The research team therefore argues that what RLVR really needs to solve is not only "how to learn stronger" but also how to preserve the model's original diversity while optimizing the reward.
Method: Reshape Divergence from a "Constraint Term" to a "Diversity - Preserving Mechanism"
Based on the above observations, the team proposed DPH-RL (Diversity-Preserving Hybrid RL). The core idea of this work is:
Divergence should not just be an incidental regularization term during training but should be redesigned as a mechanism to actively protect the diversity of the model.
Specifically, instead of the traditional reverse-KL, a divergence with stronger mass-covering properties is introduced, for example the forward KL divergence $D_{\mathrm{KL}}(\pi_{\mathrm{ref}}\,\|\,\pi_\theta)$.
Unlike reverse-KL, which tends to shrink onto a single mode, this kind of divergence encourages the new policy to keep covering the variety of solutions already present in the reference policy. In other words, it does not force the model to "remember only the optimal path" but reminds it: "You can keep getting stronger, but don't forget what you already mastered."
Mechanically, the method can be understood as a rehearsal mechanism: during training, the model continually refers back to the distribution of the initial policy, retaining the original knowledge coverage and avoiding excessive contraction during reinforcement learning.
Taking the forward KL mentioned above as an example:

$$D_{\mathrm{KL}}(\pi_{\mathrm{ref}}\,\|\,\pi_\theta)=\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\left[\log\frac{\pi_{\mathrm{ref}}(y\mid x)}{\pi_\theta(y\mid x)}\right]$$

The expectation here is taken over the reference policy πref. As long as the reference policy covers some reasonable solutions, the new policy πθ cannot easily push their probabilities toward zero. Forward KL therefore has a stronger mass-covering tendency and is better suited as a "diversity-preserving" tool.
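The mode-seeking versus mass-covering asymmetry can be checked numerically. Below is a minimal sketch with made-up discrete distributions (not from the paper): a reference policy that spreads mass over two solution modes, and a collapsed policy that keeps only one. Collapsing is cheap under reverse KL but expensive under forward KL.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions: pi_ref covers two solution modes; pi_theta collapses onto one.
pi_ref = [0.45, 0.45, 0.10]
pi_theta = [0.98, 0.01, 0.01]

reverse_kl = kl(pi_theta, pi_ref)  # KL(pi_theta || pi_ref): mode-seeking penalty
forward_kl = kl(pi_ref, pi_theta)  # KL(pi_ref || pi_theta): mass-covering penalty

print(f"reverse KL = {reverse_kl:.3f}")  # small: collapsing is barely penalized
print(f"forward KL = {forward_kl:.3f}")  # large: dropped modes are punished hard
```

The collapsed policy pays roughly twice as much under forward KL as under reverse KL here, which is exactly why the forward direction discourages abandoning minority solutions.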
Furthermore, the paper also introduces the JS divergence as a more stable, symmetric alternative. Denoting $t=\pi_\theta(y\mid x)/\pi_{\mathrm{ref}}(y\mid x)$, the corresponding generating function can be written as:

$$f_{\mathrm{JS}}(t)=\frac{1}{2}\left[t\log\frac{2t}{1+t}+\log\frac{2}{1+t}\right]$$

so that $D_{\mathrm{JS}}=\mathbb{E}_{y\sim\pi_{\mathrm{ref}}}\big[f_{\mathrm{JS}}(t)\big]$, yielding a smoother form of distributional constraint.
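As a sanity check, the generator form of JS can be verified numerically against its textbook definition, D_JS(P‖Q) = ½KL(P‖M) + ½KL(Q‖M) with M the even mixture of P and Q. A small sketch with made-up discrete distributions:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_direct(p, q):
    """JS divergence via its definition with the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def f_js(t):
    """Generating function of JS as an f-divergence: D_f(P || Q) = E_Q[f(p/q)]."""
    return 0.5 * (t * math.log(2 * t / (1 + t)) + math.log(2 / (1 + t)))

def js_via_generator(p, q):
    """JS divergence computed only from likelihood ratios, weighted by Q."""
    return sum(qi * f_js(pi / qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]   # plays the role of pi_theta
q = [0.3, 0.4, 0.3]   # plays the role of pi_ref
print(js_direct(p, q), js_via_generator(p, q))  # the two values agree
```

The generator route needs only the ratio t on samples from q, which is what makes this formulation convenient for training.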
In addition, DPH-RL is efficient to implement. The authors compute the f-divergence via its generating function, which only requires pre-sampling from the initial πref once; no online reference model needs to be maintained during training.
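Concretely, the generator-based computation needs only (a) responses sampled once from πref and (b) their cached reference log-probabilities; at training time, only πθ's log-probs are recomputed. A toy Monte-Carlo sketch (the discrete "policies" and names are illustrative assumptions, not the authors' code):

```python
import math
import random

random.seed(0)

# Hypothetical discrete "policies" over five candidate answers.
pi_ref = [0.35, 0.30, 0.20, 0.10, 0.05]
pi_theta = [0.60, 0.20, 0.10, 0.06, 0.04]

# Offline phase: sample once from pi_ref and cache its log-probs,
# so no reference model is needed during training.
samples = random.choices(range(5), weights=pi_ref, k=200_000)
cached_ref_logp = [math.log(p) for p in pi_ref]

def f_forward_kl(t):
    """Generating function of forward KL(pi_ref || pi_theta): f(t) = -log t."""
    return -math.log(t)

# Training phase: Monte-Carlo estimate of E_{y ~ pi_ref}[ f(pi_theta(y) / pi_ref(y)) ].
estimate = sum(
    f_forward_kl(math.exp(math.log(pi_theta[y]) - cached_ref_logp[y]))
    for y in samples
) / len(samples)

exact = sum(p * math.log(p / q) for p, q in zip(pi_ref, pi_theta))
print(f"MC estimate = {estimate:.4f}, exact forward KL = {exact:.4f}")
```

The estimate converges to the exact divergence while touching the reference distribution only through the cached samples.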
Avoiding an online reference model keeps the training cost low and makes the method well suited to real large-scale post-training scenarios. During training, DPH-RL does not apply the same constraint to all samples in a one-size-fits-all manner, but first divides the data into two parts:
Exploration set Dexp: for difficult samples the model has not yet mastered, no KL penalty is applied, allowing the model to explore high-reward solutions more aggressively. Here the standard PPO-clip objective is used:

$$L_{\mathrm{exp}}(\theta)=\mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\;\mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\hat{A}_t\right)\right],\qquad r_t(\theta)=\frac{\pi_\theta(y_t\mid x)}{\pi_{\theta_{\mathrm{old}}}(y_t\mid x)}$$
Near-perfect set Dpef: for samples the model has essentially mastered, responses are sampled from πref on Dpef, and the f-divergence is relied upon to maintain diversity on correct samples. More intuitively, on these samples the model no longer chases higher rewards, but tries not to drift away from the behavior distribution that already performs well. Its general form is:

$$L_{\mathrm{pef}}(\theta)=\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\left[f\!\left(\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right)\right]$$

where f is the generating function of the chosen f-divergence (e.g. f(t) = −log t for forward KL).
Therefore, the overall training process is best expressed case by case: rather than superimposing an "exploration term + preservation term" on every sample at once, DPH-RL first determines whether a sample belongs to Dexp or Dpef and then computes the corresponding loss.
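This per-sample routing can be sketched as follows. It is a simplified scalar illustration (the function names, the `split` labels, and the single-number probability ratios are assumptions for readability, not the paper's implementation):

```python
import math

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO-clip surrogate, negated so that lower is better."""
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return -min(ratio * advantage, clipped * advantage)

def f_div_loss(ratio, f=lambda t: -math.log(t)):
    """Generator-form divergence term on a reference sample (forward KL here)."""
    return f(ratio)

def dph_loss(sample):
    """Route each sample by its split, mirroring the case-by-case objective."""
    if sample["split"] == "exp":      # exploration set: unconstrained PPO-clip
        return ppo_clip_loss(sample["ratio"], sample["advantage"])
    else:                             # near-perfect set: diversity-preserving term
        return f_div_loss(sample["ratio"])

batch = [
    {"split": "exp", "ratio": 1.1, "advantage": 0.5},   # hard sample: explore
    {"split": "pef", "ratio": 0.9},                     # mastered sample: preserve
]
loss = sum(dph_loss(s) for s in batch) / len(batch)
print(f"{loss:.4f}")
```

Each sample contributes exactly one term, so the two objectives never compete on the same example.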
A Better Choice of Divergence Can Simultaneously Improve Performance, Preserve Diversity, and Enhance Generalization
Experimental Setup
The paper uses Llama3.1-8B as the experimental model, trains only on the BIRD dataset, and evaluates on BIRD (in-domain) as well as on Spider and mathematical tasks to test OOD generalization.
In-Domain Performance: Recovery of Pass@k
On the BIRD dataset, the results clearly show:
GRPO and DAPO may improve Greedy (equivalent to Pass@1) performance, but their Pass@8 and Pass@16 scores fall significantly below the base model's, confirming diversity collapse;
RKL (reverse-KL) also performs poorly, with Pass@k decreasing;
DPH-F and DPH-JS not only achieve the highest Greedy scores, but their Pass@8 also exceeds the base model's. In particular, DPH-JS's Pass@8 is 4.3% higher than GRPO's. At larger k, DPH-RL stays closer to the base model, alleviating the Pass@k collapse.
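For reference, Pass@k is typically computed with the standard unbiased estimator used in code-generation benchmarks: given n sampled answers per problem of which c are correct, the chance that at least one of k draws is correct is 1 − C(n−c, k)/C(n, k). The sketch below (with hypothetical per-problem counts) shows why spreading correct samples across problems, rather than concentrating them, lifts Pass@k at large k:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: P(at least one correct among k of n samples, c correct)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Two hypothetical models with the same average Pass@1 over two problems (0.25):
# model A concentrates its correct samples on one problem; model B spreads them.
model_a = [(16, 8), (16, 0)]   # (n samples, c correct) per problem
model_b = [(16, 4), (16, 4)]

for k in (1, 8, 16):
    avg_a = sum(pass_at_k(n, c, k) for n, c in model_a) / len(model_a)
    avg_b = sum(pass_at_k(n, c, k) for n, c in model_b) / len(model_b)
    print(f"k={k:2d}: model A = {avg_a:.3f}, model B = {avg_b:.3f}")
```

At k = 16, model B reaches 1.0 while model A plateaus at 0.5: a problem with zero correct samples can never be rescued by more draws, which is the Pass@k cost of collapsed diversity.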
Cross-Domain and OOD Performance: Maintenance of Generalization Ability
The research team treats the Spider dataset (also a SQL task) as cross-domain and the mathematical datasets as out-of-domain. All RL models trained only on the SQL dataset BIRD show varying degrees of performance decline when the distribution shifts.
As shown in the figure, as the difference between the