Farewell to purely reward-driven trial and error. With second attempts and reflective distillation, performance on complex tasks improves by up to 81%.
[Introduction] Reinforcement learning has become one of the core methods in the post-training phase of large models. However, a long-standing problem remains unsolved: feedback in the real environment is often sparse and delayed, making it difficult for models to infer how to adjust their behavior from simple reward signals.
From a more macroscopic perspective, the learning methods of large models are undergoing a clear evolutionary path.
Early supervised fine-tuning (SFT) mainly relied on fixed examples for imitation learning, which could help models learn and reproduce the patterns in the examples. However, this method highly depends on manual data, which not only struggles to cover various situations in complex environments but also fails to support the continuous self-evolution of models.
Subsequently, reinforcement learning with verifiable rewards (RLVR) placed models in interactive environments, optimizing through trial and error on reward signals so that models could gradually improve their policies on dynamic tasks. However, this method still relies mainly on scalar rewards: models must implicitly infer how to correct their behavior from sparse or delayed feedback, which often makes exploration inefficient and training unstable.
In contrast, humans facing complex tasks typically run a cycle of "experience - reflection - retry". After receiving feedback, a person actively analyzes why the attempt failed, distills the lessons, and applies corrective strategies on the next try, rather than blindly repeating attempts until one happens to succeed.
Recently, research teams from the University of Southern California and the University of Pennsylvania jointly proposed a new training paradigm - Experiential Reinforcement Learning (ERL), attempting to introduce the idea of "experiential learning" into the reinforcement learning process, enabling models not only to optimize behavior through trial and error but also to reflect and internalize experiences into strategies.
Paper link: https://arxiv.org/abs/2602.13949
ERL attempts to explicitly introduce this experiential learning cycle into the training process. After receiving a task, the model first makes an attempt, then generates self-reflection based on the environmental feedback, makes a second attempt based on the reflection, and internalizes successful behaviors into the basic strategy.
Figure 1: By introducing the "experience - reflection - internalization" cycle, ERL advances reinforcement learning from simply relying on reward signals to an experience-based learning method, enabling more direct behavior correction compared to supervised fine-tuning and traditional reinforcement learning.
From a mechanism perspective, traditional reinforcement learning mainly relies on the trial-and-error process and scalar reward signals for optimization.
In this process, a large amount of feedback information originally contained in the environment is often compressed into a simple reward value, such as success or failure, and many details that can help understand the reasons for errors are difficult to utilize.
At the same time, traditional methods usually lack a mechanism for accumulating experience across rounds: each interaction is a largely independent exploration, and models can only approach an effective policy through continued trial and error, which makes the learning process inefficient and unstable.
In contrast, ERL attempts to directly use the information in the feedback to generate reflections and continuously retain effective strategies through the experience internalization mechanism, enabling behavior improvement to accumulate in subsequent tasks, thus forming a more stable learning process.
Figure 2: Traditional reinforcement learning mainly relies on repeated trial and error for exploration, while ERL analyzes failures and corrects strategies through a reflection mechanism, enabling continuous accumulation of behavior improvement.
Second Attempt Mechanism and Experience Internalization
Under the ERL framework, each training round includes three key generation steps: the first attempt, reflection, and the second attempt.
The model first generates a first answer based on the input task and interacts with the environment to obtain feedback and the corresponding reward signal. Then, it generates a reflection based on this attempt and its feedback to summarize possible improvement directions. Finally, the model makes a second attempt based on the reflection and obtains new results and rewards (Figure 3).
During training, the outputs of all three steps participate in the standard reinforcement-learning policy update, but their rewards come from different sources. The first and second attempts use the reward signals obtained directly from their interactions with the environment, while the reward for the reflection is tied to the second attempt: if the reflection helps produce a better result, it receives a higher reward.
This design essentially transforms "whether the reflection is effective" into a learnable signal, enabling the model to gradually learn to generate more helpful reflection content.
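The three-step round and its reward assignment can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Model` and `Env` stubs are hypothetical placeholders, and only the reward-assignment logic follows the mechanism described above.

```python
class Model:
    """Stub policy: output depends only on how much context it sees."""
    def generate(self, *context):
        # Later steps receive more context and (here) produce longer outputs.
        return "x" * len(context)

class Env:
    """Stub environment: longer attempts score higher, so the
    reflection-conditioned second attempt wins."""
    def step(self, task, attempt):
        return "feedback", float(len(attempt))

def erl_round(model, env, task):
    # Step 1: first attempt, rewarded directly by the environment.
    attempt1 = model.generate(task)
    feedback1, reward1 = env.step(task, attempt1)

    # Step 2: reflection conditioned on the first attempt and its feedback.
    reflection = model.generate(task, attempt1, feedback1)

    # Step 3: second attempt conditioned on the reflection.
    attempt2 = model.generate(task, attempt1, feedback1, reflection)
    _, reward2 = env.step(task, attempt2)

    # All three outputs join the policy update. Attempts carry their own
    # environment rewards; the reflection inherits the second attempt's
    # reward, turning "was the reflection helpful" into a learnable signal.
    return [(attempt1, reward1), (reflection, reward2), (attempt2, reward2)]
```

With these stubs, the rollout contains three (output, reward) pairs, and the reflection's reward always equals the second attempt's, which is exactly the binding described above.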
Meanwhile, ERL also introduces an additional "experience internalization" step to transform the improvements brought by reflection into the model's ability to be directly used during inference.
Concretely, when the second attempt receives a high reward, an additional distillation objective is added to training that teaches the model to generate the improved answer directly from the original input, without the reflection in its context.
This process is essentially context distillation: it "writes" the behavior corrections obtained through reflection into the base policy, so that the model can reproduce the improvement at deployment time without explicit reflection.
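The internalization step reduces to building supervised pairs that skip the reflection. The sketch below is an assumption-laden illustration: the function name, data format, and reward threshold are all invented for clarity, not taken from the paper.

```python
REWARD_THRESHOLD = 0.5  # assumed cutoff for "high reward"; illustrative only

def internalization_targets(task, attempt2, reward2, threshold=REWARD_THRESHOLD):
    """If the reflection-conditioned second attempt scored highly, emit a
    supervised (input, target) pair that maps the ORIGINAL input, with no
    reflection in the prompt, directly to the improved answer. Training on
    such pairs writes the correction into the base policy."""
    if reward2 >= threshold:
        return [{"input": task, "target": attempt2}]
    return []
```

Mixing these pairs into the regular update is what lets the deployed model reproduce the improvement without generating a reflection first.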
Figure 3: Schematic diagram of the ERL training process
Overall, this mechanism brings reflection into the reinforcement-learning trajectory, letting the model perform local behavior correction within a single round, while distillation consolidates effective experience into long-term capability, forming a closed learning loop of "generate - reflect - improve - internalize".
Significantly Improve Performance in Complex Environments
The paper verifies the effectiveness of ERL on three types of classic tasks, including two sparse reward environments, Frozen Lake and Sokoban, and the multi-hop question-answering task, HotpotQA.
It should be noted that the Frozen Lake and Sokoban environments in the paper are not the text versions commonly used in language-model evaluations. In many existing setups, models are explicitly told the meanings of environment symbols, the rules, or the task structure; this study deliberately withholds such prior information.
The model can only obtain observations and reward signals by interacting with the environment, and must infer the semantics of symbols, the consequences of actions, and the task goal on its own, gradually forming a policy.
This design is closer to the real unknown environment, aiming to evaluate the model's ability to learn and self-improve through experience in the absence of prior knowledge. The results show that ERL outperforms the traditional RLVR method on all tasks (Figure 4).
Figure 4: Comparison of the final performance between ERL and RLVR
The gain is most pronounced in the Sokoban environment, which requires long-horizon planning and strategic reasoning, with a maximum improvement of up to 81%. Frozen Lake improves by about 27%, while HotpotQA, where feedback is denser and the environment simpler, improves by about 11%.
The researchers point out that this result indicates that ERL has more prominent advantages in scenarios that require inferring environmental dynamics and long-term decision-making.
Faster Converging Training Dynamics
From the training curve, ERL maintains a higher reward level throughout the training process and converges faster overall under the same training budget, continuously widening the gap with the traditional RLVR method (Figure 5).
This is particularly evident in sparse-reward, long-horizon environments: when reward arrives only at the end, policy-gradient updates that rely purely on scalar rewards often need a large number of successful trajectories to produce stable improvement. By introducing the "failure - reflection - retry" structure within a single round, ERL turns the feedback from one interaction into executable correction directions.
The paper argues that reflection provides an additional intermediate error-correction channel during training: the model need not rely entirely on sparse final rewards to infer how to improve, but can generate clearer correction cues after receiving feedback and use them in subsequent attempts. This concentrates training updates on trajectories close to success, reduces exploration of unproductive regions of the policy space, and yields faster overall convergence and a more stable curve.
Figure 5: Comparison of training efficiency between ERL and RLVR
Ablation Experiments: The Impact of Memory and Reflection Steps on Training Performance
To better understand which mechanisms contribute to the performance improvement, the paper conducts an ablation analysis on the key components of ERL.
The study constructs two variants: one removes the structured reflection step, so the model no longer generates a reflection based on the first attempt but simply tries again with the existing context; the other removes the cross-round memory mechanism, so reflections are still generated and used within the current round but are not saved for subsequent tasks.
The results show that removing the reflection mechanism causes the largest performance drop (Table 1). Without a structured summary of why the first attempt failed, the second attempt amounts to a simple "try again" and rarely achieves effective error correction, so the overall reward falls significantly. This indicates that reflection is the core source of ERL's immediate improvement: it gives the model actionable correction cues that make attempts within the same round more targeted.
In contrast, removing the memory mechanism mainly slows convergence. The model can still improve through reflection within a single round, but without the ability to accumulate error-correction experience across tasks, each interaction starts largely from scratch and overall learning is slower. This suggests that the memory mechanism's role is to retain effective strategies over time, letting improvements accumulate during training into more stable policy gains.
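A cross-round memory of this kind can be sketched as a small buffer of reflections that proved useful, carried into later prompts. Everything here, including the class name, capacity, and "keep only rewarded reflections" filter, is an assumption for illustration, not the paper's API.

```python
class ReflectionMemory:
    """Hypothetical cross-round memory: keeps reflections whose second
    attempt earned a positive reward, up to a fixed capacity, so later
    tasks need not re-derive the same corrections from scratch."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = []

    def add(self, reflection, reward):
        # Retain only reflections that actually led to improvement.
        if reward > 0:
            self.entries.append(reflection)
            self.entries = self.entries[-self.capacity:]  # evict oldest

    def as_context(self):
        # Concatenated into the prompt of subsequent tasks.
        return "\n".join(self.entries)
```

Ablating this component corresponds to never calling `add`, so every round's prompt starts with an empty experience context.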
Table 1: Final performance of the ablation experiments
Paradigm Evolution from Imitation Learning to Experiential Learning
The author summarizes the current training methods of large models as a gradually evolving path: from supervised fine-tuning relying on example imitation, to reinforcement learning relying on reward signals for optimization, and then to experiential reinforcement learning (ERL) emphasizing learning from experience.
Compared with the first two, ERL provides an explicit path for turning failure into a usable learning signal by introducing reflection and internalization mechanisms, allowing the model to continually accumulate behavior-correction experience during interaction. This perspective also echoes the recent view that "experiential data will become the main training source for the next generation of AI".
The paper argues that ERL demonstrates a possible path toward experience-driven AI systems: through reflection and experience internalization, a model can continually consolidate error-correction strategies during training and apply them directly at deployment without extra inference cost.
If this direction is further verified, it may become an important foundation for building long-term autonomous agents, gradually evolving reinforcement learning from a simple optimization method into a training paradigm closer to the human learning process.
Conclusion
The development of reinforcement learning is undergoing a subtle but important transformation. From trial-and-error optimization relying on reward signals to emphasizing experience transformation and behavior correction, researchers are trying to enable models to have learning abilities closer to humans.
The experience - reflection - internalization framework proposed by experiential reinforcement learning (ERL) demonstrates a possible path to transform the interaction process into continuous learning ability and provides new ideas for building agents capable of long-term self-improvement.
Reference: https://arxiv.org/abs/2602.13949
This article is from the WeChat official account "New Intelligence Yuan". Editor: LRST. Republished by 36Kr with authorization.