The "lifelong self-learning" AI is here. MIT has proposed Self-Distillation Fine-Tuning (SDFT), bidding farewell to catastrophic forgetting once and for all.
Is it possible for an artificial intelligence (AI) model to learn new skills without forgetting old knowledge?
Recently, a team from the Massachusetts Institute of Technology (MIT) proposed an innovative method called Self-Distillation Fine-Tuning (SDFT), which enables a model to keep learning new skills one after another. Not only does it outperform traditional supervised fine-tuning in accuracy on new tasks, but it also accumulates abilities with almost “zero forgetting.”
For a long time, although AI systems have shown strong performance, they often become “static” after deployment and have difficulty evolving continuously through parameter updates. The core challenge lies in how to absorb new knowledge without losing existing capabilities.
Experiments show that SDFT can help a single model gradually master multiple skills without performance regression during continual learning, offering a new path toward truly “lifelong learning” AI systems.
How does SDFT solve the problem of continual learning?
To enable AI to learn continuously like humans, the current mainstream approaches face two major obstacles.
On the one hand, although reinforcement learning, which trains on the model's own (on-policy) outputs, can effectively reduce forgetting, it relies on explicit reward functions that are extremely difficult to design in practice. On the other hand, supervised fine-tuning (SFT), which learns directly from expert demonstrations, is simple to implement, but it is essentially off-policy learning: the model passively imitates a fixed, past expert data distribution. Once it starts learning a new task, it tends to drift away from its original behavior, leading to catastrophic forgetting: it learns new things while forgetting old ones.
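To make the contrast concrete, here is a minimal, hypothetical sketch of a standard SFT update for a Hugging Face-style causal language model in PyTorch. The names (sft_step, the model/tokenizer/optimizer objects) and the prompt handling are illustrative assumptions, not the paper's code; the point is simply that the loss is teacher-forced cross-entropy on a fixed expert response, so the model never learns from its own generations.

```python
import torch.nn.functional as F

def sft_step(model, tokenizer, query, expert_response, optimizer):
    """One off-policy SFT update: imitate the fixed expert tokens via cross-entropy."""
    device = next(model.parameters()).device
    # Boundary tokenization is simplified for the sketch.
    ids = tokenizer(query + expert_response, return_tensors="pt")["input_ids"].to(device)
    n_prompt = tokenizer(query, return_tensors="pt")["input_ids"].shape[1]

    logits = model(ids).logits[:, :-1, :]   # position t predicts token t + 1
    targets = ids[:, 1:]

    # Only the expert-response tokens contribute to the loss (prompt tokens are skipped).
    loss = F.cross_entropy(
        logits[:, n_prompt - 1:, :].reshape(-1, logits.shape[-1]),
        targets[:, n_prompt - 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```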
Figure | SFT is usually used to learn from expert demonstration datasets, but its off-policy nature can lead to catastrophic forgetting of general abilities. The research team proposed SDFT, which converts expert demonstrations into on-policy learning signals by using a demonstration-conditioned version of the model as its own teacher. In this way SDFT achieves true continual learning, allowing the model to keep improving as new tasks arrive without degrading existing capabilities.
The core of SDFT is to cleverly exploit the powerful in-context learning ability of large models, converting static demonstrations into dynamic on-policy training signals. During training, the model plays two roles at once. As the “teacher,” conditioned on both the task input and the expert demonstration, it produces a better, more intention-aligned output distribution. As the “student,” it responds based on the task input alone. Training then uses self-distillation to continually narrow the gap between the student's outputs and the teacher's distribution, and the learning is based entirely on trajectories generated by the student itself.
Figure | SDFT uses the model's in-context learning ability to generate on-policy training signals. For each query x, the model plays a dual role: a “student” conditioned only on the query, P = π(·|x), and a “teacher” additionally conditioned on the expert demonstration c, giving a demonstration-aware distribution Q = π(·|x, c). Training performs on-policy updates by minimizing the reverse KL divergence between the “student” and the “teacher.”
This design enables the model to achieve on-policy learning without relying on external rewards, thereby retaining existing capabilities while absorbing new knowledge.
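Under the same assumptions as the SFT sketch above, one SDFT update might look roughly as follows. The details (sampling settings, how the demonstration is formatted into the teacher prompt, token-level versus sequence-level KL) are illustrative assumptions rather than the paper's exact recipe, but the structure mirrors the description above: the student samples its own response from π(·|x), the same model scores that response as a demonstration-conditioned teacher π(·|x, c), and the update minimizes the reverse KL between the two distributions along the student's own trajectory.

```python
import torch
import torch.nn.functional as F

def sdft_step(model, tokenizer, query, demonstration, optimizer, max_new_tokens=128):
    """Hypothetical sketch of one on-policy SDFT update: distill the
    demonstration-conditioned teacher into the query-only student on a
    response the student sampled itself."""
    device = next(model.parameters()).device

    # 1) On-policy rollout: the student (query only) samples its own response.
    q_ids = tokenizer(query, return_tensors="pt")["input_ids"].to(device)
    with torch.no_grad():
        out = model.generate(q_ids, do_sample=True, max_new_tokens=max_new_tokens)
    resp = out[:, q_ids.shape[1]:]

    # Next-token logits along the sampled response, given a prefix.
    def response_logits(prefix_ids):
        ids = torch.cat([prefix_ids, resp], dim=1)
        return model(ids).logits[:, prefix_ids.shape[1] - 1 : -1, :]

    # 2) Student distribution P = pi(.|x); teacher distribution Q = pi(.|x, c),
    #    i.e. the same model with the expert demonstration prepended as context.
    t_ids = tokenizer(demonstration + "\n\n" + query, return_tensors="pt")["input_ids"].to(device)
    student_logits = response_logits(q_ids)
    with torch.no_grad():
        teacher_logits = response_logits(t_ids)

    # 3) Reverse KL(P || Q), averaged over the tokens of the student's own trajectory.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A full implementation would batch queries, handle attention masks and padding, and possibly mask the loss on some tokens to suppress the teacher-imitation artifacts discussed later in this article.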
Is SDFT really effective?
To verify the actual effectiveness of SDFT, the research team designed two types of experimental scenarios, covering skill learning and knowledge acquisition, and systematically compared SDFT with baseline methods such as SFT.
In terms of skill learning, the research selected three tasks: scientific question answering, tool use, and medical reasoning. Experiments show that SDFT achieved higher accuracy than SFT on these new tasks, which reflects better in-distribution generalization ability.
What is more noteworthy is the multi-task continual learning experiment: when the same model learns three different skills sequentially, SDFT gradually accumulates abilities without regression, while SFT shows serious interference. As soon as training shifts to a new task, performance on earlier skills declines rapidly.
These results show that SDFT supports true continual learning, enabling a single model to gradually master multiple skills without catastrophic forgetting.
Figure | In a challenging continual learning experiment, a model is trained on three different tasks sequentially. SDFT learns each task while maintaining performance on the other tasks; in contrast, SFT's performance on each task declines as soon as it starts learning the next one.
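The sequential protocol itself is simple. A hypothetical sketch (the task structure and the evaluate function are assumptions for illustration): train on the tasks one after another with either update rule from the sketches above, and after each task re-evaluate every task seen so far to measure accumulation versus forgetting.

```python
def continual_learning_run(model, tokenizer, optimizer, tasks, update_step, evaluate):
    """Train on tasks sequentially and track performance on all tasks seen so far.

    `tasks` is a list of dicts like {"name": ..., "train": [(query, demo), ...], "eval": ...};
    `update_step` is e.g. sdft_step or sft_step from the sketches above;
    `evaluate` returns a scalar score for the model on a task's evaluation set.
    """
    history = []
    for stage, task in enumerate(tasks, start=1):
        for query, demonstration in task["train"]:
            update_step(model, tokenizer, query, demonstration, optimizer)
        # Catastrophic forgetting shows up here as dropping scores on earlier tasks.
        scores = {seen["name"]: evaluate(model, tokenizer, seen["eval"]) for seen in tasks[:stage]}
        history.append(scores)
    return history
```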
In the knowledge acquisition task, the research team injected new facts that were not covered in the model's training data (such as natural disasters that occurred in 2025). The results show that SDFT reached a strict in-distribution accuracy of 89%, better than SFT's 80%, and close to the performance of a retrieval-augmented generation (RAG) system with ideal retrieval.
More importantly, on out-of-distribution questions that require reasoning by combining new knowledge, SDFT performed almost perfectly, while SFT lagged significantly. This indicates that SDFT can help the model truly integrate new knowledge into its internal representation rather than mechanically memorize it.
In addition, the experiments also revealed two key findings:
First, the larger the model, the more obvious SDFT's advantage. The method hinges on the model's in-context learning ability, and larger models are stronger in-context learners, so they provide higher-quality teacher signals for self-distillation fine-tuning.
Figure | SDFT benefits from model size. In the scientific question answering task, the performance gap between SDFT and SFT widens as the model size increases because larger models have stronger in-context learning abilities.
Second, SDFT can effectively train reasoning models without explicit reasoning process data. When only the final answers are provided for fine-tuning, traditional SFT can cause the model's reasoning behavior to “collapse,” resulting in significantly shorter generated content and a decline in accuracy. In contrast, SDFT can maintain the model's original complex reasoning pattern while improving task accuracy through its unique self-distillation mechanism.
Table | Training a reasoning model with supervised learning from answers only. SFT reduces task performance and overall reasoning ability (manifested as much shorter responses). SDFT avoids this decline by learning from a demonstration-conditioned teacher rather than directly from the demonstrations.
The significance and limitations of SDFT
SDFT provides a clear and effective path toward continual learning from demonstrations. However, its significance and value need to be examined from a broader perspective, and its current limitations must also be acknowledged.
SDFT is not intended to replace reward-based reinforcement learning but to complement it. In scenarios lacking a clear reward signal, SDFT can use demonstrations directly for high-quality initialization, and its diverse, high-quality outputs can serve as a strong starting point for subsequent reinforcement-learning fine-tuning, improving overall training efficiency.
In terms of computational cost, a single SDFT training run costs roughly 2.5 times as much as traditional supervised fine-tuning, because it must generate rollouts during training. However, compared with multi-stage continual learning methods that “fine-tune first and patch later,” SDFT's single-stage, integrated training process can often reach better overall performance in less total time.
Figure | SDFT improves the pass@k metric across different values of k, indicating a genuine skill improvement rather than a mere reduction in output entropy.
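For reference, pass@k is usually computed with the standard unbiased estimator (n sampled attempts per problem, c of them correct); a minimal sketch, independent of the paper's code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k attempts
    drawn without replacement from n samples (c of them correct) is correct."""
    if n - c < k:  # fewer than k incorrect samples: every k-subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```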
Despite its broad prospects, SDFT still faces some challenges:
1. Capability dependence: Its effectiveness depends heavily on the base model's own in-context learning ability. For small models, or models with weak in-context learning, the quality of the teacher signal is limited and the method's advantage is less obvious.
2. Language artifacts: The student may occasionally imitate language patterns that the teacher exhibits only because it sees the demonstration (for example, prefacing an answer with “According to the above example...”). Although loss masking of these tokens in the early stage of training can effectively suppress such artifacts, the phenomenon still deserves attention.
3. Scope of application: SDFT is good at “enhancing” and “adjusting” the model's existing behavior patterns. It has more difficulty with tasks that require completely changing the generation pattern, such as turning a model that does not normally produce chains of thought into a complex reasoning model.
These challenges also point to directions for future work: integrating SDFT more deeply with reinforcement learning; developing auxiliary techniques to further reduce forgetting; and extending it to messier but realistic continual learning settings such as non-expert demonstrations, noisy data, and more open-ended user interactions, so as to make AI's continual learning ability more robust and practical.
This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao). The author is Academic Headlines. It is published by 36Kr with authorization.