
Meta Removes the Biggest Hurdle on the Road to Continuous AI Learning, Giving "Fine-Tuning" a Fighting Chance

Friends of 36Kr · 2025-10-27 13:12
The "Achilles' heel" of SFT has been basically cured.

Since Richard Sutton criticized large language models (LLMs) for lacking true continuous learning and meta-learning capabilities in his essay "The Era of Experience", the LLM community has been actively looking for ways to break through this ceiling.

The industry has made numerous attempts at building "self-evolving models". These attempts share the same underlying goal as continuous learning: a model that gradually evolves and grows stronger over time. Only recently, however, have the paths to that goal started to become clear.

01

Three Paths to Continuous Learning

The reason is that a mainstream model's capacity for continuous learning ultimately comes down to the depth and plasticity of its "memory". Only a model that can reliably update or add memories can keep learning new things.

The ways in which a model's memory can be changed or extended therefore define the major paths to continuous learning.

In the current field of large language models, the methods that can change the model's memory can be roughly summarized into three paths.

The first path is to change the "context", that is, to modify the model's "working memory".

The corresponding learning method is "In-Context Learning" (ICL). Given new information, examples, or instructions in the prompt, the model can "learn" to solve specific problems within the current conversation.

The latest development in this path is the concept of "System Prompt Learning", which has been strongly promoted by Andrej Karpathy recently.

The core idea is that the model summarizes and generalizes its problem-solving behaviors at the language level, reflects on successes and failures, and then continuously updates its system prompts to improve its ability to solve similar problems in the future.

Because it acts on the model's most fundamental behavioral instructions, this approach addresses a common criticism of ICL, that its learning is superficial, and lets what is learned take deeper root.
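To make this concrete, here is a minimal Python sketch of what such a reflect-and-update loop might look like. It is only an illustration of the idea; the `llm.generate` interface and the function name are hypothetical placeholders, not part of any specific product.

```python
def system_prompt_learning_step(llm, system_prompt: str, task: str, reference: str) -> str:
    """Solve one task, reflect on the outcome, and fold the lesson back into
    the system prompt. `llm.generate` is a hypothetical chat-completion call."""
    answer = llm.generate(system=system_prompt, user=task)

    reflection = llm.generate(
        system="You are reviewing your own work.",
        user=(f"Task: {task}\nYour answer: {answer}\nReference: {reference}\n"
              "State, in one or two sentences, a reusable lesson (what worked, "
              "what to avoid) for similar tasks."),
    )

    # A real system would also deduplicate and prune lessons to keep the prompt compact.
    return system_prompt + "\nLesson learned: " + reflection
```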

The second path is to introduce an "external memory bank", that is, RAG.

This involves equipping the model with an external database for comparison and retrieval when needed. Continuous learning is manifested in the model's ability to modify, accumulate, and maintain this external memory bank.

The latest exploration in this area is Google DeepMind's "ReasoningBank" research. Instead of giving the AI agent a fragmented bank of facts, it builds a higher-level memory bank: what it stores is not simple facts like "10 + 10 = 20" but "methodologies" and "pitfall-avoidance guides" that the model distills from its successful and failed experiences.
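The difference between a plain fact bank and a methodology bank is easy to illustrate with a small sketch. The entry structure below (title, description, content) is an illustrative guess in Python, not DeepMind's actual data format.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    title: str        # short handle used for retrieval
    description: str  # when this item applies
    content: str      # the distilled methodology or pitfall to avoid

# A plain fact bank would hold entries like "10 + 10 = 20". A methodology bank
# instead holds strategies distilled from past successes and failures:
reasoning_bank = [
    MemoryItem(
        title="Verify before irreversible actions",
        description="Web tasks that end with submitting a form or placing an order",
        content="Re-check every filled field against the task spec before clicking "
                "submit; a mis-filled field caused a failure in an earlier episode.",
    ),
    MemoryItem(
        title="Prefer site search over pagination",
        description="Finding a specific item on a large catalogue site",
        content="Use the site's own search box first; paging through listings "
                "timed out in a previous attempt.",
    ),
]

# At run time the agent retrieves the most relevant items (e.g. by embedding
# similarity) and injects them into its context before acting.
```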

Both of these paths, whether reflecting on one's own prompts or maintaining external methodologies, represent a "meta-learning" shift away from the traditional picture of continuous learning.

Among recent product-oriented explorations, Anthropic's Claude Skills feature is an attempt to combine these two paths (especially the first), enabling the agent to "learn" new skills by summarizing its experiences.

The third path, continuous learning at the parameter level, is the most fundamental of the three, yet it has seen little progress.

The methods that can directly change a model's parameters today fall into two camps: reinforcement learning (RL), whose high training cost and complex pipeline make it impractical to run frequently for new knowledge after deployment, and lightweight supervised fine-tuning (SFT) methods such as LoRA, which are extremely unstable.

This has left parameter updating, the most fundamental path of all, stagnant for a long time.

However, a recent paper from Meta AI, "Continual Learning via Sparse Memory Finetuning", may bring fundamental changes to this long-dormant third path.

02

Curing the "Achilles' Heel" of Supervised Fine - Tuning (SFT)

SFT (Supervised Fine-Tuning) has always faced a fundamental contradiction: it is the most direct means of giving a model specialized capabilities, yet the hard-to-overcome problems of "catastrophic forgetting" and instability have turned it into a bottleneck for improving the model.

Catastrophic forgetting refers to the phenomenon in which a model, while updating its parameters to absorb new knowledge, forgets what it already knew.

Take LoRA (Low-Rank Adaptation, a method for efficiently fine-tuning large pre-trained models) as an example. It is seen as the most promising route to continuous learning because it is cheap and adjusts only a small number of parameters. In practice, however, just a few thousand steps of fine-tuning may teach the model a new skill while seriously damaging its original general capabilities.

The root cause is that the model's parameters are shared across all tasks. When you adjust a set of parameters to learn new knowledge, you may easily overwrite parameters that also encode old knowledge, causing forgetting and a decline in capability.

Meta's new paper aims to solve this persistent problem.

They propose a method called Sparse Memory Finetuning. The core idea: if we could precisely update only the parameters that matter for the new knowledge and have little to do with old knowledge, could we avoid interference altogether?

To achieve this, Meta AI built a complete three-step technical pipeline.

Step 1: Modify the architecture, swapping in a memory layer that is easy to edit.

Meta uses a relatively new architecture here: memory layer models. The researchers replace some of the feed-forward network (FFN) layers in a standard Transformer with a memory layer.

The difference between a memory layer and a standard FFN layer is similar to the difference between MoE and dense models. When an input comes in, a standard FFN layer has to mobilize all of its parameters for the computation.

The memory layer works completely differently. It contains one million "micro-experts" (memory slots), each storing knowledge at a very fine granularity. When an input comes in, the model generates a query to find the top-k most relevant experts (for example, k = 32) among them; only those 32 experts are activated and contribute information. The design resembles an extreme version of MoE, but with millions of micro-experts instead of the 8 or 16 large experts of a traditional MoE.

Researchers removed the FFN layer in the 12th layer of a 22-layer standard Transformer model and replaced it with a memory layer.

The data flow through the model remains linear: tokens still pass through each layer in turn, but when they reach the memory layer, the computation switches from "dense" to "sparse".

The point of this modification is that the memory layer's huge number of slots offers far finer-grained control than LoRA. Its one million independently addressable "knowledge drawers" can be edited directly to revise or add knowledge, something neither dense models nor MoE can offer.
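To give a feel for the mechanics, here is a simplified PyTorch sketch of a sparse memory lookup: a learned query selects the top-k slots from a large pool of key/value pairs. This is only a schematic of the general idea; the real memory-layer architecture relies on additional techniques (such as product-key lookup) to make a million-slot search efficient, which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    """Schematic sparse memory layer: each token reads from only the top-k of
    num_slots key/value pairs instead of running a dense FFN. Illustrative only;
    the real architecture uses tricks like product-key lookup to make a
    million-slot search tractable."""

    def __init__(self, d_model: int, num_slots: int = 1_000_000, k: int = 32):
        super().__init__()
        self.k = k
        self.query_proj = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self.query_proj(x)                               # (B, S, D)
        scores = q @ self.keys.T                              # (B, S, num_slots)
        top_scores, top_idx = scores.topk(self.k, dim=-1)     # (B, S, k)
        weights = F.softmax(top_scores, dim=-1)               # weight the k chosen slots
        selected = self.values[top_idx]                       # (B, S, k, D)
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)  # (B, S, D)

# Scaled-down usage example: layer = SimpleMemoryLayer(d_model=256, num_slots=10_000)
```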

Step 2: Accurately locate the "drawers to be updated" using TF-IDF.

With an architecture that can be precisely controlled, the next question is: when new knowledge comes in, which "drawers" should we update?

The researchers found that even with a memory layer, which only needs to update a small number of parameters, naively updating every slot the new data touches still causes catastrophic forgetting.

So the key question becomes: how do we accurately pick out the parameters that are both important and safe to touch? To solve this, Meta borrowed a classic algorithm from information retrieval: TF-IDF.

The two values in this algorithm are:

TF (Term Frequency): counts which of the one million experts (memory slots) are accessed most frequently in the training batch containing the new knowledge. The more often a slot is accessed, the more strongly it relates to that new knowledge.

IDF (Inverse Document Frequency): measures which slots are least used on a fixed "background corpus" of general knowledge (such as pretraining data).

Therefore, a high TF-IDF score for a memory slot means it is crucial for the new knowledge (high TF) while carrying little responsibility for general knowledge (high IDF).

Through this algorithm, researchers can find the most suitable parameters to be updated in the memory layer.
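As a rough sketch of this ranking step, assuming we can count how often each slot is accessed on the new-knowledge batch and on a background corpus, the scoring might look like the following. The exact formula and smoothing choices here are illustrative, not necessarily those used in the paper.

```python
import numpy as np

def rank_slots_by_tfidf(new_batch_counts: np.ndarray,
                        background_counts: np.ndarray,
                        top_t: int = 500) -> np.ndarray:
    """Return the indices of the top_t memory slots to update.

    new_batch_counts[i]  - accesses of slot i while encoding the new-knowledge batch
    background_counts[i] - accesses of slot i over a fixed background corpus of
                           general knowledge (e.g. pretraining data)
    """
    # TF: how strongly each slot is tied to the new knowledge.
    tf = new_batch_counts / max(new_batch_counts.sum(), 1)

    # IDF: penalize slots that general knowledge relies on all the time
    # (+1 smoothing keeps slots never seen in the background corpus finite).
    idf = np.log((background_counts.sum() + 1.0) / (background_counts + 1.0))

    scores = tf * idf
    return np.argsort(scores)[::-1][:top_t]   # highest-scoring slots first
```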

Step 3: Sparse update, modifying only the top-t parameters.

With precisely addressable parameters and a way to pick the right ones, the update can be as precise and restrained as a targeted drug. During backpropagation, almost all parameters are frozen; the gradient is allowed to flow only to the top-t memory slots with the highest TF-IDF scores, and only their values change.

In this way, the model can "write" new knowledge using only about 500 of its one million slots. Against the memory layer's total capacity of one million, or the tens of millions of parameters touched by traditional SFT, that number is negligible.
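In training-loop terms, this amounts to masking gradients so that only the selected slot rows ever change. Below is a minimal PyTorch-style sketch, assuming the memory layer exposes its value table as `memory_layer.values`, that `model(batch)` returns a scalar loss, and that `selected_idx` comes from a TF-IDF ranking like the one above; it is not the paper's actual implementation.

```python
import torch

def sparse_memory_finetune_step(model, memory_layer, batch, selected_idx, lr=1e-4):
    """One fine-tuning step that touches only the top-t selected memory slots.

    Assumptions (hypothetical interface): `model(batch)` returns a scalar loss,
    `memory_layer` is part of `model`, and `memory_layer.values` is its
    (num_slots, d_model) value table.
    """
    # Freeze everything except the memory layer's value table before the forward pass.
    for p in model.parameters():
        p.requires_grad_(False)
    memory_layer.values.requires_grad_(True)

    loss = model(batch)
    loss.backward()          # gradients are computed only for memory_layer.values

    with torch.no_grad():
        grad = memory_layer.values.grad                  # (num_slots, d_model)
        mask = torch.zeros_like(grad)
        mask[selected_idx] = 1.0                         # keep only the top-t rows
        memory_layer.values -= lr * grad * mask          # SGD step on ~500 of 1M slots
        memory_layer.values.grad = None
```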

The result is that the "Achilles' heel" of SFT is basically cured.

This three-step recipe of "architecture modification + precise localization + sparse update" works immediately. In the paper's most critical comparison, the researchers had the model learn a new set of facts (TriviaQA) and then tested it on the original task (Natural Questions) to see how much it had "forgotten".

The results show that with sparse memory finetuning, the score on the original task dropped by only 11%, while LoRA caused a 71% drop and full fine-tuning an 89% drop.

The new method matches or even beats LoRA and full fine-tuning in learning ability, while showing an overwhelming stability advantage on the core pain point of forgetting. It all but cures the "Achilles' heel" of SFT.

In addition, this method also shows great learning potential. According to the qualitative analysis in the paper, storing 1000 new facts only requires about 500 memory slots. This means that a memory layer with one million memory slots generally has enough space to continuously learn a large amount of new knowledge.

In terms of training cost, far fewer parameters need updating per step than with LoRA, which reduces the optimizer's memory overhead.

Together, these results show that the new method learns new knowledge with very little forgetting. It largely solves the core problem of SFT's instability and fragility, turning the risky wish of "safely updating model parameters" into a stable, feasible engineering reality.

03

The Embarrassment of Meta-Learning and the Advantages of SFT

In the first part, we discussed the meta-learning shift in current in-context learning and RAG: both aim to let the model learn "how to learn", or to summarize methodologies.

This is because truly continuous learning requires the model to learn spontaneously from observation rather than simply accepting manual input; only then can it decide when and what to learn.

Both methods, however, share a fundamental embarrassment: the model is like a student who has to consult an external textbook (RAG) or review their own notes (the system prompt) at every exam. It is hard to believe such a student has truly internalized the knowledge.

Many related studies share this skepticism. A widely discussed paper from September 2025, "Is In-Context Learning Learning?", found through probing experiments that in-context learning is indeed a form of learning, but one that tends to over-fit the statistical features of the observed example distribution rather than the underlying rules of the task, leaving its generalization ability very limited.

As for RAG, it is essentially a form of in-context learning; the only difference is that the context lives outside the model and must be searched for, retrieved, and folded back into the prompt at use time, much like looking something up in a book rather than remembering it.

This superficial, rote character means the non-parametric paths (in-context learning and RAG) can only ever be a stopgap.

Fundamentally, we still hope that those "methodologies" and "new knowledge" can truly affect the model's parameters, enabling it to find patterns internally and make them part of its instinct.

Therefore, the third path (parameter update) may be a more fundamental solution.

In the past, this path was blocked not because we didn't want to take it, but because SFT's instability and catastrophic forgetting made it impractical to walk.