
Is the Transformer dead? DeepMind is betting on another path to AGI.

新智元 · 2026-01-09 10:36
Nested learning breaks through the bottleneck of continual learning in AI and may usher in a new era of AGI.

Drawing on human associative memory, nested learning lets AI build abstract structures while it runs, moving beyond the limitations of the Transformer. The Google team emphasizes that the optimizer and the architecture are contextually intertwined, and only through co-evolution can true continual learning be achieved. This paper may become a classic, opening the door for AI to evolve from passive training to active evolution.

"Catastrophic forgetting," a specter that has haunted the AI community for decades, may be completely resolved this time.

It is no exaggeration to say that AI has advanced at a breathtaking pace over the past year; Google DeepMind's achievements in a single year alone are dazzling.

But if DeepMind were to select the most important research or product in 2025, the recently popular "Nested Learning" would surely be on the list.

After reading the paper, some netizens posted that it is the "sequel" to "Attention Is All You Need".

If Transformer opened the Scaling era, then nested learning may be opening the era of true AGI.

Shane Legg, co-founder of DeepMind, is even more direct: the road to AGI is clear, and nested learning is the latest step along it.

Some netizens even said that if we were to leave a paper for future aliens, it would definitely be this "Nested Learning".

If two or three breakthroughs are needed to achieve AGI, continual learning may be one of them, and Google has published several relevant papers.

Notably, these papers share a common author:

Ali Behrouz, a second-year doctoral student in the Department of Computer Science at Cornell University and a research intern at Google Research (New York).

The Memory Woes of the Transformer

In many respects, the Transformer performs superbly: it scales, it drives AI forward, and it generalizes across tasks and domains.

But Google realized early on that the Transformer is not perfect:

1. Low efficiency in long-context processing

2. Limited levels of abstract knowledge

3. Weak adaptability

4. Lack of continual learning ability

Ali believes the fourth point is the most critical issue.

When we mention "Continual Learning", we mean:

There is no separate training phase and no separate testing phase;

The model continuously shapes new memories and abstract structures during use.

Humans are born with this ability.

But for today's large language models, there is almost no "continual learning".

To illustrate how fundamental the problem is, Ali used a medical analogy: Anterograde Amnesia.

Patients with this disease have a very strange characteristic:

  • Their short-term memory is normal
  • Their long-term memory is also intact

But the problem is: 👉 short-term memory cannot be transferred into long-term memory.

So, they always live in the "present".

New experiences come in and disappear after a while; the world is changing, but their brains no longer update.

Now apply this condition to LLMs.

You will find that large models behave exactly like these patients.

Today's large language models mainly obtain knowledge from two sources:

Long-term knowledge learned during the pre-training phase;

Short-term information in the current context.

But there is almost no channel between the two.

The model cannot naturally consolidate what it has "just learned" into reusable knowledge for the future.

Want it to really learn?

You can only spend more money, train again, and fine-tune again.

This is essentially no different from the state of patients with anterograde amnesia.

The real problem is not too few parameters, not too little data, and not merely insufficient compute.

The essence of the problem is that there is no natural knowledge transfer channel between "short-term memory" and "long-term memory".

If this channel does not exist, the so - called "continual learning" will always be just a slogan.

This leads to a core question: how can we build a mechanism that lets AI models consolidate "present" experiences into "future" knowledge, the way humans do?

All AI is "Associative Memory"

If you want AI to truly have the ability to learn continually, you cannot avoid the most fundamental question:

How does the model "remember things"?

Ali's answer is not the Transformer, nor the number of parameters, but a more primitive, more fundamental concept: associative memory.

So-called "associative memory" is the cornerstone of the human learning mechanism.

Its essence is to associate different events or information through experience.

For example, when you see a face, you immediately think of a name; when you smell a certain smell, it evokes a memory.

This is not logical reasoning, but the establishment of associations.

Technically, associative memory is a key-value mapping:

  • Key: Clue
  • Value: Associated content

But the key point is that the mapping in associative memory is not pre-written; it is learned.
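To make this concrete, here is a toy sketch (ours, not code from the paper) of a linear associative memory: a single matrix trained by gradient descent to map each key to its value.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
keys = rng.standard_normal((6, d))    # clues
values = rng.standard_normal((6, d))  # associated contents

M = np.zeros((d, d))  # the memory: a learned key -> value mapping
for _ in range(300):  # gradient descent on 0.5 * ||M @ k - v||^2
    for k, v in zip(keys, values):
        M -= 0.05 * np.outer(M @ k - v, k)

print(np.linalg.norm(M @ keys[0] - values[0]))  # ~0: the association was learned
```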

From a certain perspective, the attention mechanism is essentially an associative memory system: it learns how to extract keys from the current context and map them to the most appropriate values to generate outputs.
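Read through that lens, one attention step is a soft, differentiable lookup: the query is the clue, and the output is a match-weighted blend of the stored values. A minimal sketch (illustrative only):

```python
import numpy as np

def soft_lookup(query, keys, values):
    """One attention read: retrieve values whose keys match the query."""
    scores = keys @ query / np.sqrt(keys.shape[1])  # how well each clue matches
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over stored items
    return weights @ values                         # blend of associated contents

rng = np.random.default_rng(1)
keys = rng.standard_normal((5, 16))
values = rng.standard_normal((5, 16))

out = soft_lookup(keys[2], keys, values)  # query with a stored clue...
print(np.linalg.norm(out - values[2]))    # ...and get back mostly its value
```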

What happens if we not only optimize this mapping itself, but also let the system meta-learn the initial state of the mapping process?

Building on this understanding of associative memory, they proposed a general framework called MIRAS for systematically designing the memory modules in AI models.

The core idea of this framework is:

Almost all attention mechanisms, local memory structures, and even the optimizer itself can actually be regarded as special cases of associative memory.

To design a "learnable, nested memory system", we need to make four major design decisions regarding the memory structure in the model:

Memory Architecture

Attentional Bias/Objective

Retention Gate

Learning Rule

This framework can be used to uniformly explain many existing attention mechanisms and optimizers.

In simple terms: MIRAS enables us to model, combine, and optimize "memory" as a learning process, rather than just a static module.
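One way to make the four decisions tangible is to treat them as the pluggable parts of a memory module. The skeleton below is only our schematic reading of the framework; the names and the concrete choices (an L2 objective, a constant decay gate, a plain gradient step) are illustrative, not MIRAS's actual API.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class MemoryModule:
    memory: np.ndarray          # 1. memory architecture (here: a plain matrix)
    attentional_bias: Callable  # 2. objective: how well memory fits key -> value
    retention_gate: Callable    # 3. how much of the old memory to keep
    learning_rule: Callable     # 4. how one new association updates the memory

    def write(self, key, value):
        update = self.learning_rule(self.memory, key, value)
        self.memory = self.retention_gate(self.memory) - update

# One concrete instantiation of the four choices:
l2_bias = lambda M, k, v: 0.5 * np.sum((M @ k - v) ** 2)  # objective
grad_rule = lambda M, k, v: 0.1 * np.outer(M @ k - v, k)  # gradient of l2_bias
decay_gate = lambda M: 0.99 * M                           # mild forgetting

mem = MemoryModule(np.zeros((4, 4)), l2_bias, decay_gate, grad_rule)
mem.write(np.ones(4), np.arange(4.0))
print(mem.memory)
```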

Furthermore, the optimizer itself can be uniformly regarded as an associative process that maps the current gradient onto historical information, and it can be remodeled and generalized accordingly.

The optimizer is a "memory module" and a key component for the model to understand its learning history and make better decisions.

The optimization process and the learning algorithm/architecture are essentially the same concept, but they have different contexts (i.e., gradients and data) at different levels of the system.

In addition, they are two interconnected components, where the learning algorithm/architecture generates context (i.e., gradients) for the optimizer. This supports the idea of designing a dedicated optimizer for a specific architecture.
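Momentum is the cleanest example of this reading: the optimizer's state compresses the stream of past gradients, with its own retention factor. Below is classic SGD with momentum, annotated in memory terms (the annotation is ours):

```python
import numpy as np

def momentum_step(w, m, grad, lr=0.01, beta=0.9):
    """One optimizer step, read as an associative-memory update."""
    m = beta * m + grad  # write: fold the current gradient into the memory
                         # (beta acts as a retention gate on gradient history)
    w = w - lr * m       # read: act on the compressed history, not the raw gradient
    return w, m

w = np.zeros(3)
m = np.zeros(3)                        # the optimizer's memory, initially empty
for _ in range(5):
    grad = np.array([1.0, 0.0, -1.0])  # a constant gradient, for illustration
    w, m = momentum_step(w, m, grad)
print(w, m)
```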

Thus, the Google team explored how knowledge transfers between these different levels and proposed Nested Learning (NL).

Nested Learning Cures LLM Amnesia

From the NL point of view, training a deep neural network with backpropagation and gradient descent is essentially a compression and optimization problem: the goal is to train an associative memory that maps each layer's input to the corresponding local error in its prediction.
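For a single linear layer this is quite literal: the backprop update is the outer product of the layer's input and its local error signal, which is exactly the write rule of a linear associative memory. A sketch of that observation (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((3, 5))  # one layer's parameters: y = W @ x

x = rng.standard_normal(5)     # the layer's input (the "key")
delta = rng.standard_normal(3) # local error from the layer above (the "value")

# For y = W @ x, the loss gradient w.r.t. W is outer(delta, x): the update
# associates this particular input with this particular error signal.
W -= 0.01 * np.outer(delta, x)
```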

Correspondingly, they argue that pre-training is a form of in-context learning in which the context is the entire pre-training corpus, and each network layer compresses that context into its parameters. Other popular gradient-based optimizers are, likewise, associative memories that aim to compress gradients into their parameters.

The optimizer and the architecture are not independent of each other. They should be contextually related and co - evolve.

To formally define the Nested Learning framework, two core concepts need to be understood: Update Frequency and Nested System.

I. Update Frequency

First, set a time unit. Usually, one update of the slowest-updating module in the system is taken as the basic time unit.

Then, for other components in the system, we count how many times they are updated within one time unit, and we get their "update frequency".

For example:

Assume a recurrent neural network (RNN) with a context length of L: when processing a text of length L, its state will be updated L times;

And if the outer layer is a module that updates only once per document (such as a pre-trained model), then its update frequency is 1.

Therefore, we can say that the RNN is a "fast module" and the pre-trained model is a "slow module".

Through this combination of different frequencies, we can build a system that can learn on different time scales.
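In code, the two frequencies are just two update sites in one loop: the inner state changes once per token, the outer parameters once per document. A toy sketch (module names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
documents = [rng.standard_normal((L, 4)) for L in (6, 9, 7)]  # token streams

slow_params = np.zeros(4)  # slow module: one update per document (frequency 1)
for doc in documents:
    state = np.zeros(4)    # fast module: one update per token (frequency L)
    for token in doc:
        state = np.tanh(state + token)  # L state updates within one time unit
    slow_params += 0.1 * state          # a single slow update per time unit
print(slow_params)
```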

II. Nested System

Next, we define what a "nested system" is.

It consists of multiple independent sub-modules, and each module has:

Its own parameters (Parameter);

Its own context (Context);

Its own optimization objective (Objective);

Its own gradient flow (Gradient Flow).

These modules form a nested relationship ordered by their update frequencies.

If we regard each sub - module as an associative memory system, then the entire model can be regarded as a Nested Associative Memory System.

Furthermore, each such associative system can itself be composed of smaller optimization sub-processes, forming a recursive nesting.
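Putting the pieces together, a schematic nested system (our illustration, not the paper's formal definition) is a list of such modules ordered by update frequency, each carrying its own parameters, context, objective, and gradient flow:

```python
from dataclasses import dataclass, field
from typing import Callable, List

import numpy as np

@dataclass
class Level:
    params: np.ndarray  # its own parameters
    grad_fn: Callable   # its own objective, via the gradient it induces
    update_every: int   # inverse update frequency (1 = fastest)
    context: list = field(default_factory=list)  # its own context buffer

def run(levels: List[Level], stream, lr=0.01):
    """Each level updates at its own frequency: a nested associative memory."""
    for t, x in enumerate(stream, start=1):
        for level in levels:
            level.context.append(x)
            if t % level.update_every == 0:  # faster levels fire more often
                grad = level.grad_fn(level.params, level.context)  # own gradient flow
                level.params = level.params - lr * grad
                level.context.clear()        # context is consumed on update

# Toy gradient: pull parameters toward the mean of the buffered context.
pull = lambda p, ctx: p - np.mean(ctx, axis=0)

levels = [Level(np.zeros(2), pull, update_every=1),  # fast, like an RNN state
          Level(np.zeros(2), pull, update_every=8)]  # slow, like outer weights
run(levels, [np.ones(2)] * 16)
print([level.params for level in levels])
```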
