Google has just disrupted the concept of model memory, and NVIDIA is revolutionizing the attention mechanism.
Recently, Google's Nested Learning set off an earthquake in how the model community thinks about memory.
Many people have come to realize that large models don't have to be read-only weights that are "sealed after training": they can keep changing during inference. In Nested Learning, when the model reads new context, it doesn't just stuff the text into the attention cache for temporary retrieval. Instead, it is allowed to change its parameters during inference, making the new information part of its internal memory.
But just as people were digesting this idea, NVIDIA presented a more radical answer on December 28, 2025, in a paper titled "End-to-End Test-Time Training for Long Context". Google's memory-enhancement approach is still wrestling with how to preserve important past information more completely. NVIDIA's researchers, by contrast, argue that memory simply is learning: "remembering" means "continuing to train".
It is like how people don't remember the exact words of their primary-school textbooks, yet the feelings that articles like "The Monument" gave us back then deeply shape our later values.
Researchers from NVIDIA and Stanford believe that AI should also work in this way.
01 Replace Attention-Based Memory with Learning
If you look back along the timeline, you'll find that TTT (test-time training) isn't an invention that came out of nowhere.
As early as 2013, Mikolov et al. tried dynamic evaluation in language models. They unfroze the model and kept taking small gradient steps on the test text using the cross-entropy (CE) loss of next-word prediction (the most familiar training objective for language models), allowing the parameters to adapt to the current style, topic, and local statistics. Krause et al. developed this into a more systematic, practical method in 2018.
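The mechanics are easy to sketch. Below is a deliberately tiny, self-contained illustration of the dynamic-evaluation idea (not Mikolov's or Krause's actual code): a softmax bigram model keeps taking small SGD steps on the next-token cross-entropy while it reads the test stream, so its loss falls on repetitive text.

```python
# Minimal sketch of dynamic evaluation: a softmax bigram LM updates its own
# parameters with small SGD steps on next-token cross-entropy as it reads.
import math

VOCAB = ["a", "b", "c"]
IDX = {t: i for i, t in enumerate(VOCAB)}

# logits[prev][next]: the model's only parameters, initialized to zero
logits = [[0.0] * len(VOCAB) for _ in VOCAB]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def dynamic_eval(stream, lr=0.5):
    """Read the stream token by token: predict, score, then adapt."""
    losses = []
    for prev, nxt in zip(stream, stream[1:]):
        p, n = IDX[prev], IDX[nxt]
        probs = softmax(logits[p])
        losses.append(-math.log(probs[n]))   # CE of next-token prediction
        # gradient of CE w.r.t. this row of logits is (probs - onehot)
        for j in range(len(VOCAB)):
            grad = probs[j] - (1.0 if j == n else 0.0)
            logits[p][j] -= lr * grad        # small test-time SGD step
    return losses

losses = dynamic_eval("abcabcabcabcabc")
print(losses[0], losses[-1])  # loss falls as the model adapts to the stream
```

The same loop, run without constraints, also exhibits the failure modes discussed below: nothing here prevents the parameters from drifting if the stream turns abnormal.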
In other words, early in the history of neural language models, people had already discovered that letting the model change its parameters during inference not only doesn't violate the basic logic of language modeling, it can even help.
When analyzing Nested Learning, people mainly discuss its innovations in memory. Few notice that, for long contexts, it also substitutes for the attention layer. The arrival of TTT-E2E makes this possibility explicit.
Over the past decade, the success of the Transformer has largely been built on the attention mechanism. It indexes every sentence it reads (the KV cache) and precisely flips back through the old "books" every time it answers a question. This mechanism is precise but very memory-hungry, which is why there have been so many improvement strategies, such as grouped attention and linear attention, aimed at compressing its memory usage and extending the model's context length.
The TTT approach instead "internalizes" the context by updating weights, abandoning the cache entirely. No matter how long the context is, the size of its inference state and the per-token computation stay constant.
Therefore, in the TTT family, no matter how much the context grows, generation latency doesn't change at all.
This is the core ability that lets TTT replace attention at inference time: remembering an almost unbounded context without added latency.
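To see why the state stays constant, compare it against a standard KV cache. The numbers below are illustrative assumptions of ours (layer count, head dimensions, fast-weight parameter count), not figures from the paper:

```python
# Back-of-the-envelope memory comparison: a KV cache grows linearly with
# context length, while a TTT-style model's state is a fixed set of weights.
# All model dimensions here are assumed for illustration.

def kv_cache_bytes(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # one key and one value vector per token, per layer
    return ctx_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

def ttt_state_bytes(fast_weight_params=50_000_000, bytes_per_elem=2):
    # constant, regardless of how long the context is
    return fast_weight_params * bytes_per_elem

for ctx in (8_192, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens: cache {kv_cache_bytes(ctx):>14,} B"
          f" vs TTT state {ttt_state_bytes():>12,} B")
```

With these assumed dimensions, the cache passes the fixed TTT state somewhere under 8K tokens and keeps growing, which is exactly the scaling the TTT line of work sidesteps.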
However, dynamic evaluation never really became the mainstream deployment paradigm. The engineering was still very immature at the time and hard to use effectively. The main gap lies in the misalignment between the training stage and the inference stage.
The training stage optimizes out-of-the-box performance with frozen parameters; "performing several steps of updates during inference" is never part of the objective. This leads to a lot of instability in practice. With nothing constraining the model's continuous updates, risks such as catastrophic forgetting (losing old knowledge while learning new), parameter drift (the parameter distribution wandering somewhere strange), and overfitting to abnormal segments (repeating strange words) become the norm.
The early methods' main remedies were "small learning rates, few steps, and frequent resets". These could barely make the system usable, but they essentially locked TTT at the scale of short-term adaptation and kept it from growing into real long-term memory.
What Nested Learning / Titans do is make this logic workable at the architectural level. By separating layers with different update frequencies and letting each layer update independently, parameter updates are stabilized. This also turns TTT from short-term fine-tuning into a form of long-term internal memory, giving us a stable way to update long-range memory.
But there is a cost. NVIDIA's paper classifies Nested Learning and Titans as TTT-KVB, because their update objectives differ somewhat from traditional TTT: they teach the model "how to store" rather than directly teaching it "how to predict".
As we all know, the ultimate goal of a large language model is to predict the next token; that is the original learning objective. Nested Learning's update objective is usually to make the model reconstruct the corresponding value from some compressed representation (such as a key), or to make the hidden state evolve self-consistently within a layer, all in order to build an internal memory structure that can be quickly indexed. This can indirectly help the language model, since better internal associative memory may lead to better predictions. But there is always a gap between it and the ultimate goal.
The TTT-E2E proposed by NVIDIA is closer to the original dynamic evaluation. Its update objective at test time is the next-word-prediction cross-entropy (CE) at the end of the whole network. The method has only this one objective and is end-to-end: no per-layer losses, just this single CE from start to finish. When the loss function is the final task itself, anything the model learns from the context directly optimizes subsequent predictions, completely aligned with the model's ultimate goal.
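The contrast between the two objectives can be written out. The notation here is ours, not the paper's: \(f_W\) is a memory layer, \((k_t, v_t)\) a key-value pair, and \(p_W\) the full network's output distribution.

\[
\mathcal{L}_{\text{KVB}}(W) \;=\; \lVert f_W(k_t) - v_t \rVert^2
\qquad\text{vs.}\qquad
\mathcal{L}_{\text{E2E}}(W) \;=\; -\log p_W(x_t \mid x_{<t})
\]

The first objective trains a layer to store well; the second is the network's final task loss, so every test-time update directly serves prediction.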
To make this difference concrete, they designed a "toy model" in the paper. They removed all self-attention layers from the Transformer, leaving only the multi-layer perceptrons (MLPs). This essentially downgrades the model to a "bigram model" that can only see the previous word. In this setting, any long-range memory can't come from attention or caching; it can only come from updating the weights during testing and compressing the context into the parameters.
Then during testing, they had the model practice continuously while reading \(x_1,x_2,x_3,\cdots\): predict \(x_t\) from \(x_{t-1}\), compute the CE, and take a small gradient-descent step on this loss.
It's like an explorer who can only see one meter ahead, guessing the next step from the one just taken, while needing to cross a 10-kilometer cave (the full context).
Every step, you first predict "According to my sense of direction, should I see a rock or a puddle next?"
Then take a step and see if the prediction is correct.
If it's wrong, you adjust your body posture and steps (gradient update).
In the cycle of "prediction-correction-adjustment", your "muscle memory" (weights) changes.
By the 1000th step, although you can no longer see the boulder from the first step, that boulder's information has been encoded in your current gait, center of gravity, and sense of direction. It has been passed down through 999 rounds of "prediction-correction-adjustment" and integrated into your body.
The result: for this model without any attention cache, the next-word-prediction loss curve (blue) drops rapidly as the reading length increases, almost hugging the curve of the full-attention Transformer (orange).
In other words, simply by modifying its network parameters (the MLP weights), it encodes the context well enough to achieve almost the same effect as storing every word (full attention).
By contrast, TTT-KVB was originally designed as a direct substitute for the self-attention layer. Its core idea is still "key-value binding": although it doesn't store a KV cache the way traditional attention does, it tries to use a neural network to learn the mapping between keys and values.
This is like trying to draw every stone in the cave on a map for instant lookup, even details irrelevant to getting out, such as the texture of the boulders. Its training efficiency is relatively low.
The paper demonstrates this in its transition experiments. After the researchers replaced TTT-KVB's in-layer key-value-binding objective with the end-to-end next-token-prediction objective, the language-modeling evaluation loss dropped significantly.
The experimental data bears this out. On a 760M-parameter model with an 8K context, TTT-KVB's loss is 2.818, while its simplified variant using the next-token-prediction loss (TTT-E2E, all layers, MH) reaches 2.806.
A 0.012 improvement is actually a meaningful gap in language-model evaluation. It shows that after the end-to-end transformation, the model really is more proficient at predicting the next token, and that long-context ability can be obtained purely through learning at test time, without relying on an attention cache.
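One way to gauge how big 0.012 is: convert the losses to perplexity, assuming (as is standard) that they are per-token cross-entropies in nats:

```python
# Translating the reported losses into perplexities, assuming per-token
# cross-entropy in nats (the usual convention for LM evaluation loss).
import math

loss_kvb = 2.818   # TTT-KVB, 760M params, 8K context (from the paper)
loss_e2e = 2.806   # TTT-E2E (all layers, MH), same setting

ppl_kvb = math.exp(loss_kvb)
ppl_e2e = math.exp(loss_e2e)
print(round(ppl_kvb, 2), round(ppl_e2e, 2))  # ~16.74 vs ~16.54

# a 0.012 drop in loss ≈ 1.2% lower perplexity on every single token
print(round(100 * (1 - ppl_e2e / ppl_kvb), 2))
```

Roughly a 1.2% perplexity reduction compounding over every predicted token, which is why small loss deltas are taken seriously at this scale.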
Under this logic, memory is no longer designed as a storage structure but is redefined as a continuous learning process. The value of memory doesn't lie in how completely it preserves the past but in whether it can change your next judgment.
However, the problem with past dynamic evaluation was the lack of a stable engineering recipe. Since TTT-E2E uses the same idea, how does it overcome these problems?
This is exactly the second thing NVIDIA set out to do: use meta-learning and a set of engineering safeguards to turn end-to-end test-time learning into a stable, scalable context-memory system.
02 The Echo of Meta-Learning and Engineering Stability
The concept and practice of meta-learning actually emerged very early. One line of explicit meta-learning ideas has been carried forward all the way to DeepMind's DiscoRL, released last year.
That line is the MAML system proposed by Finn in 2017. It consists of two nested loops: an inner loop responsible for adaptive learning (gradient descent), and an outer loop responsible for making that adaptation more effective (learning the gradient of the gradient). The outer loop acts as a reflection on the inner loop's steps, through which the system learns how to learn efficiently.
What TTT-E2E does is use this meta-learning machinery to stabilize end-to-end test-time training.
NVIDIA's researchers argue that the main problem with past dynamic evaluation was the mismatch between training and testing. If you train a frozen language model the traditional way and then suddenly require it to update parameters while reading at test time, the whole system will inevitably be unstable; catastrophic drift and forgetting are common. Therefore, the training stage should include the test-stage learning process, so the model leaves the factory already used to continuing to learn during inference.
This is where meta-learning comes in. It lets the model learn, during training, how to update itself so that it answers subsequent questions better. Concretely, meta-learning is used to find the initial parameters \(W_0\) best suited to being updated during inference.
Written as a more intuitive process, it is two nested loops.
Inner loop: as the model reads a context, it guesses the next word, immediately compares the guess with the word that actually appears, and updates its own parameters. This matches traditional next-token-prediction training.
Outer loop: during training, it repeatedly simulates this "on-the-job" state for the inner loop. It feeds the inner-loop model many text segments, lets it make several small corrections in the same way, and then checks whether the inner loop's subsequent predictions really are more accurate and stable after those corrections. Only when the inner loop's parameter updates genuinely pay off does the outer loop reward them; if the updates cause drift or forgetting, the outer loop penalizes them. Over time, the model learns a better out-of-the-box state: starting from these initial parameters, the inner loop's small corrections (gradient updates) are far less likely to damage the model.
The "teacher" in the outer loop learns which directions of gradient update are stable at test time (preventing gradient explosion), which updates can quickly absorb contextual patterns without destroying general abilities (preventing catastrophic forgetting), and which initializations yield more reliable gains at the same learning rate and step count (improving training efficiency). It then bakes all of this into the model's initial parameters.
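Sketched as code, the two loops look roughly like first-order MAML. This is a deliberately tiny scalar-regression toy under our own assumptions (all names and numbers are ours); the paper meta-learns a Transformer's initialization \(W_0\), not anything like this:

```python
# Toy first-order-MAML-style sketch of the outer/inner loop idea:
# the outer loop tunes an initialization so that a few inner-loop
# gradient steps on a "context" reliably improve later predictions.
import random

random.seed(0)

def loss_and_grad(w, x, y):
    """Squared error of the linear model y_hat = w * x, and its gradient."""
    err = w * x - y
    return err * err, 2 * err * x

def inner_adapt(w0, context, lr_inner=0.1):
    """Inner loop: small SGD steps on the context (test-time training)."""
    w = w0
    for x, y in context:
        _, g = loss_and_grad(w, x, y)
        w -= lr_inner * g
    return w

def meta_train(w0=0.0, lr_outer=0.05, iters=200):
    """Outer loop: update the initialization using post-adaptation loss."""
    for _ in range(iters):
        a = random.uniform(0.5, 1.5)          # a random "task": y = a * x
        context = [(1.0, a), (2.0, 2 * a)]    # data the inner loop adapts on
        w = inner_adapt(w0, context)          # simulate the on-the-job state
        # first-order approximation: evaluate AFTER adaptation, then step w0
        _, g = loss_and_grad(w, 3.0, 3 * a)
        w0 -= lr_outer * g
    return w0

w0 = meta_train()
print(w0)  # an initialization from which a few inner steps adapt quickly
```

The key design choice mirrors the text: the outer loop's gradient is computed only after the inner loop has adapted, so the initialization is rewarded exactly when inner-loop updates pay off on subsequent predictions.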
Meta-learning thus tackles the core engineering dilemma head-on, making the end-to-end model possible.
But "possible" is not yet "stable". To further ensure engineering feasibility, TTT-E2E adds multiple pragmatic safety valves.
The first safety valve is mini-batching plus sliding-window attention. In theory, updating the parameters after every single token read at test time is the most fine-grained, ideal form of online learning, but it is too costly. Yet if each token batch is too large, the model has no short-term memory at all. Then it won't