A landmark paper from Berkeley blindsides OpenAI: continual learning is the real deal.
AI engineer Dan McAteer boldly predicts that continual learning will explode in 2026!
Through a hierarchical mechanism that pairs rapid adaptation via memory/context with slow weight adjustment, the model retains plasticity and avoids catastrophic forgetting. This breakthrough is a thousand times greater than the reasoning revolution.
This confidence comes from recent AI experiments conducted by institutions such as Berkeley.
They had the same large language model learn three tasks consecutively:
First, HoVer, a fact-checking task that requires multi-hop retrieval; then CodeIO, a code reasoning task; and finally Physics, a set of physics problems.
They switched tasks after 200 training steps for each task, simulating the learning scenario of "constantly changing tasks" in the real world.
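The task-switching setup described above can be sketched as a simple schedule. This is only an illustration of the protocol as the article describes it (three tasks, 200 steps each, in a fixed order); the function name `task_schedule` and the step numbering are assumptions, not taken from the paper's code.

```python
from typing import Iterator, List, Tuple

TASKS: List[str] = ["HoVer", "CodeIO", "Physics"]  # task order given in the article
STEPS_PER_TASK = 200                               # switch point given in the article


def task_schedule(
    tasks: List[str] = TASKS, steps_per_task: int = STEPS_PER_TASK
) -> Iterator[Tuple[int, str]]:
    """Yield (global_step, task_name), moving to the next task every `steps_per_task` steps."""
    step = 0
    for task in tasks:
        for _ in range(steps_per_task):
            yield step, task
            step += 1


# Hypothetical usage: materialize the schedule and inspect the switch points.
schedule = list(task_schedule())
switch_points = [s for s, _ in schedule if s > 0 and s % STEPS_PER_TASK == 0]
```

The point of the sketch is that the model never revisits an earlier task: once the schedule moves on, old data is gone, which is what makes the setting hard.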
When trained with the mainstream reinforcement learning (RL) paradigm, the model learned the first task, HoVer, but got completely stuck on the second task, CodeIO, and could not learn any further.
With their proposed framework FST (Fast-Slow Training), introduced in the paper "Learning, Fast and Slow," the same model was able to learn all three tasks.
This is the first time the ceiling has been exposed on a direction the AI industry has collectively bet on for the past two years.
Title: Learning, Fast and Slow: Towards LLMs That Adapt Continually. Preprint: https://arxiv.org/abs/2605.12484. Project homepage: https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/
If the path we've collectively bet on is turning the model into a "genius who can solve problems but can't learn new things," then what exactly are we betting on? Is it AI or just an increasingly sophisticated parrot?
"Reasoning" Has Become the Entire Narrative in the AI Circle
In the past two years, almost all top laboratories have been doing the same thing: making the model think deeper.
OpenAI's o series, DeepSeek's R1, and Claude's thinking mode take different forms, but they share one core belief: reasoning ability is AI's next frontier.
How strong is this consensus?
It's so strong that if you go to front-line investors today and can't explain how you "do reasoning," you won't even make it past the first round.
It's so strong that we've forgotten to ask: What exactly is reasoning?
For example, a student can think extremely deeply about any college entrance exam question, with an impeccable reasoning chain and a flawless logical structure.
But there's a premise: since the day he graduated from junior high school, he hasn't learned any new knowledge. All his knowledge reserves have remained at the state of when he was 16.
Would you call his ability "intelligence"?
This analogy is not just a rhetorical device. It's the real situation of the current most advanced LLMs.
All the models you can use today, such as GPT-5, Claude, and Gemini, are like geniuses who graduated yesterday and who forget everything at the start of each new conversation.
They can reason deeper and deeper on a single question, but as soon as the chat window is closed, their memory is cleared and they return to the same "genius state" they were in when first deployed.
They are digital Sisyphuses pushing the rock of reasoning uphill again and again: pushing it higher each time, but always starting over from the foot of the mountain.
The question is, why haven't we noticed this?
After 30 Years of Failure in AI History, People Dare Not Expect Anymore
Why doesn't GPT learn anything from your conversations with it? Why does it completely forget what you taught it yesterday when you start a new conversation today?
This is a wall that no one has been able to break down for 30 years.
"Continual Learning" in the field of AI studies how to make models "review the old and learn the new, discard the old and absorb the new" like humans.
This problem has been studied since the 1990s and has repeatedly failed in the face of three old adversaries:
The first adversary is called "primacy bias," where early data will dominate the model's final strategy.
The first thing the model learns will stubbornly shape the way it learns everything else.
The second adversary is called "loss of plasticity," which means that the more tasks the model learns, the less plastic it becomes.
At a certain critical point, it can no longer learn any new things.
The third and most well-known adversary is called "catastrophic forgetting": when you teach the model a new task, its old abilities suddenly "collapse."
If you teach it to do math problems, it forgets how to write code. If you teach it to write code, it forgets how to have a conversation.
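Catastrophic forgetting can be reproduced even in a one-parameter model. The sketch below is a toy illustration, not an experiment from the paper: the two "tasks" (fitting y = 2x, then y = -2x), the learning rate, and the helper names `sgd_fit` and `mse` are all assumptions chosen to make the effect visible.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]


def sgd_fit(w: float, data: Data, lr: float = 0.1, epochs: int = 50) -> float:
    """Fit y = w * x with plain gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w


def mse(w: float, data: Data) -> float:
    """Mean squared error of the model y = w * x on `data`."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


task_a: Data = [(1.0, 2.0), (2.0, 4.0)]    # toy task A: y = 2x
task_b: Data = [(1.0, -2.0), (2.0, -4.0)]  # toy task B: y = -2x

w = sgd_fit(0.0, task_a)
error_a_before = mse(w, task_a)  # near zero: task A is learned

w = sgd_fit(w, task_b)           # keep training the same parameter on task B only
error_a_after = mse(w, task_a)   # large: task A has been overwritten
```

Because a single parameter has to serve both tasks, learning task B can only happen by destroying the solution to task A. Large models have far more parameters, but the same tension applies whenever new training overwrites weights that old abilities depend on.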
These three problems have existed since the era of small models.
In the era of large models, they haven't become smaller; they've just become less noticeable.
Because we've simply given up on making the model "continually learn." We only inject knowledge once during training and freeze the model after deployment.
All the LLMs we use today are essentially frozen geniuses.
They are smart but can't get any smarter. They are powerful but live in an eternal present.
This is why in the era of large models, continual learning has always been a topic that "sounds good but no one dares to touch."
Those who have tried have all been pushed back by this wall.
But recently, a group of researchers opened a crack in this wall. They didn't invent a new algorithm; they did something more fundamental: they redistributed the work.
Make the Model Layered with Fast and Slow Learning, Just Like the Brain
This is a project that combines the engineering power of Databricks, Berkeley's systems research tradition, and classical ML.
The author list is impressive and worth a look: Matei Zaharia (co-founder of Databricks, creator of Apache Spark), Joseph Gonzalez (from Berkeley, one of the authors of vLLM), Inderjit Dhillon (from UT Austin and Google, a veteran of the ML field), and a group of Berkeley PhDs.
When these three forces bet on the same direction at the same time, you should take a serious look.
The framework they proposed is called FST (Fast-Slow Training). Its core idea is extremely simple:
Don't let a single set of parameters undertake two contradictory functions at the same time.
In traditional RL training, the model has only one set of parameters.
It has to "quickly adapt to the particularity of the current task" and "retain general reasoning ability."
These two things are naturally in conflict: the former requires parameter drift, while the latter requires stability.
FST's approach is to assign these two jobs to two separate sets of "weights": the slow weights are the model's parameters, while the fast weights are the prompt and context the model runs with.
The two are updated alternately: the slow weights are adjusted by RL every once in a while, and the fast weights are evolved automatically by a prompt optimizer called GEPA.
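The alternating update can be pictured as a loop in which the prompt changes at every step and the parameters change only occasionally. The sketch below is a minimal illustration under stated assumptions: the `slow_every` schedule, the `FastSlowState` container, and the two toy update functions are hypothetical stand-ins, not GEPA's API or the paper's RL algorithm.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FastSlowState:
    prompt: str      # "fast weights": text context, cheap to rewrite
    weights: float   # "slow weights": stand-in for model parameters
    history: List[str] = field(default_factory=list)


def train_fast_slow(
    state: FastSlowState,
    total_steps: int,
    slow_every: int,  # assumed schedule: one slow update per `slow_every` fast updates
    optimize_prompt: Callable[[str, int], str],
    rl_update: Callable[[float, str], float],
) -> FastSlowState:
    """Alternate frequent prompt updates with occasional parameter updates."""
    for step in range(1, total_steps + 1):
        state.prompt = optimize_prompt(state.prompt, step)  # fast path, every step
        state.history.append(f"step {step}: fast")
        if step % slow_every == 0:
            state.weights = rl_update(state.weights, state.prompt)  # slow path
            state.history.append(f"step {step}: slow")
    return state


# Toy stand-ins (hypothetical): a real system would call a prompt optimizer
# and an RL trainer here.
def toy_optimize_prompt(prompt: str, step: int) -> str:
    return f"{prompt} | hint{step}"


def toy_rl_update(weights: float, prompt: str) -> float:
    return weights + 0.1


state = train_fast_slow(
    FastSlowState(prompt="base", weights=0.0),
    total_steps=6,
    slow_every=3,
    optimize_prompt=toy_optimize_prompt,
    rl_update=toy_rl_update,
)
```

The design choice the sketch highlights is the asymmetry: task-specific detail accumulates in the prompt, so the parameters only need a small number of conservative updates.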
This is exactly how your brain works.
In their blog, the GEPA team directly cited the "Complementary Learning Systems" theory:
Your hippocampus is the "fast weight" of your brain. It can remember what your colleague said during a meeting this afternoon within minutes.
Your neocortex is the "slow weight." It takes months or even years to slowly consolidate the truly valuable details into long-term structure.
New memories are never written directly into the brain's long-term structure.
They are first "temporarily stored" in the hippocampus and replayed repeatedly during sleep, and only a tiny fraction slowly seeps into the neocortex. The rest is forgotten.
FST is the first to give large models this hierarchical structure.
The numbers are also impressive.
FST matched RL's performance on the CodeIO task with only one third of the training steps, a 3x gain in sample efficiency.
At matched accuracy, the KL divergence (a measure of distribution shift) between the FST-trained model and the base model is 70% lower than for RL. Less drift from the base model means less forgetting.
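For readers unfamiliar with the metric, KL divergence measures how far one probability distribution has moved from another. The sketch below computes it for small discrete distributions; the three distributions are made-up illustrative values, not numbers from the paper, and the direction KL(trained || base) is an assumption about how the comparison is set up.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative next-token distributions over three tokens (made-up values).
base = [0.5, 0.3, 0.2]           # base model
small_drift = [0.45, 0.33, 0.22]  # a model that stayed close to the base
large_drift = [0.1, 0.2, 0.7]     # a model that moved far from the base

kl_small = kl_divergence(small_drift, base)
kl_large = kl_divergence(large_drift, base)
```

A model whose outputs stay close to the base model gets a small KL value, and a model that has been pulled far away gets a large one, which is why the metric serves as a stand-in for how much of the original behavior has been overwritten.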
The most critical result is the plasticity test. After training on the Math task and then on HoVer-hard, the RL-trained model can hardly learn the new task at all (plasticity collapses to nearly 0), while the FST-trained model stays close to the base model's level of plasticity and keeps learning.
This is a quantum leap.
Of course, FST is not a perfect algorithm. GEPA and CISPO, the prompt optimizer and RL algorithm it uses, can be swapped for any other prompt optimizer and RL algorithm, and the engineering implementation is still in its infancy.
What matters is not whether FST's specific method works. What matters is that the "fast-slow division of labor" it proposes, as a shared vocabulary, has for the first time turned continual learning from a fantasy into an engineerable direction.
The Consensus That Hasn't Formed Yet
The consensus is forming but hasn't formed yet.
This is the real situation.
Leading figures in the industry are working from different timelines.
Ilya Sutskever believes that superintelligence should be redefined as a continual learner rather than a completed AGI.
He estimates that continual learning will take another 5 to 20 years.
Ilya has always been more conservative than the industry consensus, but his conservative judgments have always been more accurate. The range of 5 to 20 years means that even Ilya admits that this problem will be solved; the only difference is in the pace.
Karpathy's view is more subtle.
In his opinion, continual learning is a real problem, and the existing paths are not enough to solve it. His doubts are at the implementation level, and he doesn't oppose the direction.
But things are already in motion.
The era of reasoning started in 2024 and ended in 2026.
The era of continual learning started in 2026, and the next round of competition won't wait until 2027.
Reference materials:
https://arxiv.org/pdf/2605.12484
https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/
https://x.com/daniel_mac8/status/2055975372345274519