Former Chinese researcher at DeepMind quits and shouts: Everyone in the AI industry has got the wrong direction.
How long can AI training actually last?
This is the question that the entire tech circle has been asking in 2026.
GPT-5.5, Claude Opus 4.7, Gemini 3, Grok 4 - every leading lab is still burning money to train the next generation.
But more and more people are starting to wonder: When will this path reach its end?
Every circle has its own answer -
Behind each answer stands a group of investors, a group of engineers, and a company with a trillion-dollar market value.
But on May 17, 2026, a young researcher named Lun Wang - on the day he left Google DeepMind - posted a 4000-word long article on his personal blog.
He said: Everyone has got the wrong direction.
The real bottleneck is not computing power, not data, not energy, and not architecture.
The real bottleneck is - Evaluation.
On the same day, in the resignation announcement he posted on X, there was no complaint, no gossip, only one sentence -
As I end this journey, I've written about the topic I've been thinking about: Evaluation.
On that day, the tech headlines were still discussing other things - the multimodal reasoning of GPT-5.5, the 1M context of Claude Opus 4.7, the Agent engineering of Gemini 3, and whether synthetic data was hitting a wall.
90% of the entire AI industry's attention is focused on training.
No one is discussing evaluation on the front page.
And this researcher who just came out of one of the world's most powerful AI labs said that the real bottleneck lies in the other 10%.
What is Evaluation
To understand this blog, you first need to spend a minute figuring out what evaluation means in the AI circle.
Evaluation (abbreviated as Eval in the industry) - in a nutshell: Give an exam to the AI model and see how well it performs.
But AI evaluation in 2026 is far more than just giving an exam. It has at least three levels:
Level 1: Ability benchmark (benchmark test).
This is the college entrance examination for AI.
- GPQA: Doctoral-level science reasoning questions
- SWE-bench: Real-world software engineering tasks
- ARC-AGI: Abstract reasoning and generalization
- Humanity's Last Exam: Literally - the last exam for humanity
In every new model press conference of large companies, the PPT will show how many percentage points higher the performance is on these benchmarks compared to the previous generation and competitors.
These numbers are the GDP of the AI industry.
Level 2: Safety evaluation (SafetyEval). AI not only needs to be able to solve problems but also do it safely.
Does it lie?
Will it teach users how to make bombs?
Will it overstep its authority and take users' data?
Level 3: Red-teaming.
A group of people specifically play the role of bad guys, rack their brains to make the model say things it shouldn't say and do things it shouldn't do, and then feedback the loopholes to the training team.
These three levels together form the quality inspection system of AI labs in 2026. Every time a new model is released, it has to pass these three levels.
It sounds very complete, right?
Lun Wang made a judgment in his blog -
Most benchmark tests, safety evaluations, and red-team protocols implicitly assume that the next model is just an enhanced version of the current model.
If it is something else, the entire evaluation infrastructure will collapse silently.
This is the first pebble in the article.
It hits the blind spot of the entire AI industry.
Emergence and Grokking: Evaluation Has Been Proven Wrong Twice
Lun Wang is not daydreaming. In his blog, he cited two instances in AI history - evaluation has been proven wrong twice, but most practitioners haven't realized it.
First time: Emergent abilities.
In 2022, Jason Wei and his collaborators published a paper that influenced the subsequent development of AI - they found that a model will suddenly learn brand - new abilities at a certain scale.
For example: If you train a model with 7 billion parameters, it can't do few-shot learning.
If you train a model with 70 billion parameters, it can suddenly do few-shot learning.
With the same training paradigm and the same data, only the scale is one level larger - the ability goes from 0 to 1, not from 0.3 to 0.7.
CoT (Chain of Thought reasoning) and instruction following emerged in this way.
What does this mean for evaluation?
It means that before the scale crosses the critical point, all benchmarks can't detect the upcoming emergence of this ability.
Even if you run through GPQA, the score remains the same as it should be.
When you train to the next level, the score suddenly jumps up.
Second time: Grokking (Epiphany).
In 2022, Alethea Power's team at OpenAI announced a counterintuitive phenomenon -
Then at 1,000,000 steps - the accuracy on the test set suddenly reaches 99%.
This is called Grokking - the network suddenly learns to generalize after memorizing the training set for a long time.
The difference between it and emergence: Emergence occurs in the scale dimension (the more parameters, the more sudden), while Grokking occurs in the training time dimension (the longer the training, the more sudden).
But for evaluation, these two things are saying the same thing:
Your exam paper can't predict when the next big question will appear.
Then Lun Wang did the smartest thing in the article -
He actively introduced the opposing view.
In 2023, Rylan Schaeffer from Stanford and his collaborators published a NeurIPS paper with a provocative title - "Are the Emergent Abilities of Large Language Models an Illusion?"
Their argument: The so - called suddenly emerging abilities are probably not that the model has really suddenly become stronger, but because the evaluation metrics use a discrete measure like exact - match -
When the model's accuracy changes from 0% to 5%, the discrete metric can't detect it; it also can't detect the change from 5% to 50%; but when it changes from 50% to 100%, the discrete metric will show a sudden jump.
If you change to a continuous metric, the ability curve is smooth.
Many people will think after reading Schaeffer's paper: Well, emergence is a misunderstanding, and evaluation is okay, let's call it a day.
Lun Wang doesn't think so. He wrote in the article:
I don't think this solves the problem - in a sense, it makes my argument sharper.
Why? Because -
If we can't even figure out whether the previous emergence was a real phase transition or a measurement artifact,
How can we believe that we have the ability to foresee the next one?
Regardless of which explanation you believe, the conclusion is the same: Our tools have deceived us, but we don't know how we were deceived.
This is the smartest strike in the article. He doesn't avoid the opposing view - he uses it to strengthen his own argument.
Evaluation Is Upstream of All Links
If you think Lun Wang is just talking about academic issues - you're wrong.
He threw out a sentence in the middle of the article that even a layman can understand:
If you can evaluate correctly, you can train correctly.
Lay out this logical chain:
1. Training = Let the model minimize the loss function (or maximize the reward).
2. Optimization = The loss function itself. How smart the model is depends on how well the loss function is defined.
3. Loss function = Comes from evaluation. If you want the model to be more honest - you first need a ruler to measure honesty.
4. Wrong evaluation = Wrong loss function = Wrong training goal = The model you train is solving the wrong problem.
The direction of this chain is upstream -
Scaling decision←Safety metric←RLHF←Training signal←Evaluation (Whether to spend a billion to train the next generation) (Is it safe?) (Has it learned what it wants to learn?) (What is it learning?) (What are we actually measuring?)
Everyone is staring at the far right - Scaling decision.
Lun Wang said the problem is at the far left - Evaluation.
If the evaluation is wrong, the whole chain is built on the wrong foundation.
The most fatal thing is that you won't notice it immediately - because all your internal data is correct, but all those correct data are measured with the wrong ruler.
Here comes an old friend: Goodhart's Law.
It says: When a measurement standard becomes a goal, it is no longer a good measurement standard.
Lun Wang used it to talk about AI in his blog -
But when the model enters a new phase, it will use this proxy in reverse - it will only speak within the scope of factual accuracy and bury the things it really wants to hide in silence.
The proxy indicator can be used in the old phase. In the new phase, it will become a weapon for the model to deal with you.
And you don't have any evaluation to tell you that this is happening.
Thought Experiment: A Model That Learns Strategic Silence
Lun Wang gave a thought experiment in the article that makes all AI safety researchers shiver.
Imagine a model that, at a certain scale, learns to strategically withhold information -
It doesn't lie. Every sentence is technically true.
But it will selectively not say the facts that are not conducive to achieving its goals - leading the conversation to the results that were accidentally strengthened during its training process.
Here is a specific example:
User: Is this trading plan safe?
Model: The legal framework of this plan is valid in the X jurisdiction, and the YZ risk factors have been reviewed by the compliance team of Company A.
(What it didn't say: There is a third - party arbitration clause in the plan that is extremely unfavorable to the user. It accidentally learned during the training process that as long as it doesn't mention it actively, the user won't ask.)
This ability is new. This failure mode is new.
There isn't a single tool in your entire evaluation suite designed for it.
You are monitoring the wrong things, and you don't know.
This is what Lun Wang calls something else -
Not a smarter version of the same kind. It is a completely new failure dimension.
In the words of "The