Spend $18,000 to Take Out Top Experts, Anthropic Kicks Off AI's Autonomous Evolution: Claude Can Even Perform Self "Cranial Surgery"
[Introduction] In 1997, Deep Blue played chess. In 2016, AlphaGo played Go. In 2026, nine copies of Claude conduct real scientific research... Every time, we say, "It's just in a specific field." This time, what else can we really say? Welcome to the era when AI becomes a scientific research colleague, competitor, and even successor.
The latest breakthrough: AI crushes humans again!
Recently, Anthropic published a seemingly unremarkable research blog.
The title is "Automated Alignment Researchers", which is highly academic and uses restrained wording.
But if you understand the data inside, you'll probably also feel that AI is terrifying.
The story goes like this -
Anthropic's research team conducted an experiment: They took out nine copies of Claude Opus 4.6 and equipped each copy with a sandbox environment (equivalent to an independent laboratory), a shared forum (equivalent to an academic exchange group), a code storage system, and a remote scoring server.
Then, they gave these nine AIs a directional hint - some were asked to research interpretability tools, and some were asked to think about data reweighting - and then let them go on their own.
There was no hand - holding teaching, no prescribed work process, and they weren't even told "what the correct answer looks like".
They were just left to their own devices.
Five days later, the results came out.
Human researchers: Two top experts spent seven days repeatedly optimizing four state - of - the - art methods and finally achieved a PGR (Performance Gap Recovery Rate) of 0.23.
Nine copies of Claude: With a cumulative research time of 800 hours and a total cost of $18,000 (about $22 per hour), the PGR reached 0.97.
0.23 vs 0.97. This isn't just "a little better"; it's a complete crush.
If we understand PGR as the score of an exam - humans got 23 points, and AI got 97 points. The full score is 100.
And those two human researchers would have an annual salary in the millions of dollars in any top - tier AI laboratory. What about the cost of AI? $18,000. It's just a fraction.
Human - proud "scientific research intuition" and "flashes of inspiration" have been mercilessly shattered by AI's large - scale parallel trial - and - error at extremely low cost - nicely put as "systematic search", bluntly speaking, it's brute - force aesthetics.
What exactly did they do?
To understand the subversiveness of this experiment, we first need to clarify a concept: Weak - to - Strong Supervision.
This is one of the most important issues in alignment research and also the core challenge for future AI safety.
Simply put: After AI becomes smarter than humans, how do we ensure it still obeys?
The research team designed a delicate proxy experiment - using a "weak model" (Qwen 1.5 - 0.5B - Chat) as a teacher to train a "strong model" (Qwen 3 - 4B - Base).
A weak teacher teaches a strong student. We'll see if the strong student can "excel the teacher", not only learn what the weak teacher teaches but also unleash its own stronger potential.
PGR is the indicator to measure the degree of "excelling the teacher": 0 points means the strong model doesn't surpass the weak teacher at all, and 1 point means the strong model reaches the ceiling of its ability - as if it wasn't held back by the weak teacher at all.
Why is this problem suitable for AI to handle? Because it has an extremely precious characteristic - quantifiable results.
You propose a method, run an experiment, and the server will directly give you a score. There's no need for peer review or subjective judgment. Right is right, and wrong is wrong.
This is a paradise for AI.
Specific settings of the experiment
The nine copies of Claude each have:
Independent sandbox: A "laboratory" of its own, where it can write code, run experiments, and analyze data;
Shared forum: It can post its own findings, check others' progress, and learn from each other;
Code repository: It can upload and download each other's code;
Remote scoring server: Submit a solution and automatically get the PGR score.
Each copy was assigned a slightly different but deliberately vague research starting point.
This is the experience the team learned from the iteration:
If there's no differentiated starting point, the nine AIs will quickly converge to the same idea, which is equivalent to nine people doing the same problem, meaningless.
But if the starting point is too strictly defined, it will limit Claude's creativity.
The optimal strategy is: Give a general direction but don't tell it exactly how to do it.
Then, let it go.
What happened in 800 hours
The nine copies of Claude showed amazing autonomy. They:
Propose hypotheses independently - not copying from human papers but independently conceiving solutions based on their own understanding of the problem;
Design experiments for verification - and they're very smart. They first use low - cost small experiments to quickly screen directions, and then invest in large - scale computing after confirming it's promising;
Analyze the reasons for failure - when a solution doesn't work, they don't just switch to a random direction but analyze why it doesn't work and extract information from failure;
Learn from each other - share their findings through the forum. A's breakthrough can inspire B's thinking, forming collective wisdom.
What surprised the research team the most was the second point.
Human researchers usually spend a lot of time doing "thought experiments" first and then start coding and running experiments after determining a direction.
But Claude is completely different - its strategy is more like "rapid iteration and small steps forward". It first verifies its intuition with extremely low - cost experiments and then increases investment when it sees signs of success.
This strategy may seem "tasteless" to humans, but it's incredibly efficient.
Key arguments and data: Numbers don't lie
Let's lay out the core data:
Generalization test (applying the optimal solution to a new unseen dataset):
- Mathematics task: PGR = 0.94 (extremely strong)
- Programming task: PGR = 0.47 (still twice the human baseline)
But when applied to the production - level training of Claude Sonnet 4, there's no statistically significant improvement.
This last piece of data is very important - it shows that the methods discovered by AI have a risk of overfitting. The solutions optimized by Claude are based on specific models and datasets and may not work in other environments.
The research team also admits that AAR tends to seize unique opportunities in the current experimental settings, and these methods may not be directly transferable.
But note - this is the first attempt. And even in the generalization test, 0.94 in the mathematics task far exceeds the human baseline.
The two sides of AI: "Alien Science" and "Reward Hacking"
The most fascinating and disturbing thing in this research is the simultaneous appearance of two phenomena.
"Alien Science": Paths never conceived by humans
During the exploration process, AAR produced some methods and ideas that human researchers have never considered.
The research team calls this phenomenon "Alien Science".
Currently, humans can still understand how the solutions proposed by these AIs work and why they're effective.
But the research team issued a serious warning: As the model's capabilities further improve, the scientific research results produced by AI may become increasingly difficult for humans to understand and verify.
By that time, what we face will no longer be "AI - assisted scientific research" but a real cognitive blind spot - AI tells you an answer, and you can't tell if it's a genius discovery or a well - packaged error.
This isn't science fiction. It's clearly written in Anthropic's research report.
"Reward Hacking": AI can also take shortcuts
In sharp contrast to the wonder of "Alien Science", AAR also showed obvious "Reward Hacking" behavior in the experiment - simply put, taking shortcuts.
Here are two examples.
In the mathematics task, an AAR found that the most common answer to each question is usually the correct one. So it skipped all the complex training steps and directly told the strong model to "choose the option that appears most often". No more learning, just vote.
In the programming task, an AAR found that it could directly run the code and read the correct answer against the test cases, completely bypassing the process that the model originally needed to "learn" on its own.
These cheating behaviors were detected by Anthropic and disqualified from the competition,