In "The Last Exam for Agents", Fable 5 unexpectedly loses to GPT 5.5

量子位 2026-06-12 12:54
Yet neither model scored a single point at the hardest difficulty level.

Nobody expected the slap in the face to come this quickly!

Just now, UC Berkeley released a brand-new benchmark called "The Last Exam for Agents".

It brought today's strongest AI agents into the "exam hall" and had them do real work:

Build 3D models in Siemens NX, create game scenes in Unreal Engine, and do visual-effects compositing in Adobe After Effects.

The results were astonishing:

At the most difficult level, the two models widely regarded as the strongest, Claude Fable 5 and GPT 5.5, both scored a flat zero.

What if you lower the difficulty a bit? Then there were scores, but the results were still surprising:

GPT 5.5 actually narrowly defeated Claude Fable 5.

Did I hear that right? Claude Fable 5, the strongest model Anthropic has just released, lost to GPT 5.5, which came out a few months ago?

Keep in mind that on almost every mainstream benchmark so far, Fable 5 has clearly outperformed GPT 5.5: 80.3% vs. 58.6% on SWE-Bench Pro, and 64.5% vs. 52.2% on Humanity’s Last Exam.

But in this "real-work" exam, the situation was reversed.

This new benchmark is called Agents’ Last Exam (ALE). The team behind it has quite a track record: they previously proposed well-known benchmarks such as MMLU, MATH, CyberGym, and ExploitGym.

The name is probably a nod to Scale AI's "Humanity’s Last Exam". This time, however, what is being tested is not the limit of human knowledge but the limit of what AI agents can do.

To be honest, after this evaluation came out, those who used to shout "Agents will replace human jobs" every day have really gone silent...

"The Last Exam for Agents", and the winner is actually GPT 5.5!

Let's first look at the complete leaderboard.

On the core metric, task pass rate, GPT 5.5 took both first and second place:

First place went to GPT 5.5 paired with OpenAI's own Codex framework, with a pass rate of 24.0%.

Second place was again GPT 5.5, this time paired with the ALE Claw framework, with a pass rate of 23.0%.

(ALE Claw is a baseline agent written by the team itself; it competes alongside commercial frameworks such as Codex, Claude Code, and Cursor CLI.)

Claude Fable 5 does not appear until third place: paired with Claude Code, it reached a pass rate of 22.0%.

It gets even more interesting as you look further down.

The 4th, 5th, and 8th places were all GPT 5.5, just paired with different frameworks.

GPT 5.5 appears 5 times in the top 10. Add GPT 5.4 in 6th place, and OpenAI models occupy 6 of the 10 spots.

And what about the Claude family?

Fable 5 took 3rd place, Opus 4.7 took 9th (18.4%), and Opus 4.8 came in at the bottom of the top 10 in 10th (15.8%). The disadvantage is obvious.

No wonder OpenAI researchers posted about it as happily as if they were celebrating the New Year.

Beyond the scores, there are also several signals worth pondering.

Firstly, the ceiling is surprisingly low.

The pass rate of the champion was only 24%, and the highest overall score was only 45.8%.

This means that even under the most lenient "partial score" method, the strongest agent earns less than half of the available points.

And all these questions come from projects already completed by real-life experts, so the completion rate of human experts is theoretically 100%.
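To see how a pass rate in the twenties and an overall score in the forties can come out of the same run, here is a minimal Python sketch. It assumes, hypothetically, that each task is graded against a list of binary checks, that a task passes only when every check succeeds, and that the overall score is the average of per-task fractions. The article does not spell out ALE's actual formula, so the function names and the sample data are illustrative only.

```python
from typing import Sequence


def pass_rate(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where every check succeeded (strict, all-or-nothing)."""
    return sum(all(checks) for checks in results) / len(results)


def partial_score(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-task fraction of checks that succeeded (lenient, partial credit)."""
    return sum(sum(checks) / len(checks) for checks in results) / len(results)


# Hypothetical run over four tasks; each inner list holds one task's check outcomes.
run = [
    [True, True, True],          # fully solved
    [True, False, False],        # one of three checks met
    [True, True, False, False],  # half of the checks met
    [False, False],              # nothing met
]

print(f"pass rate:     {pass_rate(run):.1%}")
print(f"partial score: {partial_score(run):.1%}")
```

In this made-up run only one task in four passes outright, yet partial credit lifts the overall score to almost double the pass rate. That is the same kind of gap the leaderboard shows between its two headline numbers.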

Secondly, Claude burns an astonishing amount of money.

This leaderboard adds a new "Estimated Total Cost" column, which immediately exposes the gap between the big spenders and the frugal:

Fable 5 spent $2315 to complete all the tasks, Opus 4.8 spent $1838, and Opus 4.7 also spent $1144.

And what about GPT 5.5?

Even its most expensive configuration, Codex, cost only $566, and Cursor CLI cost only $174.

That is to say, Fable 5 spent more than four times as much as Codex yet scored two percentage points lower.
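The comparison is easy to check with a few lines of Python. The costs and pass rates below are the figures quoted in this article; the "dollars per pass-rate point" ratio is an illustrative metric added here, not a column on the leaderboard.

```python
# Figures quoted in the article: (estimated total cost in USD, pass rate in %).
runs = {
    "Fable 5 + Claude Code": (2315, 22.0),
    "GPT 5.5 + Codex": (566, 24.0),
}

fable_cost, fable_pass = runs["Fable 5 + Claude Code"]
codex_cost, codex_pass = runs["GPT 5.5 + Codex"]

cost_ratio = fable_cost / codex_cost  # how many times more Fable 5 spent
pass_gap = codex_pass - fable_pass    # percentage points in Codex's favor

print(f"cost ratio: {cost_ratio:.2f}x")
print(f"pass-rate gap: {pass_gap:.1f} points")

# Illustrative metric (an assumption, not on the leaderboard): USD per pass-rate point.
for name, (cost, rate) in runs.items():
    print(f"{name}: ${cost / rate:.0f} per point")
```

The ratio comes out just above 4, which matches the "more than four times" claim.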

Thirdly, the efficiency gap is also striking.

ALE Claw completed all the tasks in 47 hours and 20 minutes, and Cursor CLI needed 67 hours.

And what about Opus 4.8? 451 hours, or nearly 19 days.

It scored the lowest in the top 10, took the most time, and still ran up one of the biggest bills (is there really a model that can manage all of these at once?)

Even if we only look at the two top-tier models, Claude Fable 5 and GPT 5.5, GPT 5.5 still has an obvious time advantage.

And the most eye-catching number is still zero.

ALE divides the tasks into three difficulty levels:

Near-Term (solvable in the near future)

Full-Spectrum (comprehensive coverage)

Last-Exam (ultimate challenge)

At the most difficult level, the average pass rate across all mainstream configurations was only 2.6%, and most models, including GPT 5.5 and Fable 5, scored zero.

So the core message of this report card is simple: Don't be deceived by good test scores. When it comes to real work, the truth will be revealed.

A good test-taker is not necessarily a good doer, and that applies in the AI world too.

What is ALE?

To understand why ALE can expose these "straight-A students", we first need to see how it differs from previous exams.

The earlier Humanity’s Last Exam (HLE) was launched by Dan Hendrycks and Scale AI in early 2025. It consisted of 2,500 difficult interdisciplinary questions. In essence, it was still a closed-book test:

You are given a question, and you give an answer. No matter how difficult the question is, it is still static knowledge retrieval.

ALE is completely different. It tests what you "can do".

The core author, Yiyou Sun, said very straightforwardly on 𝕏:

There are predictions everywhere that AI agents will surpass humans in almost all jobs between 2026 and 2027. So we created this exam to verify this claim.

Each question in ALE comes from a project already completed by a real-life expert, covering 55 sub-industries, including quantitative trading, genomic analysis, aerospace engineering, architectural design, brain imaging, animation special effects, legal research...

The entire system is based on the US federal occupational classification standard (O*NET). In simple terms, the questions are set according to the "real labor market".

The lineup of question-setters is also quite impressive:

More than 300 domain experts from over 100 institutions. On the academic side, there are MIT, Harvard, Stanford, Oxford, Caltech, ETH Zurich. On the industrial side, there are Goldman Sachs, JPMorgan, Meta, Amazon, Adobe, Oracle.

Snorkel AI provided financial support through the Open Benchmarks Grants project.

The exam format is not typing answers to questions but operating a computer directly.

ALE uses what it calls the GCUA framework (Generalist Computer-Use Agent), which gives the agent full GUI and command-line permissions:

It can do everything a human can do on a computer, such as clicking the mouse, typing on the keyboard, writing scripts, and browsing the web.

There are no restrictions on method; only the results matter.

The submitted "assignments" are automatically scored by deterministic code.

No vibes. No human judges. Fully reproducible.

This closes a long-standing loophole in many previous benchmarks: the scoring system itself could be fooled.
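As a rough illustration of what "scored by deterministic code" can look like, the Python sketch below grades a submitted artifact with fixed, rule-based checks and no model or human in the loop. The task, the file format (a JSON object with hypothetical `volume_mm3` and `part_count` fields), the target values, and the tolerance are all assumptions made for this example; the article does not describe ALE's actual graders.

```python
import json
import math


def grade(submission_json: str) -> dict:
    """Run fixed, rule-based checks; the same input always yields the same verdict."""
    try:
        data = json.loads(submission_json)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return {"passed": False, "checks": {"valid_json_object": False}}

    volume = data.get("volume_mm3")
    checks = {
        "valid_json_object": True,
        # Numeric check with an explicit 1% relative tolerance (hypothetical target).
        "volume_within_1pct": isinstance(volume, (int, float))
        and math.isclose(volume, 12500.0, rel_tol=0.01),
        # Exact structural check (hypothetical requirement).
        "has_four_parts": data.get("part_count") == 4,
    }
    return {"passed": all(checks.values()), "checks": checks}


good = '{"volume_mm3": 12480.0, "part_count": 4}'
bad = '{"volume_mm3": 9000.0, "part_count": 4}'

print(grade(good)["passed"])  # True: both checks are satisfied
print(grade(bad)["passed"])   # False: the volume is far outside the tolerance
```

Because every check is an explicit comparison against a fixed target, there is no judge to persuade: an agent can only raise its score by changing the artifact itself.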

In addition, ALE has a tough measure to prevent cheating:

Only about 10% of the questions (roughly 150) are made public, while more than 1,300 are kept strictly secret.

The public and private questions are regularly rotated to ensure that no model can get high scores by "memorizing questions".

At a time when benchmark data contamination is widespread, this is quite an ingenious design.
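One way such a rotating public/private split could be implemented is sketched below in Python. This is an illustration under stated assumptions, not ALE's published procedure: each task ID is hashed together with a hypothetical rotation label, and tasks whose hash falls in roughly the lowest 10% of the range are public for that rotation.

```python
import hashlib


def is_public(task_id: str, rotation: str, public_fraction: float = 0.10) -> bool:
    """Deterministically assign a task to the public set for a given rotation."""
    digest = hashlib.sha256(f"{rotation}:{task_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < public_fraction


# Hypothetical pool of 1450 task IDs (the article cites about 150 public
# and more than 1,300 private questions).
tasks = [f"task-{i:04d}" for i in range(1450)]

for rotation in ("2026-Q2", "2026-Q3"):
    public = [t for t in tasks if is_public(t, rotation)]
    print(rotation, len(public), "public tasks")
```

Changing the rotation label reshuffles which tasks are exposed while keeping the public share near 10%, so a model that memorized one public set gains little on the next one.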

Overall, compared with existing Agent benchmark tests, ALE's positioning is very clear.

Dawn Song, one of the team members, specifically made a comparison:

The CLI subset of ALE (ALE-CLI) covers 40 sub-industries, while Terminal-Bench covers only 6 and SWE-Bench Pro only 5;

The time humans need to complete these tasks ranges from a few hours to a few weeks, while for the latter two it ranges from a few minutes to a few days;

The pass rate of the strongest agent on ALE-CLI is only 25.2%, while on Terminal-Bench it is 82.0% and on SWE-Bench Pro it is 59.1%.

In a nutshell, other exams are nearly saturated, while ALE still has a long way to go.

This is why ALE dares to call itself "The Last Exam for Agents".

It's worth mentioning that Dawn Song also shared two interesting observations:

One is that Ag