
The world's top large models were severely hit overnight. In the most difficult test, humans scored full marks, while the top-ranked AI only got 0.2%.

新智元 · 2026-03-26 19:58
Tonight, the entire AI community was shaken. The moment the world's most difficult AGI test, ARC-AGI-3, launched, it silenced the world's top AIs. Humans passed with full marks, while the strongest model, Opus 4.6, scored a mere 0.2%. Overnight, AI has been set back to the "primitive age".

Just today, this piece of news shook the entire AI community.

As expected, the world's only unsaturated agent benchmark, ARC-AGI-3, has been released, and it immediately overwhelmed the world's top large models.

In this test, humans scored 100%, while the scores of AI were generally below 1%.

This gap is higher than Mount Everest.

The most painful case is the "star pupil" Opus 4.6: it could still post a lofty 69.2% on the previous generation of the test, yet against ARC-AGI-3 it managed only 0.2%.

This former top student, which once swept the leaderboards, couldn't even scrape together a single point by guessing.

This mirror reveals the deepest crack in current AI capabilities.

In a recent interview, Jensen Huang suggested that we had already achieved AGI. ARC-AGI-3, however, suggests that today's AI may not have reached even 1% of AGI.

How crazy is ARC-AGI-3?

Its predecessors, ARC-AGI-1 and ARC-AGI-2, were already well-known "devil tests" in the AI community.

In those tests, AI needed to observe several examples and then infer the rules of grid transformation to complete new tasks.

Sound easy? Yet these puzzles, which look like kindergarten connect-the-dots exercises, have made countless large models stumble.

With ARC-AGI-3, the difficulty jumps to a whole new dimension: from "static questions" to "interactive games".

There are more than 150 manually designed interactive game environments, containing more than 1000 levels.

Each game has its own internal logic, hidden rules, and clearance conditions. But there is no instruction manual, no natural language prompts, and no one tells you "the left button will open the door" or "collect three red squares to pass the level".

The AI agent is thrown in and can only see the current screen, choose an action, observe the result, and then decide the next step.

It can only grope forward step by step, like the blind men feeling the elephant, gradually piecing together a mental model of "how this world might work".

This is exactly what the ARC Prize Foundation wants to measure, across four dimensions.

Exploration: Can it obtain key information through active interaction with the environment?

Modeling: Can it condense scattered observations into a world model that can predict future states?

Goal acquisition: Without anyone giving instructions, can it judge by itself "what should I aim for"?

Planning and execution: Can it plan an action path and correct it at any time according to environmental feedback?
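The observe-act loop described above can be sketched as follows. This is a minimal toy illustration of the interaction protocol, not the real ARC-AGI-3 API: the `Environment` class, its methods, and the one-dimensional "game" are all hypothetical stand-ins.

```python
class Environment:
    """Hypothetical toy game: reach position 3 on a number line.
    The goal is hidden from the agent, just as in ARC-AGI-3."""
    ACTIONS = (-1, +1)

    def __init__(self):
        self.pos = 0

    def observe(self):
        # The "current screen": all the agent is allowed to see.
        return self.pos

    def step(self, action):
        # Apply the action; return True once the hidden goal is met.
        self.pos += action
        return self.pos == 3

def play(env, max_steps=50):
    """Explore, model, plan, execute: the loop the benchmark tests."""
    steps = 0
    while steps < max_steps:
        state = env.observe()             # exploration: look first
        action = +1 if state < 3 else -1  # crude world model / plan
        steps += 1
        if env.step(action):              # execute, check feedback
            return steps
    return None  # never discovered the goal within the budget

print(play(Environment()))  # reaches the toy goal in 3 steps
```

The point of the sketch is the shape of the loop, not the policy: in the real benchmark nothing tells the agent what "position 3" even means, so the hard part is inferring the goal from feedback alone.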

The "geometric series" humiliation: Where does 0.2% come from?

The scoring standard is equally cruel.

ARC-AGI-3's scoring doesn't ask "did you pass the level" but "how efficiently did you pass", measured against human efficiency.

This is a first in the history of AI benchmarks: scores normalized against human performance on the same tasks.

Inspired by Chollet's "On the Measure of Intelligence", the ARC Prize team operationalized "intelligence" into a conversion rate:

How efficient are you at obtaining information from the environment? How fast can you convert this information into correct actions?

Suppose it takes humans 10 steps to solve this game, and AI takes 100 steps. What is the score of AI?

It's not 10%, but 1%.

The formula is: score = (human steps ÷ AI steps)². If humans take 10 steps and AI takes 100, that's (10/100)² = 0.01 = 1%.

If AI takes 200 steps, this figure is 0.25%; if it takes 500 steps, it is 0.04%.
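These numbers can be checked directly from the squared-ratio formula. One caveat: capping the score at 100% for superhuman efficiency is my assumption for the sketch, not something the article states.

```python
def arc_agi3_score(human_steps: int, ai_steps: int) -> float:
    """Score = (human_steps / ai_steps)^2.
    The cap at 1.0 (superhuman efficiency) is an assumption."""
    return min(1.0, (human_steps / ai_steps) ** 2)

for ai_steps in (100, 200, 500):
    print(f"{ai_steps} steps -> {arc_agi3_score(10, ai_steps):.2%}")
    # 1.00%, 0.25%, 0.04% respectively
```

The quadratic penalty is what kills brute force: doubling the step count doesn't halve the score, it quarters it.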

This blocks all the "brute force" paths of AI.

Previously, AI could brute-force every possible operation and eventually stumble onto the correct path.

But under this scoring system, for each additional step you try, the score drops precipitously.

Now, you know what it means that Opus 4.6 only scored 0.2% —

Suppose humans solve a given game in 10 steps. Then 0.2% = 0.002, whose square root is about 0.0447, so the AI needed roughly 10 ÷ 0.0447 ≈ 224 steps.
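Inverting the scoring formula recovers the implied step count; a quick check of the arithmetic above (the function name is mine, chosen for illustration):

```python
import math

def implied_ai_steps(human_steps: int, score: float) -> float:
    """Invert score = (human/ai)^2 to estimate the AI's step count."""
    return human_steps / math.sqrt(score)

print(round(implied_ai_steps(10, 0.002)))  # ≈ 224 steps
```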

This is no longer "stupid"; it's like spinning in circles in a maze forever.

When this gap is so strongly shown, many people who thought AGI was just around the corner were shocked.

350 steps vs. just a few clicks: The full picture of the report card

Before the official release, ARC-AGI-3 ran a 30-day developer preview.

The three public games range from map navigation to pattern matching to water-level adjustment. The tasks are diverse, but they share one trait: humans find them easy, while AI finds them brutally hard.

More than 1200 human players participated in the test and completed more than 3900 games.

Most people not only passed the levels easily but also had a great time. Some persistent players even "speed-ran" all the way to the theoretically optimal number of steps.

The human baseline is 100%. On the AI side, the scores of all cutting-edge large models are below 1%.

The champion of the preview period was an agent called StochasticGoose, from Tufa Labs.

It is not a large model but an action-learning agent built on a convolutional neural network, which uses simple reinforcement learning to predict which operations will change the screen. It finished at 12.58%, the highest of all participating systems.

But even this champion burned nearly 350 wasted clicks at the start of one water-level adjustment game.

350 steps. Humans only need to click a few times to figure it out.

What's more counterintuitive is that the top three in the leaderboard are all non-LLM solutions — CNN, rule-based state graph exploration, and frame graph search without training.

A CNN-based solution outscores the GPT-5.x series by more than 12 percentage points. Meanwhile, agents wired to cutting-edge large models often land at the bottom of the list, and some even crash frequently.

AI shoots itself in the foot

The ARC team also found a very interesting phenomenon.

One of the main failure modes of AI is: "thinking it is playing another game".

For example, you are blindfolded and thrown into a room.

You touch a round object, so you conclude: "This is a basketball court, and I should shoot a basket." But in fact, what you are holding may be a watermelon, and the room is actually a kitchen.

This is the kind of mistake AI makes.

In a brand-new environment, it sees some initial visual information, then quickly "imagines" a game framework for itself, and then frantically executes the plan along this wrong assumption, going further and further astray.

It won't stop and think: Wait, why don't I seem to be getting positive feedback? Is my assumption wrong?

Because current AI lacks "meta-cognition": it doesn't know what it doesn't know.

This explains why large models end up at the bottom of the list.

The larger the number of parameters and the richer the pre-trained knowledge of the model, the more likely it is to "imagine" a strange environment as something it has seen before and then stick to it.

On the contrary, those lightweight CNN agents and graph search systems can learn honestly from environmental feedback because they don't have the burden of "preconceived notions".