
Apple flatly claims that AI can't think. A former OpenAI executive fires back: AGI is practically here, so stop being sour.

新智元 | 2025-06-30 14:48
AI reasoning now faces a dispute over whether it has hit a bottleneck: Apple questions the new reasoning models, while a former OpenAI research head is optimistic that AGI is approaching.

Recently, a paper from Apple has stirred up a storm, challenging the fundamental assumptions of current AI reasoning capabilities. Meanwhile, the former head of research at OpenAI asserts that the era of Artificial General Intelligence (AGI) is just around the corner. Who is right? How far away is AGI?

Recently, Apple published a paper that sparked a heated debate about whether AI is truly capable of reasoning.

It poses a pointed question: Have current reasoning models reached the ceiling of their capabilities?

Meanwhile, Bob McGrew, the former head of research at OpenAI, holds a completely different view. He said on the podcast "Training Data" that the key breakthroughs required for AGI have already been achieved, and that 2025 will be the first year of the AI reasoning era.

Is this kind of discussion a necessary reflection, or just "sour grapes" born of technological anxiety? Is Apple puncturing an illusion, or is it just a "sour apple"?

Has AI reasoning hit a bottleneck?

AI is standing at an important crossroads.

In recent years, language models have been advancing rapidly. Now, a new generation of "reasoning models" has emerged, such as OpenAI's o1, DeepSeek-R1, and Claude 3.7 Sonnet Thinking.

They are no longer just about scaling up. Instead, they claim to incorporate more sophisticated "thinking mechanisms": they allocate computation more flexibly during inference, aiming to break through the ceiling of traditional models.

It sounds impressive, but many rigorous studies have also pointed out that AI may have hit a bottleneck in its capabilities.

This not only questions their current performance but also makes people worry: Can reasoning models continue to evolve?

The promises of reasoning models

Compared with previous language models, Large Reasoning Models (LRMs for short) are completely different.

In the past, models mainly relied on predicting the next word, while reasoning models have learned three "superpowers":

(1) Chain of thought: they can reason step by step like humans (for example, writing out the steps when solving a math problem);

(2) Self-reflection: they can check whether their own answers are correct;

(3) Intelligent allocation of compute: they automatically "think harder" when they hit a difficult problem (a minimal code sketch of all three follows below).
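To make the three mechanisms concrete, here is a minimal Python sketch. It assumes a hypothetical generate(prompt, max_thinking_tokens) function standing in for any reasoning-model API; the difficulty heuristic and token budgets are invented for illustration, not taken from any real system.

```python
# Minimal sketch of the three mechanisms above. `generate` is a hypothetical
# stand-in for any reasoning-model API; the difficulty heuristic and token
# budgets are invented for illustration.

def estimate_difficulty(question: str) -> int:
    """Crude stand-in for a difficulty estimate: longer questions count as harder."""
    return min(len(question.split()) // 10 + 1, 5)

def solve_with_reasoning(question: str, generate) -> str:
    # (3) Intelligent allocation of compute: harder questions get a larger
    # "thinking" budget before an answer is committed.
    budget = 512 * estimate_difficulty(question)

    # (1) Chain of thought: ask for explicit intermediate steps.
    draft = generate(
        f"Question: {question}\nThink step by step, then give the final answer.",
        max_thinking_tokens=budget,
    )

    # (2) Self-reflection: ask the model to check its own draft and revise it.
    return generate(
        f"Question: {question}\nDraft solution:\n{draft}\n"
        "Check each step for errors and output a corrected final answer.",
        max_thinking_tokens=budget,
    )
```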

The key idea is simple and convincing:

Don't humans solve complex problems by thinking and reasoning step by step?

If we let AI learn this approach, won't it become smarter and better at solving problems?

Facts have proved that this is indeed the case: OpenAI's o1 model set new records on math benchmarks, leaving its predecessors far behind. Other reasoning models have also made rapid progress on tasks such as writing code and conducting scientific research.

The entire AI community is excited, and people think that the "new paradigm" has arrived:

In the future, we don't have to rely solely on throwing money and piling up data for training. By giving AI more time to "think", we can unlock brand-new capabilities!

These exciting developments also raise a real-world question: Are they really as powerful as we expect?

Reality check: Can reasoning models really deliver?

Although reasoning models seem to have a bright future, tests from three independent research teams have poured some cold water on the optimism:

Under strict conditions, the real performance of these models has exposed many problems, but they have also shown some progress.

These three tests are as follows:

(1) Apple's controlled experiment;

(2) A test of AI's planning ability by Arizona State University;

(3) The ARC test partially negates the idea that "bigger models are always stronger".

Apple's controlled experiment

Currently, Apple's paper "The Illusion of Thinking" is the most controversial.

They focused on game-like puzzles, such as the Tower of Hanoi, checker-jumping puzzles, and river-crossing puzzles.

The advantage of this approach is that they can adjust the difficulty at will and prevent AI from cheating by "memorizing question banks".
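As a concrete illustration of why such puzzles make controllable benchmarks, here is a small Python sketch of the Tower of Hanoi: the number of disks is a single difficulty dial, and the optimal solution has exactly 2^n - 1 moves, so each extra disk doubles the required reasoning depth. This is a toy sketch of the puzzle itself, not Apple's evaluation code.

```python
# Toy sketch: the Tower of Hanoi has a single difficulty dial (the number of
# disks), and the ground-truth optimal solution is checkable, so memorized
# answers can be ruled out by simply adding disks.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks (length 2**n - 1)."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # move n-1 disks out of the way
        + [(src, dst)]                       # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst)  # move the n-1 disks back on top
    )

for n in range(1, 11):
    assert len(hanoi_moves(n)) == 2 ** n - 1
    # 1, 3, 7, 15, ... - each extra disk doubles the required solution length.
```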

They discovered three distinct states, which are very inspiring for understanding reasoning models:

Low-complexity tasks: traditional language models perform better and consume fewer tokens, indicating that the reasoning mechanism is not always beneficial;

Medium-complexity tasks: reasoning models have a clear advantage, showing that they do have real capabilities beyond template matching;

High-complexity tasks: the performance of all models collapses across the board, which may be due not to "insufficient computing power" but to a structural bottleneck.

Paper link: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

Even more strangely, the researchers also found a puzzling phenomenon:

The more difficult the problem, the more "lazy" these reasoning models become, and they invest less "mental effort" instead of more.

This is like a student who, when faced with a hard problem, doesn't try a few more calculations but simply puts down the pen and gives up.

Of course, this is not all bad news:

At least on medium-difficulty tasks, reasoning models can, to some extent, break out of the old "rote memorization" mode of LLMs.

Evidence of planning ability

As early as last year, Professor Subbarao Kambhampati from Arizona State University and others conducted in-depth research on the "planning ability" of reasoning models.

Subbarao Kambhampati, currently a professor at the School of Computing and Augmented Intelligence at Arizona State University

He tested o1-preview using the PlanBench tool, and the results showed that:

In the simple Blocksworld task, the model's accuracy reached as high as 97.8%, a very significant improvement.

Performance and average time consumption of OpenAI's o1 series of large reasoning models and Fast Downward on 600 instances in the Blocksworld, Mystery Blocksworld, and Randomized Mystery Blocksworld domains

Compared with the dismal performance of earlier models, this is a qualitative leap.
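For readers unfamiliar with Blocksworld, the sketch below shows the flavor of the task. It is a simplified toy, not PlanBench's actual PDDL harness or the Fast Downward planner: a state is a set of block stacks, a plan is a sequence of moves that turns the start state into the goal, and a classical search like this brute-force BFS always finds a valid plan, which is the kind of baseline the o1 models are compared against.

```python
# Toy Blocksworld planner (simplified illustration, not PlanBench itself):
# a state is a collection of piles of blocks, and a move takes the top block
# of one pile and places it on another (an empty pile stands in for the table).

from collections import deque

def canon(state):
    """Canonical form: drop empty piles and ignore pile order."""
    return tuple(sorted(tuple(p) for p in state if p))

def blocksworld_plan(start, goal):
    """start/goal: lists of piles, bottom block first, e.g. [["C", "B", "A"], []]."""
    target = canon(goal)
    frontier = deque([(tuple(tuple(p) for p in start), [])])
    seen = {canon(start)}
    while frontier:
        state, plan = frontier.popleft()
        if canon(state) == target:
            return plan
        for i, src in enumerate(state):
            if not src:
                continue
            for j in range(len(state)):
                if i == j:
                    continue
                piles = [list(p) for p in state]
                block = piles[i].pop()   # unstack the top block of pile i
                piles[j].append(block)   # stack it onto pile j
                key = canon(piles)
                if key not in seen:
                    seen.add(key)
                    frontier.append((tuple(tuple(p) for p in piles),
                                     plan + [f"move {block} onto pile {j}"]))
    return None

# Toy example: start with the stack C-B-A (A on top) and two free piles;
# the goal is the reversed stack A-B-C.
print(blocksworld_plan([["C", "B", "A"], [], []],
                       [["A", "B", "C"], [], []]))
```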

But he also pointed out an unexpected phenomenon: even when the model is explicitly told what to do and given the algorithm's steps, its performance does not improve.

This suggests that although these models reason in more sophisticated ways, their reasoning may still differ from human-style logical reasoning.

In other words, they are "reasoning", but in a very different way from humans.

Paper link: https://www.arxiv.org/abs/2409.13373

The ARC benchmark: A touchstone for AI reasoning

To highlight the key gap between what is "easy for humans" and what is "hard for AI", François Chollet, the father of Keras, created the Abstraction and Reasoning Corpus (ARC) and later joined hands with Mike Knoop to launch the ARC Prize.

ARC-AGI-1 test example: the left side shows input/output pairs for understanding the nature of the task; the middle is the current test input grid; the right side contains controls for constructing the corresponding output grid

The task is very difficult: in 2020, the best systems could solve only about 20% of it, and by 2024 the top score had risen to 55.5%. Behind this is the evolution of reasoning models and techniques.

The highest ARC-AGI-1 scores over time

Driven by the ARC Prize, many important techniques have emerged, such as test-time fine-tuning and deep-learning-driven program synthesis.
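To give a feel for what "program synthesis" means here, below is a hedged toy sketch, not the actual method of any ARC Prize entrant: it enumerates short programs over a tiny, invented DSL of grid transforms and keeps the first one that reproduces every training input/output pair. The grids and the DSL are made up for illustration.

```python
# Toy program synthesis over ARC-style grids (illustrative only).

import itertools

# Hypothetical ARC-style task: two train pairs whose hidden rule is a 180° rotation.
train_pairs = [
    ([[1, 0], [0, 2]], [[2, 0], [0, 1]]),
    ([[3, 3], [0, 1]], [[1, 0], [3, 3]]),
]

# Tiny invented DSL of grid-to-grid primitives.
DSL = {
    "identity":  lambda g: g,
    "flip_rows": lambda g: g[::-1],
    "flip_cols": lambda g: [row[::-1] for row in g],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def synthesize(pairs, max_len=3):
    """Enumerate all primitive sequences up to max_len; return the first that fits."""
    for length in range(1, max_len + 1):
        for prog in itertools.product(DSL, repeat=length):
            def run(g, prog=prog):
                for name in prog:
                    g = DSL[name](g)
                return g
            if all(run(x) == y for x, y in pairs):
                return prog
    return None

print(synthesize(train_pairs))   # e.g. ('flip_rows', 'flip_cols'), i.e. rotate 180°
```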

However, there is also a signal that deserves our attention: The ARC test does not buy into the idea that "bigger models are always stronger".

In other words, simply "mindlessly" piling up computing power and parameters is no longer enough to further improve the scores.

This shows that although reasoning models have indeed brought breakthroughs, to achieve human - like general intelligence, the current architecture is far from sufficient.

Future progress may require a fundamentally different way of thinking, or even a reconstruction of the model structure.

Scaling is no longer the only answer.

Converging criticisms: Theory and empirical evidence coincide

These studies are particularly worthy of attention because they happen to confirm the views that scholars such as Gary Marcus have been insisting on for many years.

As early as 1998, Marcus pointed out that neural networks are good at performing within the "trained range", but their performance will plummet once they encounter completely new problems.

Now, a series of empirical studies have provided strong support for his theory.

Marcus even used the phrase "a knockout blow for LLMs" in response to Apple's paper.

It sounds harsh, but it is not an emotional outburst; rather, it is real-world confirmation of views he has long held.

He pointed out the key issue: even if a model has seen thousands of Tower of Hanoi solutions during training, it still cannot solve the puzzle reliably once the setup is changed.

This reveals an essential problem: Memory ≠ Reasoning.

Memorizing the answers does not mean that you really understand the problem.

The "illusion" of progress?

More and more signs indicate that current reasoning models may be more like a form of advanced template matching:

They seem to be "reasoning", but in fact, they are just invoking the solution templates of similar problems in memory. Once the problem changes slightly, their performance will quickly collapse.

This explanation can reasonably account for a series of puzzling phenomena:

Why providing clear algorithm steps does not improve the model's performance;

Why the model reduces "thinking" when facing more complex problems;

Why traditional algorithms always outperform these reasoning models that consume a huge amount of computing power.

But don't jump to conclusions: The progress of reasoning models is real, but it is much more complex.

Although reasoning models have exposed many problems, it does not mean that they are "useless" or "failed".

On the contrary, they have indeed made substantial breakthroughs in many aspects:

There is indeed progress: in tasks such as planning, which models previously could not do at all, they can now provide high-quality solutions, and mathematical and logical reasoning has set many new records;

Performance varies by domain: as long as a model has seen similar reasoning logic during training, it performs much better, for instance in structured tasks like mathematical proofs and code generation;

Architectural problems are exposed: the "abnormal behavior" of models under strict tests is actually very valuable, because it points clearly to where next-generation models should be optimized.

These