GPT-5.2 Outperforms Humans on a Benchmark Exam. OpenAI Warns: Large-Model Capabilities Are Already in Surplus, and the Bottleneck for AGI Is No Longer the AI Itself.
Just now, GPT-5.2 set a new record!
Greg Brockman, co-founder of OpenAI, posted that GPT-5.2 had outperformed the human baseline on the ARC-AGI-2 benchmark.
We are no strangers to the "performance paradox" of large models described by Ilya Sutskever, OpenAI's former chief scientist: models score extremely well on benchmarks yet fall short in real-world use.
This reflects a long-standing problem in AGI evaluation: how to distinguish a model's genuine reasoning ability from test-taking skill acquired by training on similar questions.
ARC-AGI-2 was designed precisely to address this problem.
The full name of ARC-AGI-2 is "Abstraction and Reasoning Corpus for Artificial General Intelligence, Version 2"; it is the latest release in the ARC series of benchmarks.
The benchmark was launched in 2025 by François Chollet (the creator of Keras and a former Google Brain researcher) and his team, with a very clear design goal:
to test whether an AI has the abstraction, induction, and transfer-reasoning abilities AGI requires, rather than mere memorization or statistical pattern matching.
The biggest difference between the ARC series and traditional NLP or multimodal benchmarks is that it ships no large-scale training set: every evaluation task is new and never seen before, so high scores cannot be bought by cramming on massive data.
Instead, an AI must reason the way humans do, inferring a general rule from a handful of examples and applying it to a new case.
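Concretely, each public ARC task is a tiny few-shot puzzle distributed as JSON. Here is a minimal Python sketch of reading one, assuming the standard public ARC task format (the file name is a placeholder, not a real task ID):

```python
import json

# Each public ARC task is a small JSON file: a few demonstration
# input/output pairs under "train", plus held-out pairs under "test".
# Grids are 2-D lists of integers 0-9, each integer encoding a color.
# The file name below is illustrative, not a real task ID.
with open("arc_task_example.json") as f:
    task = json.load(f)

# A solver sees only these few demonstrations -- there is no large
# corpus of similar questions to memorize.
for pair in task["train"]:
    print("input: ", pair["input"])   # e.g. [[0, 1], [1, 0]]
    print("output:", pair["output"])  # the transformed grid

# It must infer the transformation rule and apply it to the test input.
for pair in task["test"]:
    print("test input:", pair["input"])
```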
Chollet has said publicly, many times, that a system that performs well only on data distributions it has already seen does not have the abilities AGI requires.
The ARC benchmark therefore aims squarely at the weak point of large models.
From "Passing" to "Top Student"
A Key Leap
What set the new record is not a single model but a system called Poetiq (GPT-5.2 X-High).
Poetiq is an AI company focused on meta-system architecture.
Its core idea is not to train a bigger model, but to use software-level system design to automatically build a "system that calls models".
Poetiq (GPT-5.2 X-High) reached 75% accuracy on the ARC-AGI-2 evaluation set at a cost of under $8 per task, beating the previous SOTA by 15 percentage points.
Before the Poetiq (GPT-5.2 X-High) system appeared, GPT-5.2 X-High on its own was already very close to the average human level.
On the ARC-AGI-2 leaderboard, average human accuracy is about 60%, and GPT-5.2 X-High scored almost exactly that, the strongest AI reasoning performance on the benchmark at the time.
With Poetiq layered on top, however, GPT-5.2 X-High's score jumped from 60% to 75%: from barely passing (the human average) into the ranks of top students (well above the human average).
The same leaderboard also features Gemini 3 Deep Think (Preview).
Built around "Deep Think" technology, that model scored about 46% on ARC-AGI-2, well behind the GPT-5.2 series, at a slightly higher per-task cost.
Poetiq says the entire process involved no training of, and no task-specific optimization for, GPT-5.2.
That is precisely the point of Poetiq's meta-system: to automatically assemble a complete task-solving system by calling any existing frontier model.
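Poetiq has not disclosed the internals of its pipeline, but the general pattern behind such meta-systems is a propose-verify-refine loop over an unmodified base model. Below is a hypothetical sketch in Python; `call_model` is a placeholder for any frontier-model API, and the program-synthesis framing is one common approach to ARC, not necessarily Poetiq's:

```python
# Hypothetical sketch of a propose-verify-refine loop, the general
# pattern behind "systems that call models". Poetiq has not published
# its exact pipeline; call_model() is a stand-in for any frontier-model
# API, and nothing here is Poetiq's actual implementation.

def call_model(prompt: str) -> str:
    """Stand-in for one API call to an unmodified frontier model."""
    raise NotImplementedError("wire this to your model provider")

def solve(task: dict, max_rounds: int = 5):
    """System-level orchestration: no retraining, only repeated calls."""
    feedback = ""
    for _ in range(max_rounds):
        # 1. Propose: ask the base model to write a Python function
        #    transform(grid) implementing the rule it infers.
        code = call_model(
            f"Demonstrations: {task['train']}\n"
            f"Write a Python function transform(grid). {feedback}"
        )
        namespace: dict = {}
        exec(code, namespace)  # sketch only; sandbox this in practice
        transform = namespace.get("transform")
        if transform is None:
            feedback = "You must define transform(grid)."
            continue
        # 2. Verify: the candidate must reproduce every demonstration
        #    output exactly -- a cheap, fully programmatic check.
        if all(transform(p["input"]) == p["output"] for p in task["train"]):
            return [transform(p["input"]) for p in task["test"]]
        # 3. Refine: feed the failure back and retry within budget.
        feedback = "Your previous function failed on the demonstrations."
    return None  # stop before exceeding the per-task cost budget
```

The design point worth noticing is that any accuracy gain comes entirely from the loop, the verification, and the feedback, never from touching the model's weights, which is consistent with Poetiq's claim of no training or task-specific optimization.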
Judging from the 15-percentage-point jump, the lift Poetiq delivers over the base model is substantial.
Its existence shows that, without scaling up the model itself, well-designed software architecture alone can significantly improve AI performance.
From this perspective, it also bears out a judgment from OpenAI:
current large models are gradually entering a stage of "capability overhang".
The Era of "Capability Overhang" of Large Models
On the same day, OpenAI posted its outlook for 2026 on X.
In that post, OpenAI called out one keyword: capability overhang.
Its core meaning:
there is a huge gap between what current models can do and how people actually use AI, and therefore between potential and realized value.
OpenAI believes that future progress toward AGI will depend not only on breakthroughs in the models themselves, but also on:
Whether people know how to use AI effectively
Whether AI can be truly integrated into real work and life
Whether the system can convert the model's capabilities into real value
Accordingly, in 2026 OpenAI will continue its frontier research while leaning into the application layer, the system layer, and human-machine collaboration, with particular emphasis on medical, business, and everyday-life scenarios.
Human-Machine Collaboration
The Other Half of the AGI Puzzle
OpenAI's official post touches on the problem of human-machine collaboration.
Reaching AGI requires both sides to move: it depends not only on model upgrades but also on teaching people to use AI.
Only when people use AI well enough to realize its potential can it shift from "showing off" to "benefiting the public" and genuinely touch the lives of hundreds of millions.
The view drew a strong response from the community.
One optimistic netizen replied: "Just automate my whole self!"
Others noted that the real challenge is weaving AI into workflows: many organizations have "bought AI" but never changed a single process.
Is the Capability of Large Models Really "Overhanging"?
So is it really the case, as OpenAI says, that large-model capabilities are overhanging?
Judging from Poetiq's announced ARC-AGI-2 results, the Poetiq (GPT-5.2 X-High) score of 75% exceeds the human average (60%) by 15 percentage points.
Earlier, OpenAI emphasized that GPT-5 hit expert-level marks on complex interdisciplinary problems, which outside observers read as "doctoral-level intelligence".
In other words, on certain professional tasks, large models such as GPT-5 perform at a level comparable to human PhD holders.
Seen from the model itself, capability may not be fully overhanging; seen from how much of that capability goes untapped, the overhang is already severe.
One cause lies with the model designers: failing, for example, to keep pace with users' real scenarios and "no longer walking alongside users".
Another may be the lack of fundamental breakthroughs in the reasoning and creativity of frontier models.
A third is that models iterate so fast that users must keep abandoning the versions they had already grown used to in daily life.
Both the emergence of Poetiq and OpenAI's "capability overhang" judgment point to a new direction for the field:
the next stage of AI competition will no longer be merely a contest of model parameters, but of systems, processes, and human-machine collaboration.
References:
https://x.com/poetiq_ai/status/2003546910427361402
https://x.com/OpenAI/status/2003594025098785145
This article is from the WeChat official account "New Intelligence Yuan", author: New Intelligence Yuan, editor: Yuanyu. Published by 36Kr with permission.