OpenAI's $3 Trillion Test: AI Enters Its First Battle Against Human Experts from 44 Industries
In the second half of the AI era, AGI is already old news and ASI is leading a new intelligence revolution. GDPval, the evaluation system just launched by OpenAI, probes the potential of large models through real-world work tasks, revealing how AI is moving from the laboratory to a $3 trillion economic battlefield and helping humanity shed daily drudgery and embrace a creative future.
The second half of the AI era has truly arrived!
AGI is outdated. What the AI industry is now discussing is artificial superintelligence (ASI):
AGI can free humans from 80% of their daily work;
ASI, by contrast, is a system that comprehensively surpasses human intelligence.
Just recently, in an interview with a16z, Jakub Pachocki, Chief Scientist at OpenAI, revealed that the next step on OpenAI's research roadmap is reasoning, and that the key goal for the next five years is to create automated researchers:
AI that automatically discovers new ideas, automating the work of researchers and machine-learning research itself.
However, the clearest way to understand the potential of AI is not to predict the future, but to see what the models can already do now.
Historical experience tells us that, from the Internet to smartphones, every major technology has taken more than a decade to go from birth to widespread adoption.
OpenAI hopes to demonstrate in a more transparent way how large models can truly serve the real world.
Therefore, they have launched a brand-new evaluation system, GDPval, to examine the progress trajectory of AI based on evidence rather than baseless speculation.
Paper link: https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf12ce/GDPval.pdf
Dataset: https://huggingface.co/datasets/openai/gdpval
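For readers who want to poke at the data themselves, here is a minimal sketch of loading the open-sourced gold-standard subset with the Hugging Face `datasets` library; the split name and record fields are assumptions about the published format, not documented guarantees.

```python
# Minimal sketch: load the open-sourced GDPval gold-standard subset.
# Assumes the `datasets` library is installed and that the dataset
# exposes a default "train" split; field names are illustrative.
from datasets import load_dataset

ds = load_dataset("openai/gdpval", split="train")
print(len(ds))   # expected: 220 gold-standard tasks
print(ds[0])     # inspect one task record (prompt, occupation, reference files, ...)
```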
In GDPval, expert reviewers compared the outputs of top-tier models with the work of human experts.
Lawrence H. Summers, a professor and former president of Harvard University and a member of OpenAI's board of directors, believes the new research is exciting:
In many practical tasks, even with limited guidance, AI performs as well as or better than humans;
The combination of humans and AI can be more efficient;
AI has surprising capabilities that can be used to evaluate and subsequently improve its performance.
OpenAI admits that Claude Opus 4.1 performs best, matching or beating experts in nearly half of the tasks and significantly outperforming GPT-5.
OpenAI's rate of progress is remarkable nonetheless: within one year, the win rate of GPT-series models has nearly doubled.
GDPval: Measuring the $3 Trillion Impact of AI
In the past, large-model evaluations often focused on academic tests or programming challenges.
While these evaluations have done much to advance models' reasoning abilities, a gap remains between them and real-world work scenarios.
To fill this gap, OpenAI has gradually developed a series of more practical, economically meaningful evaluations:
from the traditional MMLU (exam-style questions covering multiple disciplines),
to the more practical SWE-Bench (software-engineering bug-fixing tasks), MLE-Bench (machine-learning engineering tasks such as model training and analysis), and PaperBench (reasoning about and evaluating scientific research papers),
to the market-based SWE-Lancer (freelance software-development tasks drawn from real-world transactions).
GDPval is the next key node in this evolutionary path.
This evaluation is drawn directly from real-world work tasks, covering 44 occupations across 9 major industries with a combined annual economic value of $3 trillion.
The full task set contains 1,320 highly specialized tasks, of which 220 form the open-sourced gold-standard subset.
These tasks are derived from real-world work outputs, such as legal opinions, engineering drawings, customer-service conversation records, or nursing plans.
Each task goes through multiple rounds of strict review to ensure three things: it closely mirrors a real-world work scenario; it can be completed independently by a professional in the same field; and it has clear evaluation criteria.
Each task undergoes an average of 5 rounds of expert review; the review team includes the writers of other tasks and independent professional reviewers, supplemented by automated checks of model feasibility and clarity.
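To make the acceptance criteria concrete, here is a hypothetical data model for a task and its review trail; every class and field name below is our own illustration, not OpenAI's actual schema.

```python
# Illustrative data model for a GDPval task and its review trail.
# All names are hypothetical; the three boolean fields mirror the
# acceptance criteria described above.
from dataclasses import dataclass, field

@dataclass
class ReviewRound:
    reviewer: str          # task writer, independent professional, or automated check
    realistic: bool        # closely mirrors a real-world work scenario
    self_contained: bool   # completable by a professional in the same field
    gradable: bool         # has clear evaluation criteria
    notes: str = ""

@dataclass
class GDPvalTask:
    occupation: str
    industry: str
    prompt: str
    reference_files: list[str] = field(default_factory=list)
    reviews: list[ReviewRound] = field(default_factory=list)

    def accepted(self) -> bool:
        # A task passes only if every review round affirms all three criteria.
        return bool(self.reviews) and all(
            r.realistic and r.self_contained and r.gradable for r in self.reviews
        )
```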
What makes GDPval unique is that its tasks are not only realistic and varied in form but also highly professional and representative.
Unlike traditional evaluations, GDPval is not a set of simple text prompts: the model must process complete reference materials and work context, and the output is not limited to text but also includes documents, slide decks, charts, spreadsheets, and even multimedia content.
Of course, GDPval is only a starting point and does not yet capture the full complexity of real-world knowledge work.
Still, it helps us see clearly that large models can not only solve laboratory problems but may also serve as reliable assistants in the daily work of millions of people.
Read that again: AI is no longer just passing exams; it is starting to be judged by the scorecard of civilization itself: GDP.
Independent researcher Shanaka Anslem Perera said:
This is not just an evaluation system; it is more like the birth of a certain economic organism.
GDPval is the first accounting system of the post-human economic era.
Today, it is a "benchmark"; tomorrow, it will become the scoreboard for new species.
When AI's output starts to be counted in GDP, it is no longer just a tool but a fourth factor of production beyond land, labor, and capital.
AI Approaches Professional Level in Half of the Tasks
Early test results show that today's leading large models perform at a level close to, or on par with, industry experts on some tasks.
On the 220 gold-standard tasks, industry experts blind-graded multiple mainstream models:
GPT-4o, o4-mini, OpenAI o3, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4.
The results show:
- Claude Opus 4.1 is strongest on aesthetics (document formatting, slide layout, and the like);
- GPT-5 leads on accuracy, especially at pinpointing domain-specific knowledge.
The output quality of the most advanced models approaches the level of industry experts, and Claude Opus 4.1 stands out in particular:
in nearly half of the tasks, its output was rated "as good as" or even "better than" the human expert's.
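The headline "nearly half" figure comes from pairwise blind grading. The sketch below shows the underlying arithmetic on made-up ratings; the label names are hypothetical, and the win rate counts both ties and outright wins, matching how "as good as or better" is reported above.

```python
# Sketch of the pairwise-grading arithmetic: experts blindly compare a model
# deliverable against the human expert's deliverable for the same task.
from collections import Counter

ratings = ["model_better", "tie", "human_better", "tie", "model_better"]  # hypothetical

counts = Counter(ratings)
wins_or_ties = counts["model_better"] + counts["tie"]
win_rate = wins_or_ties / len(ratings)
print(f"wins+ties rate: {win_rate:.0%}")  # share of tasks at or above expert level
```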
From GPT-4o (released in spring 2024) to GPT-5 (released in summer 2025), average model performance on GDPval tasks has nearly doubled, tracing a clearly linear trend.
OpenAI also found that top-tier models complete GDPval tasks at roughly 1% of the human time and cost on average, that is, about 100 times faster and 100 times cheaper.
However, these figures count only model inference time and API-call costs, excluding the human supervision, iterative revision, and real-world integration that actual work processes require.
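A back-of-the-envelope calculation illustrates what the 100x claims mean under this narrow accounting; the human and model numbers below are invented placeholders, not figures from the paper.

```python
# Hypothetical illustration of the ~100x speed/cost claim. Only inference
# time and API cost are counted, matching the caveat above (no review,
# revision, or integration cost on either side).
human_hours, human_rate = 6.0, 75.0          # e.g., 6 hours at $75/hour
model_minutes, model_api_cost = 3.6, 4.50    # e.g., 3.6 minutes, $4.50 in tokens

speedup = (human_hours * 60) / model_minutes
cost_ratio = (human_hours * human_rate) / model_api_cost
print(f"~{speedup:.0f}x faster, ~{cost_ratio:.0f}x cheaper")  # ~100x and ~100x
```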
Nevertheless, for task types where the model performs especially well, letting AI take the first pass and having humans step in afterward may be an ideal strategy for saving time and money.
How to Optimize Models to Improve GDPval Performance
To test whether GPT-5's performance on GDPval tasks could be improved, OpenAI further trained an experimental internal variant of GPT-5.
The results confirm that this training substantially improved the model's performance, demonstrating headroom for further optimization.
The controlled experiments in the figure below reinforce the point: scaling up the model, prompting it to take more reasoning steps, and supplying richer task context all yield measurable performance gains.
OpenAI also designed a general-purpose prompt requiring the model to rigorously self-check before submitting its results; it applies across a range of multimodal economic tasks and is not overfitted to specific problems.
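OpenAI has not published the exact prompt, but a self-check scaffold in this spirit might look like the sketch below, using the official `openai` Python SDK; the prompt wording, the user task, and the model name passed to the API are all our own assumptions.

```python
# Sketch of a generic "self-check before submitting" scaffold, in the spirit
# of the prompt described above; the wording is our own illustration, not
# OpenAI's actual text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SELF_CHECK = (
    "Before submitting, rigorously verify your deliverable: check factual "
    "accuracy, completeness against every instruction, formatting of any "
    "files, and internal consistency. Revise once if you find defects."
)

resp = client.chat.completions.create(
    model="gpt-5",  # model name as reported in the article; substitute as needed
    messages=[
        {"role": "system", "content": SELF_CHECK},
        {"role": "user", "content": "Draft a one-page legal memo from the attached brief."},
    ],
)
print(resp.choices[0].message.content)
```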
The Most Lavish Reviewer Lineup: 14-Year Industry Veterans from Top-Tier Institutions
In GDPval tasks, to evaluate the actual performance of models, OpenAI relies on senior practitioners as "reviewers".
Expert selection required at least 4 years of industry experience and a résumé showing professional recognition, a promotion track record, and management responsibility. The experts on this project average 14 years of work experience.
The industry - expert team has worked in the following representative institutions:
Meta, Microsoft, Morgan Stanley, Google, Oracle, Apple, General Electric, Goldman Sachs, HBO, IBM, J