Which AI is best at real work? OpenAI ran the test itself, and its own model didn't come out on top.

量子位 (QbitAI) | 2025-09-26 18:29
A high-quality subset of the new benchmark has been open-sourced.

OpenAI has released new research, and in it, the company gives a shout-out to Claude.

It proposes a new benchmark called GDPval to measure how AI models perform on economically valuable real-world tasks.

Specifically, GDPval covers 44 occupations across the 9 industries that contribute the most to US GDP; together these occupations generate some $3 trillion in annual revenue. The tasks are built around the representative work of industry experts with an average of 14 years of experience.

Professional graders compared the outputs of mainstream models against the work of human experts.

In the final test, Claude Opus 4.1 emerged as the best-performing model, with 47.6% of its outputs rated as good as or better than the work of human experts.

GPT-5 ranks second at 38.8%, still behind Claude; GPT-4o wins or ties against the human experts in only 12.4% of cases.

Having not come out on top, OpenAI also offers itself some consolation: different models have different strengths. Claude Opus 4.1 stands out mainly on aesthetics, while GPT-5 is stronger on accuracy.

OpenAI also says the pace of progress is worth noting: the win rate of its frontier models has nearly doubled in just one year.

Finally, OpenAI has open-sourced a high-quality subset of 220 tasks and provides a public automated grading service.

Commenters found the research intriguing:

OpenAI's models improve roughly linearly from generation to generation, and it's nice of them to acknowledge a competitor.

Others suspect it may be a carefully designed publicity move by Altman, talking up AI's contribution to GDP growth in order to help raise funds.

Now, let's take a closer look at this test.

Testing the "Money - Making" Ability of AI

OpenAI argues that GDPval improves on existing AI evaluations in several ways:

  • The tasks are based on real-world work products and come with completion times and costs, making them realistic;
  • They cover most of the occupational work activities tracked by O*NET (the US Occupational Information Network), giving them representative breadth;
  • The tasks require handling files in multiple formats and parsing multiple reference files, involving computer use and multimodality;
  • Beyond correctness, subjective factors such as structure and style must be considered, so the dataset can also serve as a testbed for evaluating automated grading systems;
  • Win rate is the primary metric; it has no ceiling and supports continuous evaluation;
  • The tasks are genuinely difficult: industry professionals need 7 hours on average to complete them, and complex tasks can take weeks.

The task construction process starts with determining the core industries and occupations.

OpenAI first selected the 9 industries that each contribute more than 5% of US GDP (based on each industry's share of US value added in Q2 2024), then chose, within each industry, the 5 occupations that contribute most to the total wage bill and whose work is primarily digital.

To decide whether an occupation is "primarily digital", OpenAI looked at all of that occupation's tasks in O*NET and used GPT-4o to classify each one as digital or non-digital. Tasks were weighted by their O*NET relevance, importance, and frequency scores; an occupation was included if more than 60% of its weighted tasks were digital, as in the rough sketch below.
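The screening rule can be sketched as follows. This is only an illustration: the exact weighting scheme (here relevance × importance × frequency) and the data fields are assumptions, and the digital/non-digital labels stand in for GPT-4o's classification.

```python
from dataclasses import dataclass


@dataclass
class OnetTask:
    description: str
    relevance: float   # O*NET relevance rating for this occupation
    importance: float  # O*NET importance rating
    frequency: float   # O*NET frequency rating
    is_digital: bool   # label produced by a "digital / non-digital" classifier


def digital_share(tasks: list[OnetTask]) -> float:
    """Weighted share of digital tasks (weighting scheme assumed for illustration)."""
    weight = lambda t: t.relevance * t.importance * t.frequency
    total = sum(weight(t) for t in tasks)
    digital = sum(weight(t) for t in tasks if t.is_digital)
    return digital / total if total else 0.0


def is_mostly_digital(tasks: list[OnetTask], threshold: float = 0.6) -> bool:
    """Include the occupation when more than 60% of its weighted tasks are digital."""
    return digital_share(tasks) > threshold
```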

This process yielded 44 occupations, which together generate an annual revenue of $3 trillion.

Next, industry professionals were recruited. Experts participating in task creation had to have at least 4 years of relevant occupational experience, with résumés reflecting professional recognition, promotions, and management responsibility.

Statistics show that the recruited industry experts have an average of 14 years of experience.

Candidates had to pass video interviews, background checks, training, and tests before joining the project (OpenAI also pays them generously). Their former employers include well-known companies and institutions such as Apple, Google, Microsoft, Meta, Samsung, Oracle, IBM, and JPMorgan Chase, so the experts have a solid grounding in industry practice.

In the task creation phase, each GDPval task consists of two parts: a "requirement" and a "deliverable". Industry experts design tasks around their own occupation's task categories in O*NET to ensure breadth and representativeness of coverage.

To gauge task quality, OpenAI asked the experts to rate each task for difficulty, representativeness, completion time, and overall quality against the real standards of their occupations. Each task's economic value is then estimated by multiplying the average completion time by the hourly wage, as in the toy calculation below.
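As a toy illustration of that valuation (the numbers are invented, not GDPval data):

```python
def task_value(avg_hours: float, hourly_wage: float) -> float:
    """Economic value of a task = average expert completion time x hourly wage."""
    return avg_hours * hourly_wage


# e.g. a 7-hour task in an occupation paying $80/hour would be valued at $560
print(task_value(avg_hours=7, hourly_wage=80.0))  # 560.0
```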

In total, the full GDPval set contains 1,320 tasks. Every task went through an iterative process of automated model screening plus multiple rounds of human expert review, receiving at least 3 and on average 5 human reviews.

Experts leave detailed comments at each review stage, and tasks are repeatedly revised and improved in response.

Claude's Performance Is Comparable to Human Experts

OpenAI open-sourced a high-quality subset of 220 tasks and had experts grade it via blind pairwise comparison, i.e., graders compare two deliverables without knowing which source produced each.

Each pairwise grading takes more than an hour on average. OpenAI says it also brought in additional experts from a range of professional fields to grade both the human deliverables and the model outputs, and graders must give detailed justifications for their choices and rankings.
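The headline numbers, such as Claude Opus 4.1's 47.6%, are win-or-tie rates aggregated from these blinded judgments. Here is a minimal sketch of that tally, using made-up verdict data rather than OpenAI's actual grading pipeline:

```python
from collections import Counter

# Each record is a blinded expert verdict on one task:
# "model" = model deliverable preferred, "human" = expert deliverable preferred, "tie" = equivalent.
verdicts = ["model", "human", "tie", "human", "model", "tie", "human", "human"]


def win_or_tie_rate(verdicts: list[str]) -> float:
    """Share of comparisons where the model output was rated as good as or better than the expert's."""
    counts = Counter(verdicts)
    return (counts["model"] + counts["tie"]) / len(verdicts)


print(f"{win_or_tie_rate(verdicts):.1%}")  # 50.0% for this toy sample
```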

For the high-quality subset, OpenAI also built an experimental automated grader whose agreement with human expert grading reaches 66%, only 5 percentage points below human-to-human agreement (71%).

After evaluating GPT-4o, o4-mini, o3, GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4, the results show:

On the GDPval high-quality subset, Claude Opus 4.1 is the best-performing model overall, excelling especially in aesthetics (such as document formatting and slide layout).

47.6% of its output is rated as better than or equivalent to the work of human experts.

OpenAI's own models improve roughly linearly across generations on GDPval.

As shown in the accompanying figure, GPT-5 has a clear advantage in accuracy (such as strictly following instructions and performing calculations correctly).

In other words, GPT-5 does better on pure text tasks, while Claude is better at handling file types such as .pdf, .xlsx, and .ppt, showing stronger visual perception and aesthetic design ability.

Across the GDPval high-quality subset, in slightly more than 50% of tasks the output of at least one model was rated better than or on par with the human expert's.

OpenAI also points out that pairing AI models with human oversight is expected to complete tasks more cheaply and quickly than human experts working alone.

Whether the model drafts first and a human revises, the model's output is used directly, or the model attempts the task once before a human redoes it from scratch, all of these workflows can save cost and time.

The research also found that increasing reasoning effort (e.g., different reasoning settings for o3 and GPT-5), providing more task context, and improving prompts and agent scaffolding (e.g., allowing GET requests inside the container, and best-of-N sampling with N = 4 using GPT-5 as the judge model) can significantly improve model performance; a sketch of the best-of-N pattern follows.
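The best-of-N idea can be sketched roughly as follows. The `generate` and `judge_best` callables are hypothetical placeholders rather than a real API; the point is simply sampling several candidate deliverables and letting a judge model (GPT-5 in OpenAI's setup) pick one.

```python
from typing import Callable


def best_of_n(
    task_prompt: str,
    generate: Callable[[str], str],               # draws one candidate answer from the model under test
    judge_best: Callable[[str, list[str]], int],  # judge model returns the index of the best candidate
    n: int = 4,
) -> str:
    """Sample n candidate deliverables and return the one the judge model prefers."""
    candidates = [generate(task_prompt) for _ in range(n)]
    best_index = judge_best(task_prompt, candidates)
    return candidates[best_index]
```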

OpenAI also acknowledges GDPval's limitations: the dataset is small (only 44 occupations), it focuses on knowledge work that can be done on a computer (excluding physical labor and the like), the tasks are precisely specified one-shot tasks (lacking interactivity), the automated grader is imperfect, and evaluation is expensive.

GDPval is still in an early stage; OpenAI plans to gradually expand its coverage, make it more realistic and interactive, and incorporate more scenario detail in future versions.

Incidentally, OpenAI isn't the only one that thinks highly of Claude. Recent reports say Microsoft, once OpenAI's closest ally, is joining hands with Anthropic.