
OpenAI test claims GPT-5 comparable to experts

Friends of 36Kr, 2025-09-26 09:26
OpenAI launches GDPval benchmark, GPT-5 and Claude Opus 4.1 approach expert level.

OpenAI stated that its GPT-5 model and Claude Opus 4.1 from its competitor Anthropic "are already approaching the work quality of industry experts."

On Thursday (September 25th, local time), artificial intelligence (AI) research company OpenAI released a new benchmark test to compare the performance of its AI models with that of professionals in various industries.

The test, named GDPval, is a preliminary attempt to assess how close OpenAI's systems are to surpassing humans in economically valuable work, a key part of the company's effort to develop artificial general intelligence (AGI).

OpenAI said on Thursday that its GPT-5 model and Claude Opus 4.1 from its competitor Anthropic "are already approaching the work quality of industry experts."

This doesn't mean that OpenAI's models will immediately replace human workers. Although some CEOs predict that AI will replace humans within a few years, OpenAI admits that GDPval currently covers only a limited portion of the tasks people actually perform in their jobs. Even so, it is one of the latest ways for the company to measure AI's progress towards that milestone.

GDPval is based on the nine industries that contribute the most to US GDP, including healthcare, finance, manufacturing, and government. The test covers 44 occupations, ranging from software engineers to nurses and journalists.

In the first version, GDPval-v0, OpenAI invited senior professionals to compare the reports generated by AI with the work of other professionals and select the better one.

For example, one task required investment bankers to create a competitive landscape analysis for the "last-mile delivery industry" and compare it with the report generated by AI. OpenAI then calculated the average "win rate" of the AI model against human reports across all 44 occupations.
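
As a rough illustration of the "win rate" arithmetic described above, the sketch below tallies grader verdicts per occupation and averages the results. The article does not say whether OpenAI averages per occupation or over all tasks pooled together, so the per-occupation average here is an assumption, and the function name, verdict labels, and sample data are all hypothetical.

```python
from collections import defaultdict


def win_or_tie_rate(gradings):
    """Return (per-occupation rates, average across occupations).

    `gradings` is an iterable of (occupation, verdict) pairs, where the
    verdict is "win", "tie", or "loss" from the AI model's point of view.
    Averaging the per-occupation rates is an assumption made for this
    sketch, not a confirmed detail of GDPval.
    """
    outcomes = defaultdict(list)
    for occupation, verdict in gradings:
        # A win or a tie counts as "better than or equal to" the expert.
        outcomes[occupation].append(verdict in ("win", "tie"))

    rates = {occ: sum(flags) / len(flags) for occ, flags in outcomes.items()}
    overall = sum(rates.values()) / len(rates)
    return rates, overall


# Hypothetical verdicts, invented purely to exercise the function.
sample = [
    ("nurse", "win"),
    ("nurse", "loss"),
    ("journalist", "tie"),
    ("journalist", "win"),
    ("software engineer", "loss"),
    ("software engineer", "loss"),
]

rates, overall = win_or_tie_rate(sample)
print(f"{overall:.1%}")  # prints "50.0%"
```

With this made-up sample, the model wins or ties half the nurse tasks, all of the journalist tasks, and none of the software engineer tasks, so the average across the three occupations is 50%.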

The results showed that GPT-5-high (the high-compute version of GPT-5) was rated as better than or equal to industry experts in 40.6% of cases.

Anthropic's Claude Opus 4.1 model was rated as good as or better than industry experts on 49% of tasks, outperforming OpenAI's model.

OpenAI explained that Claude scored higher partly because it tends to generate more aesthetically pleasing charts, not purely because of superior performance.

It should be noted that the work of most occupations involves much more than just submitting research reports, which is all that GDPval-v0 tests. OpenAI acknowledges this and plans to develop more comprehensive tests in the future, covering more industries and interactive work processes.

Nevertheless, OpenAI believes that the progress shown on GDPval is significant.

Aaron Chatterji, the chief economist at OpenAI, said in an interview that the test results of GDPval indicate that people in these positions can use AI models to save time and focus on more meaningful work.

"As the models have become quite good at certain things, as their capabilities improve, people can increasingly delegate some work to the models and do potentially more valuable things," Chatterji said.

Tejal Patwardhan, the head of evaluation at OpenAI, said that she is encouraged by the rate of progress on GDPval.

Patwardhan pointed out that the GPT-4o model, released about 15 months ago, scored only 13.7% (winning or tying with humans), while GPT-5's score of 40.6% is almost three times that figure. She expects this trend to continue.

This article is from the WeChat official account "Science and Technology Innovation Board Daily." Author: Xia Junxiong. Republished by 36Kr with authorization.