GPT-5 scored only 23.3%. The world's top AI systems collectively flunked this brutally difficult programming exam, shattering the aura of their gold-medal wins.
In programming contests, the world's top LLMs have been sweeping gold. Are they truly invincible? The toughest coding benchmark yet, SWE-Bench Pro, has arrived, with problems whose reference solutions average more than 100 lines of code. Unexpectedly, even the most powerful LLMs came up short: GPT-5 scored only 23.3%.
After topping IMO 2025, models from Google and OpenAI have gone on to win ICPC gold as well.
ICPC is widely regarded as one of the most challenging programming competitions for university students worldwide.
OpenAI and Google not only solved all 12 problems but also outranked the top human contestants. Is AI really unbeatable at programming?
Now, the latest benchmark has delivered a reality check to the world's top models.
It is SWE-Bench Pro, a new-generation benchmark designed specifically to evaluate AI programming agents on real, enterprise-level engineering tasks.
Compared with its predecessor, SWE-Bench, the Pro version delivers three major upgrades:
- Substantially higher task difficulty
- Stronger resistance to data contamination
- Tasks that much more closely mirror real code repositories
This version can be regarded as the "Humanity's Last Exam" of coding. In the actual test on the public set, even the top models were roundly beaten.
Although GPT-5 ranked first, its score was only 23.3%. Claude Opus 4.1 ranked second with a score of 22.7%.
None of the other models were competitive, and all scored below 15%.
This shows that on programming tasks closer to the real world, the long-horizon coding ability of LLMs remains a weak point.
The latest 21-page technical paper details the design of SWE-Bench Pro.
Paper link: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20%289%29.pdf
The coding field needs a hardcore exam
Benchmarks such as SWE-Bench have long served as the standard for evaluating LLMs on software engineering.
In these tests, the AI is typically given a complete code repository plus a natural-language issue description and asked to generate a code patch.
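To make the setup concrete, here is a minimal sketch of what a SWE-Bench-style task instance and its pass criterion look like. The field names are loosely modeled on the public SWE-Bench dataset schema and are assumptions here, not SWE-Bench Pro's exact format.

```python
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    """One SWE-Bench-style task: a repository snapshot plus an issue description.

    Field names loosely follow the public SWE-Bench schema (assumed here);
    SWE-Bench Pro's exact format may differ.
    """
    instance_id: str                    # e.g. "<org>__<repo>-<number>" (illustrative)
    repo: str                           # source repository
    base_commit: str                    # commit the agent's workspace is checked out at
    problem_statement: str              # natural-language bug report / feature request
    fail_to_pass: list[str] = field(default_factory=list)  # tests the patch must fix
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must not regress

def is_resolved(task: TaskInstance, test_results: dict[str, bool]) -> bool:
    """A task counts as resolved only if every previously failing test now passes
    and no previously passing test breaks after applying the model's patch."""
    return all(test_results.get(t, False) for t in task.fail_to_pass) and \
           all(test_results.get(t, False) for t in task.pass_to_pass)
```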
Take SWE-Bench Verified for example. The top LLMs have achieved a success rate of over 70%, which sounds really impressive.
However, this also exposes a problem:
Within the next 6 to 12 months, existing benchmarks may become "saturated" and no longer effectively measure AI progress.
Therefore, Scale AI officially released SWE-Bench Pro.
It provides a more realistic and challenging "examination room" to directly address the shortcomings of the existing benchmarks.
Data contamination and disconnection from reality
Existing coding benchmarks have two major flaws.
On the one hand, there is a high risk of data contamination. Many benchmarks are built from open-source GitHub repositories, and these repositories, especially those under MIT and Apache licenses, are easily scraped into LLM training data.
As a result, the AI may effectively "cheat" on the test, having already seen similar problems during training.
On the other hand, the tasks in existing benchmarks are too simple to count as "industrial grade".
Take SWE-Bench Verified: of its 500 problems, 161 require modifying only 1-2 lines of code.
That may pass muster in the laboratory, but enterprise work often involves complex modifications spanning multiple files and hundreds of lines of code.
Such benchmarks simply cannot reflect how AI performs in real development scenarios.
Acing coding exams is not the ultimate goal of AI agents, but a more hardcore benchmark can genuinely test whether LLMs meet the bar for industrial applications.
SWE-Bench Pro: Problems averaging over 100 lines of code
SWE-Bench Pro contains a total of 1,865 manually verified and augmented problems, divided into three subsets: a public set, a commercial set, and a reserved set.
In the paper, the research team introduced three major contributions of SWE-Bench Pro:
Clever collection and design to reduce the risk of data contamination
SWE-Bench Pro adopts an innovative data collection strategy to sidestep contamination:
(1) Only code repositories under strong copyleft licenses (GPL) are used to build the public set (11 code repositories) and the reserved set (12 code repositories);
(2) Commercial code is sourced from real startups to build the commercial set (18 code repositories), capturing enterprise-level problems (a minimal sketch of the license-based selection follows the list below).
• Public set: 731 instances released publicly on HuggingFace; the statistics and model results reported in the paper are on this set. These instances come from public code repositories under copyleft licenses.
• Commercial set: 276 problems drawn from startup code repositories. This is the only set containing proprietary code, and it cannot be released publicly due to legal restrictions.
• Reserved set: 858 problems whose structure mirrors the public set but which come from different code repositories; this set is held back from release.
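The selection logic itself is simple to picture. Below is a minimal, hypothetical sketch of the license-based filter described above, assuming repository metadata with a license field and a rough activity signal; the repository names and the recent_commits_90d field are invented for illustration, not part of Scale AI's actual pipeline.

```python
# Hypothetical sketch of the license-based repository selection described above.
# Strong-copyleft (GPL-family) repos are kept; permissive MIT/Apache repos, which
# are far more likely to appear in LLM training data, are excluded.
COPYLEFT_LICENSES = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}

def eligible_for_public_set(repo_meta: dict) -> bool:
    """Keep a repository only if it is strong copyleft and actively maintained."""
    return (
        repo_meta.get("license") in COPYLEFT_LICENSES
        and repo_meta.get("recent_commits_90d", 0) > 0  # crude "actively maintained" check
    )

repos = [
    {"name": "example/gpl-app", "license": "GPL-3.0", "recent_commits_90d": 42},
    {"name": "example/mit-lib", "license": "MIT", "recent_commits_90d": 120},
]
print([r["name"] for r in repos if eligible_for_public_set(r)])  # ['example/gpl-app']
```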
Task upgrade: more challenging, diverse, and closer to industry
To ensure task complexity, Scale AI excluded tasks requiring only "minor fixes" of 1-10 lines and retained only problems that demand substantial, multi-file modifications.
The reference solutions touch an average of 4.1 files and 107.4 lines of code. All tasks require modifying at least 10 lines, and over 100 tasks require modifying more than 100 lines.
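As a rough illustration of that complexity threshold, the hypothetical filter below computes files-touched and lines-changed from a unified diff and drops trivial patches; the exact thresholds, and whether single-file patches are allowed, are assumptions rather than Scale AI's published criteria.

```python
# Hypothetical complexity filter: count files touched and lines changed in a
# unified diff, then drop patches that are too small to be "substantial".
def patch_stats(diff_text: str) -> tuple[int, int]:
    """Return (files_touched, lines_changed) for a unified diff."""
    lines = diff_text.splitlines()
    files = sum(1 for ln in lines if ln.startswith("diff --git "))
    changed = sum(
        1 for ln in lines
        if (ln.startswith("+") or ln.startswith("-"))
        and not ln.startswith(("+++", "---"))  # skip file headers
    )
    return files, changed

def is_substantial(diff_text: str, min_lines: int = 10, min_files: int = 2) -> bool:
    """Assumed thresholds: at least `min_lines` changed lines across `min_files` files."""
    files, changed = patch_stats(diff_text)
    return changed >= min_lines and files >= min_files

toy_diff = (
    "diff --git a/app.py b/app.py\n"
    "--- a/app.py\n"
    "+++ b/app.py\n"
    "-print('hi')\n"
    "+print('hello')\n"
)
print(patch_stats(toy_diff), is_substantial(toy_diff))  # (1, 2) False -> filtered out
```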
In addition to complexity, the selected code repositories are actively maintained and cover multiple fields such as consumer apps, B2B services, and developer tool platforms.
Moreover, each code repository contributes 50-100 instances (with a maximum of 100) to avoid over-reliance on a single repository.
Human-machine collaborative verification to ensure task solvability
Having these difficult problems is not enough. The last step is to ensure that they are solvable.
To this end, SWE-Bench Pro introduces a human-centered augmentation and verification process, a three-stage human-machine collaboration.
On the one hand, it clarifies ambiguous requirements and fills in missing context; on the other, by constraining the solution space, it stays flexible while avoiding false negatives.
Claude ranks first on enterprise-level code, but its "top score" is only 17.8%
The performance of different top models on SWE-Bench Pro is shown in Table 1 below.
Using Pass@1 as the resolution-rate metric, GPT-5 and Claude Opus 4.1 lead with 23.3% and 22.7%, respectively.
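Pass@1 here is simply the share of tasks whose single submitted patch passes the hidden tests, as in the minimal sketch below (the numbers in the toy example are illustrative, not results from the paper).

```python
# Pass@1 as a single-attempt resolution rate: each task is tried once, and the
# patch either resolves the issue (all required tests pass) or it does not.
def pass_at_1(results: dict[str, bool]) -> float:
    """results maps task_id -> whether the one submitted patch resolved the task."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Toy example with made-up outcomes:
print(f"{pass_at_1({'task-1': True, 'task-2': False, 'task-3': False}):.1%}")  # 33.3%
```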
Early-generation models, such as DeepSeek Qwen-3 32B and GPT-4o, perform significantly worse, at only 3.4% and 3.9% respectively.
In addition, there is a significant performance gap between the public set and the commercial set.
Even the best models score below 20% on the commercial set, underscoring how challenging enterprise-level code repositories are.
Overall, LLM pass rates top out at 23.3% on the public set and 17.8% on the commercial set, far below the 70%+ achieved on SWE-Bench Verified.
What exactly is the reason behind this?
Unfamiliar languages can also affect performance
From the perspective of programming languages, the performance of AI varies significantly.
On Go and Python tasks, most models achieve relatively high resolution rates, some even exceeding 30%.
In contrast, performance on JavaScript (JS) and TypeScript (TS) fluctuates widely: depending on the model, resolution rates range from 0% to over 30%.
Results also vary greatly across repositories: in some, every model's resolution rate stays below 10%, while in others it can reach 50%.
Repository complexity, documentation quality, and issue type also shape how well LLMs perform on these coding tasks.
Even so, Claude Opus 4.1 and GPT-5 maintain relatively stable, strong performance across most repositories and programming languages.