
A major reversal in GPT-5's coding evaluation: on the surface it fails, but it actually left 63.1% of tasks unsubmitted; counting only the tasks it did submit, its score is twice Claude's.

QbitAI, 2025-09-22 19:36
GPT-5, Claude, and Gemini all made mistakes, but GPT-5 still had a slight edge.

A reversal has occurred with Scale AI's new software engineering benchmark, SWE-BENCH PRO!

On the surface, the so-called "Big Three" all failed miserably, with none of them achieving a solution rate exceeding 25%:

GPT-5, Claude Opus 4.1, and Gemini 2.5 "ranked" in the top three with solution rates of 23.3%, 22.7%, and 13.5% respectively.

However, a closer look at the data reveals a hidden twist.

Neil Chowdhury, a former OpenAI researcher, pointed out that if only submitted tasks are counted, GPT-5 achieves an accuracy of 63%, nearly twice Claude Opus 4.1's 31%!

(Isn't this another win for GPT?)

In other words, GPT-5 stays solid on the questions it chooses to answer, not far off its 74.9% on the older benchmark SWE-Bench-Verified, while Claude and the other models fall far behind.
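A quick back-of-the-envelope check of Chowdhury's figure: accuracy over submitted tasks is simply the overall solve rate divided by the submission rate. A minimal Python sketch, using only the numbers quoted in this article:

```python
# Back-of-the-envelope check of the "submitted-only" accuracy claim.
# All figures come from the article; the formula is just
# conditional accuracy = overall solve rate / share of tasks submitted.

def submitted_only_accuracy(overall_solve_rate: float, unsubmitted_rate: float) -> float:
    """Accuracy counted only over the tasks the model actually submitted."""
    submission_rate = 1.0 - unsubmitted_rate
    return overall_solve_rate / submission_rate

# GPT-5: 23.3% overall solve rate, 63.1% of tasks left unsubmitted
print(f"GPT-5 (submitted only): {submitted_only_accuracy(0.233, 0.631):.1%}")  # ~63.1%

# Claude Opus 4.1's quoted 31% submitted-only accuracy, combined with its 22.7%
# overall rate, implies it submitted roughly 0.227 / 0.31 ≈ 73% of tasks.
```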

So, what kind of benchmark test has made these top-tier models look so bad?

SWE-BENCH PRO

Let's start with the conclusion. It's not that the models have become worse, but that the questions have become more difficult.

Compared with SWE-Bench-Verified, where average solve rates run as high as 70%, SWE-BENCH PRO is far more rigorous.

On the one hand, SWE-Bench-Verified is a test set released by OpenAI in August 2024, and many of its codebases have been used as pre-training corpora for large language models, posing a risk of data contamination.

On the other hand, SWE-Bench-Verified also contains many trivial questions. For example, 161 out of 500 questions only require one or two lines of modification.

This is quite different from industrial software engineering, where changes usually span multiple files and hundreds of lines, so the benchmark cannot truly reflect the challenges of real-world development.

Based on this, SWE-BENCH PRO features brand-new questions to ensure that the models have never encountered the test content during training, giving a more realistic measure of their actual capabilities.

Diverse codebases: 1,865 problems in total, drawn from consumer applications, B2B services, and developer tools.

Specifically, SWE-BENCH PRO organizes these problems into the following three subsets:

Public set: 731 questions from 11 public codebases under copyleft licenses.

Commercial set: 276 questions from startup company codebases.

Reserved set: 858 questions from 12 public codebases under copyleft licenses.

(Note: The public set will be released on HuggingFace, while the commercial and reserved sets will remain private. Test results on the commercial set will be made public, and the reserved set is used to check whether models are overfitting. Each question consists of a task description, a relevant test set, and a runnable environment.)
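For readers who want to poke at the data themselves, here is a minimal sketch of loading the public set with the HuggingFace datasets library; the dataset id and field names are assumptions, since the article does not give the exact release path:

```python
# Minimal sketch: load the public subset from HuggingFace.
# "ScaleAI/SWE-bench_Pro" is a guessed dataset id -- check the official
# release page for the real identifier and split names.
from datasets import load_dataset

public_set = load_dataset("ScaleAI/SWE-bench_Pro", split="test")  # hypothetical id/split
print(len(public_set))       # expected: 731 public problems
print(public_set[0].keys())  # task description, tests, environment spec, etc.
```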

Drawing on strong-copyleft (GPL) codebases and codebases from real startup companies effectively addresses the data contamination problem seen in SWE-Bench-Verified.

To ensure task complexity, the research team also excluded trivial 1-10-line edits and retained questions that require extensive multi-file modifications.
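A rough illustration of the kind of filter described above, written against a plain unified diff; the exact thresholds and implementation are guesses, not Scale AI's actual selection code:

```python
# Illustrative filter: drop trivial patches (a handful of changed lines in a
# single file) and keep larger, multi-file edits. Thresholds are assumptions.

def patch_stats(diff_text: str) -> tuple[int, int]:
    """Return (files_touched, lines_changed) for a unified diff."""
    lines = diff_text.splitlines()
    files_touched = sum(1 for line in lines if line.startswith("+++ "))
    lines_changed = sum(
        1
        for line in lines
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))
    )
    return files_touched, lines_changed

def is_substantial(diff_text: str) -> bool:
    """Favour extensive, multi-file changes over trivial 1-10 line edits."""
    files_touched, lines_changed = patch_stats(diff_text)
    return lines_changed > 10 and files_touched >= 2
```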

In addition, to prevent the model from overfitting to any single codebase, these codebases are all active and cover consumer applications, B2B services, and developer tool platforms.

Next, let's see how the researchers conducted tests on these questions.

Human-in-the-loop testing phase

The research team focused the evaluation on whether a model can produce the required fix or patch once it has been given sufficient detail.

Following the approach of SWE-Bench Verified, the research team manually augmented each question in SWE-BENCH PRO, adding a problem statement, requirement descriptions, and interface information.

First, the team wrote a problem statement for the issue to be solved and supplemented it with contextual information where necessary.

Second, to head off ambiguity, each question was given a list of requirements, with the relevant classes and functions specified.
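Putting these pieces together, one augmented problem can be pictured roughly as the record below; the field names and contents are purely illustrative, not the benchmark's actual schema:

```python
# Hypothetical shape of one human-augmented problem (illustrative only).
augmented_task = {
    "problem_statement": "CSV export silently drops the header row when ...",
    "context": "Regression introduced after the exporter was refactored ...",
    "requirements": [
        "Every export format must emit a header row",
        "Existing JSON export behaviour must not change",
    ],
    "interface": {
        "classes": ["CsvExporter"],
        "functions": ["CsvExporter.write_header", "CsvExporter.export"],
    },
}
```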

After that, on the environment side, each task was evaluated in a language-specific containerized environment.

During the testing phase, the researchers used fail2pass tests to verify that the problem had been solved and pass2pass tests to ensure that existing functionality remained intact.

To ensure test quality, the fail2pass tests were manually screened to remove those that were irrelevant to the task or overly broad.

Tests that failed intermittently were run three times to ensure stable results.
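In code, the pass/fail decision described above might look like the following sketch; run_tests is a stand-in for executing a test command inside the task's container, not an API from the benchmark harness:

```python
# Sketch of the evaluation logic: a candidate patch counts as a solution only
# if the fail2pass tests now pass AND the pass2pass tests still pass.
import subprocess

def run_tests(container_cmd: list[str]) -> bool:
    """Placeholder: run a test command inside the containerized task environment."""
    return subprocess.run(container_cmd).returncode == 0

def run_stable(container_cmd: list[str], attempts: int = 3) -> bool:
    """Re-run potentially flaky tests several times; require every run to pass."""
    return all(run_tests(container_cmd) for _ in range(attempts))

def patch_resolves_task(fail2pass_cmd: list[str], pass2pass_cmd: list[str]) -> bool:
    fixed = run_stable(fail2pass_cmd)      # the reported problem is actually fixed
    unbroken = run_stable(pass2pass_cmd)   # existing functionality stays intact
    return fixed and unbroken
```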

Experimental conclusions

As mentioned at the beginning, the solution rate of large language models on SWE-BENCH PRO is only at a moderate level, far lower than the 70% on SWE-Bench Verified.

Among them, on the public set, GPT-5 and Claude Opus 4.1 achieved the highest solution rates of 23.3% and 22.7% respectively, significantly outperforming small-scale models. Claude Sonnet 4 also reached a solution rate of 16.3%.

However, older models like DeepSeek Qwen-3 32B and GPT-4o performed somewhat disappointingly, with only 3.4% and 3.9% respectively.

On the commercial set, even the best-performing model scores below 20%.

This indicates that the current models still have very limited capabilities in solving problems in real - world commercial scenarios.

Faced with these sobering results, the researchers dug deeper; their conclusions are as follows:

First, programming language, codebase, and model type are the key factors affecting model performance.

Go and Python usually fare well, with some models achieving solution rates of over 30% in these languages, while JavaScript and TypeScript show greater variance, ranging from 0% to over 30%.

Solution rates also differ markedly across codebases: some sit generally below 10%, while others exceed 50%.

Cutting-edge models such as Claude Opus 4.1 and GPT-5 perform stably across most programming languages and codebases, while small-scale models are more likely to have solution rates close to zero.

Second, the failure reasons of different models often vary.

  • The main failure mode of OPUS 4.1 is insufficient semantic understanding, with incorrect answers accounting for 35.9% and syntax errors accounting for 24.2%, indicating that it has strong technical execution ability but faces challenges in problem understanding and algorithm correctness.
  • The results of GPT-5 show that there may be differences in the effectiveness of tool use, but there are relatively few incorrect answers.
  • The main failure modes of SONNET 4 are context overflow (35.6%) and significant endless file-reading behavior (17.0%), indicating its limitations in context management and file navigation strategies.
  • The failure modes of GEMINI 2.5 are more balanced, covering tool errors (38.8%), syntax errors (30.5%), and incorrect answers (18.0%), showing that it maintains a certain level of ability in multiple dimensions.
  • As an open-source model, QWEN3 32B shows the highest tool error rate (42.0%), highlighting the importance of integrated tool use for efficient agents.

It's not hard to see that although GPT-5 sticks to its "answer only when it's sure" strategy, with a non-submission rate as high as 63.1% its overall performance is still far from satisfactory.

So, which large model will be the first to break through the 30% mark?

Reference links

[1]https://x.com/vbingliu

[2]https://scale.com/leaderboard/swe_bench_pro_public

[3]https://x.com/ChowdhuryNeil/status/1969817448229826798

[4]https://scale.com/research/swe_bench_pro

This article is from the WeChat official account "QbitAI". The author focuses on cutting-edge technologies. It is published by 36Kr with authorization.