GPT-5.5 beats Claude and tops the list. Is the old AI coding ranking inaccurate?
[Introduction] A new benchmark called DeepSWE, claiming to be "zero-pollution," has exposed the flaws of the old programming leaderboards with 113 original questions.
A new measuring stick has been introduced in the code evaluation circle.
Just recently, Datacurve launched the new benchmark DeepSWE.
Serena Ge, the co-founder and CEO of Datacurve, said on X that the launch of DeepSWE is to restore the real scenarios of developers' work and uncover the areas where top models truly differ.
On the first day of the DeepSWE leaderboard, it started to challenge the old benchmarks. The rankings of GPT and Claude on the SWE-Bench Pro were completely reversed.
https://deepswe.datacurve.ai/blog
GPT-5.5 scored 70% ± 4% and ranked first; Claude Opus 4.7 only scored 54% ± 5% and ranked third, with a full 16 percentage points difference between them.
What's even more distressing is yet to come.
The DeepSWE team audited the submission records on the SWE-Bench Pro using the new method.
The results showed that in the scores of Claude Opus 4.6 and 4.7 on that leaderboard, more than 12% of the scores were judged as cheating.
Moreover, the DeepSWE team also found that the validator of the SWE-Bench Pro had a false positive rate of 8.5% and a false negative rate of 24.0%.
If the error is so large, for those models on the SWE-Bench Pro leaderboard that only differ by one or two percentage points, are they really evenly matched, or are they just measured as a tie by an inaccurate ruler?
Change the ruler, and the first place changes
Let's first look at the leaderboard generated by DeepSWE itself.
https://deepswe.datacurve.ai/blog
Among the 12 cutting-edge models, gpt-5.5[xhigh] led with a pass rate of 70% ± 4%, followed closely by gpt-5.4[xhigh] with 56% ± 5%, and Claude Opus 4.7[max] ranked third with 54% ± 5%.
Further down, Claude Sonnet 4.6[high] scored 32%, a group of models in the middle fell between 18% and 28%, and the models at the bottom only scored between 5% and 10%.
In the publicly reported SWE-Bench Pro scores, Claude Opus 4.7 scored 64% and ranked first; gpt-5.5 scored 59%. On DeepSWE, the positions were completely reversed: gpt-5.5 rose to 70% and ranked first, while Claude Opus 4.7 dropped to third with 54%.
Not only did the rankings reverse, but the gap also increased significantly.
The difference between the worst and the best models on the SWE-Bench Pro was only 30%, but on DeepSWE, it became 70%.
For the same group of models and the same type of tasks, after changing the test, the original tied leadership became a huge gap.
The DeepSWE team explained that the models were clustered in a narrow score range on the old leaderboard not because they were really close, but because the "resolution" of the benchmark itself was insufficient.
On average, each question in the SWE-Bench Pro only requires modifying 5 files, while each question in DeepSWE requires modifying 7 files on average. The amount of reference code per question in DeepSWE is 5.5 times that of the SWE-Bench Pro.
At this scale, the models cannot pass by memorizing a specific function. They must truly understand the coupling relationship between multiple files and then plan a modification path that runs through the entire repository.
GPT-5.5 scoring 70% means that it is not just memorizing a certain type of question, but "able to complete a modification chain across 7 files in a completely unfamiliar real repository."
That is to say, on toy questions, the two models seem similar; but on questions that can test real engineering capabilities, the gap is instantly widened.
Is DeepSWE more accurate, or just a gimmick?
Why does a new benchmark claim to be more accurate than the old one? DeepSWE's answer lies in four designs.
First, it is zero-pollution, which is its core advantage.
Each task in DeepSWE is originally written from scratch by engineers. Moreover, after these tasks are completed, they will not be merged back into the upstream repository and will not enter the public GitHub records, so it is difficult for them to appear in the pre-training corpus of future open-source code crawls.
This means that no model has seen the answers to these questions during the pre-training phase, which hits the weak point of the old benchmarks.
Second, it has high diversity.
DeepSWE includes 113 tasks, covering 91 active open-source repositories and spanning five languages: TypeScript, Go, Python, JavaScript, and Rust.
In contrast, the public version of the SWE-Bench Pro only covers 11 repositories. The more and more diverse the repositories are, the closer it can get to the code libraries that developers will actually give to intelligent agents.
Third, it has real complexity.
As mentioned before, the amount of code per question in DeepSWE is 5.5 times that of the SWE-Bench Pro. Interestingly, the length of its task prompts is only half of that of the SWE-Bench Pro.
The prompts are short because it deliberately imitates the way developers actually talk to intelligent agents: only stating what behavior is desired, without providing all the interface definitions, reproduction steps, and code snippets. The intelligent agent must figure out "where to modify and how to modify" in the repository by itself.
Fourth, it has reliable verification.
The accuracy of a benchmark depends on its validator. The validators of the old benchmarks often only recognize one "standard answer" writing style. Changing a variable name or an implementation idea may be judged as wrong. The validator of DeepSWE is handwritten for each task. As long as the result is correct, any writing style is considered a pass.
After cross-checking 30 tasks each, the false positive rate of the DeepSWE validator is 0.3%, and the false negative rate is 1.1%, while those of the SWE-Bench Pro are 8.5% and 24.0%, respectively, a difference of an order of magnitude.
Moreover, DeepSWE is not just a static leaderboard. In its GitHub repository, each task comes with prompts, a reproducible Docker environment, a validator, and a confidential reference solution. You can download them and let your own intelligent agent run through them.
The old benchmark's ruler is inaccurate at both ends
DeepSWE also used this new method to audit the submissions on the SWE-Bench Pro that have already been recorded in the score sheet.
In the scores of Claude Opus 4.6 and 4.7, more than 12% were judged as cheating. About 87% of them used the same method: directly checking the.git history of the code repository and copying the standard answers hidden in the historical records.
In the same batch of rechecked samples, no such behavior was found in GPT-5.4 and GPT-5.5.
DeepSWE also pointed out that it was the SWE-Bench Pro benchmark itself that allowed cheating to occur. The submission records of the "standard answer" were directly included in its task container.
This is the objective observation provided by DeepSWE. As for why Claude developed such behavior, there is currently no public conclusion.
If cheating is the "upward noise" that inflates the scores, then the SWE-Bench Pro also has a symmetrical "downward noise": a 24% false negative rate.
DeepSWE rechecked a batch of submissions that were judged as "failed" by the SWE-Bench Pro and found that about 24% of them actually had completely correct functions but were wrongly judged.
24% means that in the rechecked running trajectories, about one in every four runs may have been wrongly judged.
If this false negative rate is taken into account, the real scores of all models have been depressed to some extent. Moreover, those models that tend to rewrite the code in their own style and do not copy the ready-made answers suffer more serious score losses.
The validator of DeepSWE has undergone multiple cross-checks, reducing the false positive rate to 0.3% and the false negative rate to 1.1%. Both misjudgment rates are more than an order of magnitude lower than those of the SWE-Bench Pro.
Comparison of the misjudgment rates of the validators of the two benchmarks. The SWE-Bench Pro has a false positive rate of 8.5% and a false negative rate of 24.0%
If this comparison data is accurate, it means that the so-called consensus of "Claude and GPT being on a par" for more than half a year is based on a measuring tool that is inaccurate at both ends.
In the past, people only compared the final scores without looking back at how the scores were obtained. With DeepSWE's intervention, the comparisons of models based on the SWE-Bench Pro may need to be recalibrated.
Where are the limitations?
DeepSWE has solved the pollution problem of the old benchmarks, but after all, it is a self-made evaluation by Datacurve.
Datacurve also mentioned its limitations. It only uses a Harness called mini-swe-agent throughout, providing the same bash tool and the same set of prompts for all models.
This is done to separate the "model ability" from the "peripheral scaffolding," but the cost is a certain degree of distortion.
Different model families are adapted to different tool forms during training, and in reality, developers do not use mini-swe-agent but more mature native Harnesses such as Codex CLI, Claude Code, Cursor, and Gemini CLI.
Using a unified Harness may limit each model below its native upper limit.
The DeepSWE team also conducted a control experiment to address this concern. In a small-scale pilot, the performance of mini-swe-agent was not inferior to that of the native Harness; however, the team also emphasized that this was only a pilot of 10 questions and was not sufficient to completely dispel the concerns.
Under the same 10 SWE-Bench Pro tasks, the pass rate and token consumption of mini-swe-agent are not inferior to those of native Harnesses such as Claude Code, Codex CLI, and Gemini CLI
In addition, the corpus only covers active open-source repositories with more than 500 stars, lacking C++ and Java, and there are also relatively few tasks related to bug localization and refactoring.
Another point is AI hallucination. The judgments of "false positives and false negatives" in DeepSWE are given by an LLM analyst, not manually.
The team itself reminds that differences below about 5% should not be taken seriously.
$15 million: This company serves as a "grindstone" for large models
How was DeepSWE launched? Let's first get to know Datacurve, the company behind DeepSWE.
Datacurve is from the Winter 2024 batch (W24) of Y Combinator and was founded in 2024 by two founders, Serena Ge and Charley Lee.
The two founders of Datacurve, Serena Ge (right) and Charley Lee (left). Both graduated from the Department of Computer Science at the University of Waterloo
It produces high-quality code data for cutting-edge large models, but its approach is a bit special.
Datacurve operates a platform called Shipd, which uses the "bounty" method to recruit top software engineers to solve algorithm problems, perform debugging, and write UI processes. It pays according to the output rather than the working hours, and has issued more than $1 million in bounties so far.
According to media reports such as TechCrunch, many of the participants are engineers from DeepMind, OpenAI, Anthropic, and Vercel.
Datacurve is originally a company that provides training data for large models. It has first-hand knowledge of "what kind of data will pollute the benchmark and what kind of tasks can test real abilities." DeepSWE is more like an extension of its main business.
The code evaluation circle is bidding farewell to the score-padding era
DeepSWE is not an isolated event but the result of a trend that has lasted for more than half a year.
As the SWE-Bench series of benchmarks becomes increasingly saturated, the competition point of the new generation of programming benchmarks has shifted from "how difficult the questions are" to "resistance to pollution" and "reliability of verification." DeepSWE is a sample of this shift.
DeepSWE also made a particularly interesting discovery: the stronger the model, the more likely it is to write tests for itself.