GPT-5 went to great lengths, even "cheating", just to outperform its old nemesis, Claude.
GPT-5 has finally been released, but compared with GPT-3.5, Sora, and other launches, it did not land as a shock. On the bright side, OpenAI has shed its image as the "king of vaporware" and focused on getting large models into real-world use. That also explains why OpenAI put so much emphasis on GPT-5's programming ability at the launch event: no AI direction this year is more practical than AI coding. A batch of AI IDE tools integrated GPT-5 almost immediately, something that would have taken at least two months in the past.
However, some media outlets reported that OpenAI "cheated" on the programming benchmark. Specifically, on the SWE-Bench Verified test, OpenAI did not actually run all 500 problems, only 477 of them, while Anthropic's Claude and Google's models were scored on the full 500.
What is even more peculiar is that SWE-Bench Verified is itself a "refined version" introduced by OpenAI. The original SWE-Bench contains 2,294 software engineering problems, and OpenAI judged some of them too difficult or too unstable to evaluate models' programming abilities fairly, so it hand-picked 500 problems to make the evaluation more reliable. The absurd part is that OpenAI then cut questions from this self-selected subset, leaving only 477 for its own evaluation.
OpenAI published a blog post on its official website explaining why it launched SWE-Bench Verified: https://openai.com/index/introducing-swe-bench-verified/
Some netizens complained: What is OpenAI afraid of?
To figure out what SWE-Bench Verified is and what abilities it actually tests, we downloaded the problems, annotations, and scoring criteria from the links on OpenAI's official website and ran a hands-on exercise of our own.
SWE-Bench Verified is a high-quality evaluation dataset of real-world software engineering problems, designed to measure code repair and code understanding. It contains 500 verified test samples, each with key information such as the code repository, problem description, repair patch, test patch, and difficulty label.
Problem difficulty is mainly labeled by estimated completion time: tasks that can be finished within 15 minutes are considered easy, while the hardest can take more than 4 hours. In SWE-Bench Verified, 38.8% of tasks can be completed within 15 minutes, 52.2% take 15 minutes to 1 hour, 8.4% take 1 to 4 hours, and only 0.6% take more than 4 hours.
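For readers who want to poke at the dataset themselves, here is a minimal sketch of loading and summarizing it. It assumes the copy published on Hugging Face under the ID princeton-nlp/SWE-bench_Verified with a test split, and that the column names used below (repo, difficulty, and so on) match the published schema; adjust them if the actual schema differs.

```python
# Minimal sketch: load SWE-Bench Verified and summarize it.
# Assumptions (not confirmed in this article): the dataset ID is
# "princeton-nlp/SWE-bench_Verified", the split is "test", and the
# columns include "repo" and "difficulty".
from collections import Counter
from datasets import load_dataset  # pip install datasets

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(ds))           # expected: 500 verified samples
print(ds.column_names)   # repo info, problem description, patches, labels, ...

# Rough difficulty distribution (the "<15 min", "15 min - 1 hour", ... buckets)
print(Counter(row["difficulty"] for row in ds))

# Which repositories contribute the most problems
print(Counter(row["repo"] for row in ds).most_common(10))
```

Counting rows per repository is also a quick way to verify the repository distribution described below.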
The test samples come from multiple well-known open-source projects, including django/django, sympy/sympy, sphinx-doc/sphinx, pandas/pandas, scikit-learn/scikit-learn, matplotlib/matplotlib, pytorch/pytorch, numpy/numpy, requests/requests, etc.
Each project probes a different aspect of a model's coding ability. For example, django/django, the project with the largest share, mainly tests understanding of a large-scale Web framework, especially database query optimization, URL routing, and middleware handling; pandas/pandas, the representative data-analysis project, tests mastery of data structures and data-processing algorithms, especially for large-scale data and complex transformations.
We asked GPT-5 to pick 10 representative projects that together cover the range of abilities being tested.
1. Django/Django - The King of Web Frameworks
GitHub: https://github.com/django/django
Problem: Optimize the .delete() method to use only necessary fields
Test Focus: Database query optimization and performance testing
Significance: Django is the most popular Python Web framework. This problem involves ORM performance optimization and tests the efficiency of database operations
2. SymPy/SymPy - Symbolic Mathematical Computation
GitHub: https://github.com/sympy/sympy
Problem: Distance calculation error (3D coordinates are ignored)
Test Focus: Numerical calculation accuracy and boundary condition testing
Significance: SymPy is a Python symbolic mathematics library. It tests the accuracy of mathematical calculations and the handling of boundary conditions (a minimal illustration of this kind of bug appears after this list)
3. Sphinx-doc/Sphinx - Documentation Generation Tool
GitHub: https://github.com/sphinx-doc/sphinx
Problem: 404 link problem in SVG format of inheritance diagrams
Test Focus: Documentation generation and link integrity testing
Significance: Sphinx is the standard Python documentation generation tool. It tests the correctness of document rendering and links
4. Matplotlib/Matplotlib - Data Visualization
GitHub: https://github.com/matplotlib/matplotlib
Problem: The logarithmic axis reversal function fails
Test Focus: Graphic rendering and coordinate system testing
Significance: Matplotlib is the benchmark Python plotting library. It tests the coordinate transformation of complex graphic systems
5. Scikit-learn/Scikit-learn - Machine Learning
GitHub: https://github.com/scikit-learn/scikit-learn
Problem: Problem with the store_cv_values parameter of RidgeClassifierCV
Test Focus: Machine learning parameter verification testing
Significance: Scikit-learn is one of the most important ML libraries. It tests the handling of algorithm parameters and cross-validation
6. Astropy/Astropy - Astrophysics
GitHub: https://github.com/astropy/astropy
Problem: Incorrect calculation of the separability matrix for nested composite models
Test Focus: Complex model combination and mathematical calculation testing
Significance: Astropy is specifically used for astronomical calculations. It tests the combination logic of complex mathematical models
7. Pydata/Xarray - Multidimensional Data Analysis
GitHub: https://github.com/pydata/xarray
Problem: Type coercion of Variable.__setitem__ for objects with a values attribute
Test Focus: Multidimensional data type handling testing
Significance: Xarray handles multidimensional labeled arrays. It tests data type conversion and attribute access
8. Pytest-dev/Pytest - Testing Framework
GitHub: https://github.com/pytest-dev/pytest
Problem: ValueError occurs when collecting tests for patch arrays
Test Focus: Testing the functionality of the testing framework itself
Significance: Pytest is the standard Python testing framework. It tests the stability of the testing tool itself
9. Pylint-dev/Pylint - Code Quality Check
GitHub: https://github.com/pylint-dev/pylint
Problem: The short form of the verbose option wrongly requires an argument value
Test Focus: Command-line tool interface testing
Significance: Pylint is a code quality checking tool. It tests command-line argument parsing and the user interface
10. PSF/Requests - HTTP Library
GitHub: https://github.com/psf/requests
Problem: Binary payload requests fail due to a call to to_native_string
Test Focus: HTTP protocol and binary data testing
Significance: Requests is the most popular HTTP library. It tests network communication and data encoding processing
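To give a feel for the kind of boundary-condition bug these tasks target, here is a minimal illustration of the failure mode described in item 2, the SymPy distance problem. This is plain Python written for this article, not SymPy's actual code or official patch: a naive distance function built on zip() silently drops a 3D point's extra coordinate, while the corrected version pads the shorter point with zeros.

```python
# Illustrative only: the kind of bug described in item 2 above,
# written in plain Python rather than SymPy's real implementation.
import math

def naive_distance(p, q):
    # zip() stops at the shorter point, so a third coordinate is silently ignored
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fixed_distance(p, q):
    # Pad the shorter point with zeros so every dimension is counted
    n = max(len(p), len(q))
    p = tuple(p) + (0,) * (n - len(p))
    q = tuple(q) + (0,) * (n - len(q))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(naive_distance((2, 0), (1, 0, 2)))  # 1.0 -- the z-coordinate is dropped
print(fixed_distance((2, 0), (1, 0, 2)))  # 2.236... == sqrt(5), the correct value
```

In the real benchmark the model has to locate and fix the equivalent logic inside the SymPy code base, and the test patch bundled with the sample decides whether the repair counts.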
As for why OpenAI dropped 23 test questions instead of using the full set, the answer may lie in the following ranking: on the full SWE-Bench Verified, i.e., all 500 questions, GPT-5 did not outperform Claude 4 Opus.
There is another twist, though. The scores above are "bash only", meaning they rely entirely on the model's own capabilities. In practice, users usually pair large models with AI IDEs such as Cursor, Codebuddy, or Trae. And that is where the question arises: among the models offered by AI IDEs, the "best" one, Claude 4 Opus, is very expensive and burns through tokens quickly. In other words, is GPT-5 currently the most cost-effective, readily available programming model?
Hands-on Test
Of course, scores only tell part of the story; we still need to try the models in practice.
We used GPT-5 in the Codebuddy environment to build a SWE-Bench Verified database query tool, using the annotations and scoring criteria downloaded from OpenAI's official website plus the dataset hosted on Hugging Face.
Prompt: Create a SWE-Bench Verified database query tool that makes it easy to look up which problems are in SWE-Bench Verified, the links to those problems, and the scoring criteria.
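For reference, the core lookup logic behind such a tool, stripped of any UI, can be sketched in a few lines of Python. The link construction is our own assumption: SWE-Bench instance IDs follow the pattern owner__repo-number, which we map to a GitHub pull-request URL; the dataset ID and column names are the same assumptions as in the earlier sketch.

```python
# Sketch of the query logic behind the tool described in the prompt above.
# Assumptions: same dataset ID/columns as before; instance IDs look like
# "django__django-11099", which we map to a GitHub pull-request URL.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

def problem_link(row):
    # "django__django-11099" -> "https://github.com/django/django/pull/11099"
    repo_part, number = row["instance_id"].rsplit("-", 1)
    return f"https://github.com/{repo_part.replace('__', '/')}/pull/{number}"

def query(keyword, repo=None):
    """Return (instance_id, link, one-line summary) for matching samples."""
    results = []
    for row in ds:
        if repo and row["repo"] != repo:
            continue
        if keyword.lower() in row["problem_statement"].lower():
            lines = row["problem_statement"].splitlines()
            results.append((row["instance_id"], problem_link(row), lines[0] if lines else ""))
    return results

for instance_id, link, summary in query("delete", repo="django/django")[:5]:
    print(instance_id, link, summary, sep="\n", end="\n\n")
```

The two models essentially had to wrap logic like this in a web front end; the comparison below is about how well each did that.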
GPT-5's generation process was fairly smooth, with no unrecoverable bugs. The first version only showed 11 projects, and after one round of follow-up prompting it was filled out to all 500.
Preview of the version created by GPT-5: http://4d916460ea034a90bd4e0c1dd25efc6b.ap-singapore.myide.io
We then used the same prompt to generate the project with Claude 4 Sonnet. The difference was obvious: Claude 4 Sonnet's first-pass success rate was lower than GPT-5's. For example, the common problem of the web page failing to render was only resolved after several more rounds of interaction with Claude.
Preview of the version created by Claude 4 Sonnet: http://7561fbea40ff4069a3c2c8ae367cd7ea.ap-singapore.myide.io
In terms of UI, since both used the MUI framework, the visual styles did not differ much. In terms of polish, however, the page built by Claude 4 Sonnet was clearly better: its responsive layout was more refined and stayed presentable across screen sizes, and its organization of external links was more sensible, with a clear split between the list of project issues and their details. The page built by GPT-5 not only "exposed" the underlying data source (Hugging Face), but its content layout was also somewhat chaotic.
In terms of functionality, GPT-5 did well on filtering: it exposed the full set of repository labels (10), versus Claude 4 Sonnet's 8. From an interaction standpoint, however, Claude 4 Sonnet's filtering was more intuitive and user-friendly, and it provided a dedicated filter entry on mobile, cutting down the number of steps.
To be more objective, we also had Gemini 2.5 Pro score the two projects. The verdict: the Claude 4 Sonnet project beat GPT-5's on almost every key dimension. The former is built around a modular architecture, splits components by function, and separates data from views through custom Hooks, making it more maintainable and readable; the latter uses a flat component structure with data logic tightly coupled to the UI, more like a quick prototype.
In the overall functional experience, Claude 4 Sonnet not only integrated search, view switching, and responsive layout, but also