Understanding GPT-5.5 in One Article: From Today, OpenAI "Stops Selling" Tokens
On April 23 local time, OpenAI officially released its new flagship model, GPT-5.5, positioning it as "a new level of intelligence for real-world work" and an important step toward a new way of working with computers.
There are two key points in this release:
First, an efficiency breakthrough: the model has grown larger, yet at the same latency it is no slower. GPT-5.5's context window reaches 1 million tokens. But this is not a simple capability upgrade over GPT-5.4; the point is that it delivers higher intelligence at the same latency.
Second, during training, GPT-5.5 participated in optimizing its own inference infrastructure. In short, for the first time, an AI has learned to tune the parameters of its own serving systems.
On Terminal-Bench 2.0, which tests complex command-line workflows, GPT-5.5 scored 82.7%, beating Claude Opus 4.7's 69.4% by more than 13 percentage points. On OSWorld-Verified, which tests whether an AI can independently operate a real computer, GPT-5.5's success rate was 78.7%, surpassing the human baseline. On GDPval, which tests knowledge work across 44 professions, it matched or exceeded industry experts on 84.9% of tasks.
However, GPT-5.5's price has also risen sharply.
API pricing is $5 per million input tokens and $30 per million output tokens, double GPT-5.4's rates ($2.50 per million input tokens and $15 per million output tokens). The company stresses, however, that GPT-5.5 needs significantly fewer tokens to complete the same task, so overall cost may not rise much. GPT-5.5 Pro is priced at $30 per million input tokens and $180 per million output tokens. Batch and flex processing get a 50% discount, and priority processing costs 2.5 times the standard rate.
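At these list prices, the cost math is easy to check. A minimal sketch (the per-task token counts below are hypothetical, chosen only to illustrate the official claim that needing fewer tokens can offset the doubled rate):

```python
# Per-million-token API rates as published in the release.
PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task at the listed API rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical scenario: GPT-5.5 finishes the same task in half the tokens.
old = task_cost("gpt-5.4", 40_000, 12_000)
new = task_cost("gpt-5.5", 20_000, 6_000)
print(f"GPT-5.4: ${old:.2f}  GPT-5.5: ${new:.2f}")  # both come out to $0.28
```

In this illustrative case the two costs are identical: halving token usage exactly cancels the doubled price, which is the break-even point behind OpenAI's "overall cost may not rise much" claim.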
In ChatGPT, GPT-5.5 launches as "GPT-5.5 Thinking" and will gradually replace earlier versions.
One small new design touch: before the model starts thinking, it first gives an overview of its reasoning plan, and users can interrupt at any point during execution to redirect it.
To summarize GPT-5.5's significance in one sentence: past models were collections of capabilities, while GPT-5.5 is closer to a work system that can plan, check, and keep moving forward.
01 84.9% of tasks reach the level of professionals
Chart: Comparison of GPT-5.5 with various competitors in core benchmark tests such as Terminal-Bench 2.0, GDPval, and OSWorld-Verified
Let's first look at the model's performance in real professional scenarios. OpenAI used a benchmark called GDPval, which asks the model to complete sets of professional tasks across 44 occupational scenarios, including financial modeling, legal analysis, data-science reporting, and operational planning.
The results: GPT-5.5 matched or exceeded industry professionals on 84.9% of tasks, versus 83.0% for GPT-5.4, 80.3% for Claude Opus 4.7, and only 67.3% for Gemini 3.1 Pro.
The gap is not only in the headline score. On the spreadsheet-modeling task, GPT-5.5 scored 88.5% in internal tests, and it also led the previous generation on investment-banking-grade modeling tasks. Early testers' feedback was consistent: GPT-5.5 Pro's responses are markedly more comprehensive, better structured, and more practical than GPT-5.4 Pro's, especially in business, law, education, and data science.
It's easy to go numb looking at numbers. So this time, OpenAI simply showed off its own workplace.
OpenAI says more than 85% of its employees use Codex every week, across finance, communications, marketing, product, data science, and other departments. The communications team used it to analyze six months of speaking-invitation data and set up an automated triage workflow. The finance team used it to review 24,771 K-1 tax forms totaling 71,637 pages, finishing two weeks earlier than last year. The market-expansion team saved 5 to 10 hours per person per week through automated weekly-report generation.
This is no laboratory demo; it has become part of daily work.
02 The most powerful autonomous programming model
OpenAI claims that GPT-5.5 is currently its most powerful autonomous programming model.
On Terminal-Bench 2.0 (complex command-line workflows requiring planning, iteration, and tool coordination), GPT-5.5 scored 82.7%, nearly 8 percentage points above GPT-5.4's 75.1%, while consuming fewer tokens. On SWE-Bench Pro (one-shot resolution of real GitHub issues), it scored 58.6%. On the internal Expert-SWE evaluation (long-horizon coding tasks with a median manual completion time of about 20 hours), GPT-5.5 also outperformed GPT-5.4.
Illustration: Scatter plots of Terminal-Bench 2.0 and Expert-SWE
Driven by GPT-5.5, Codex can now independently complete the entire development flow from a single prompt: code generation, functional testing, and visual debugging.
OpenAI's official demos show a space-mission application built on real NASA orbital data, with 3D interactive manipulation and orbital-mechanics simulation at real physical accuracy, and an earthquake tracker that connects to live data sources and completes the visualization, indicating the model can call external APIs, process dynamic data, and render in real time.
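The earthquake-tracker demo implies a standard pipeline: fetch a live GeoJSON feed, pull out magnitude and coordinates, then render. A minimal sketch of the parsing step, with a sample payload inlined rather than fetched live (the field names follow the public USGS earthquake feed schema; the sample data and the `significant_quakes` helper are illustrative, not OpenAI's code):

```python
import json

# Inlined stand-in for a USGS-style GeoJSON earthquake feed.
# A real tracker would fetch this from the live feed URL instead.
SAMPLE_FEED = json.dumps({
    "features": [
        {"properties": {"mag": 4.6, "place": "100 km W of Example"},
         "geometry": {"coordinates": [-122.1, 37.4, 8.2]}},
        {"properties": {"mag": 2.1, "place": "5 km N of Elsewhere"},
         "geometry": {"coordinates": [139.7, 35.7, 40.0]}},
    ]
})

def significant_quakes(feed_json: str, min_mag: float = 3.0) -> list[dict]:
    """Parse a GeoJSON feed and keep events at or above min_mag."""
    feed = json.loads(feed_json)
    out = []
    for feat in feed["features"]:
        # USGS coordinates are ordered [longitude, latitude, depth_km].
        lon, lat, depth_km = feat["geometry"]["coordinates"]
        if feat["properties"]["mag"] >= min_mag:
            out.append({"mag": feat["properties"]["mag"],
                        "place": feat["properties"]["place"],
                        "lat": lat, "lon": lon, "depth_km": depth_km})
    return out

print(significant_quakes(SAMPLE_FEED))  # keeps only the magnitude-4.6 event
```

The point is not the plumbing itself but that the model wires this end to end: picking a data source, handling its schema, and feeding the result into a live visualization without human glue code.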
On the user side, Dan Shipper, founder and CEO of Every, shared an experience: after a product launch he hit a bug and spent days failing to fix it, finally asking the company's best engineer to rewrite part of the system. When GPT-5.5 came out, he ran an experiment: he rolled the codebase back to its pre-fix state to see whether the model could reach the same solution as the engineer. GPT-5.4 couldn't; GPT-5.5 could. His verdict: "This is the first programming model I've used that truly has conceptual clarity."
A more straightforward comment came from an NVIDIA engineer: "Losing access to GPT-5.5 feels like an amputation."
Michael Truell, co-founder and CEO of Cursor, added that GPT-5.5 is smarter and more persistent than GPT-5.4, staying with complex long-horizon tasks instead of stopping prematurely, which is exactly what engineering work needs most.
03 Knowledge work: For the first time, AI can truly "use" a computer
On the OSWorld-Verified test (which checks whether a model can independently operate a real computer environment), GPT-5.5's success rate was 78.7%, above GPT-5.4's 75.0% and ahead of Claude Opus 4.7's 78.0%.
This is not mere screenshot analysis but real screen operation: seeing the interface, clicking, typing, and switching between multiple tools until the task is done. GPT-5.5 is the first model that feels like it can genuinely share a computer with you.
Demonstration video of financial modeling
On Tau2-bench, which tests telecom customer-service workflows, GPT-5.5 reached 98.0% accuracy without prompt optimization, versus 92.8% for GPT-5.4.
That means the model understands task intent deeply enough to handle complex multi-step dialogue flows without carefully engineered prompts.
On tool-assisted search, GPT-5.5 scored 84.4% on BrowseComp, and GPT-5.5 Pro even higher at 90.1%. In research tasks that require synthesizing across multiple sources, the model shows strong sustained retrieval and information-integration abilities.
04 Scientific research: Assisting in discovering new mathematical proofs
In this release, the performance of GPT-5.5 in the scientific research field may be the most unexpected part.
In the past, AI in scientific research mostly meant an auxiliary tool for literature search, code writing, and data organization. This time its role has clearly moved upstream, into more core work: complex reasoning, and even discovery itself.
On GeneBench (a multi-stage data-analysis evaluation in genetics and quantitative biology), GPT-5.5 scored 25.0% versus GPT-5.4's 19.0%. These tasks typically correspond to several days of work by scientific experts: the model must reason about potentially flawed data, handle hidden confounders, and correctly apply modern statistical methods with almost no supervision.
The chart's curves show that as output tokens grow, GPT-5.5's score climbs consistently faster than GPT-5.4's, with a clear gap around 15,000 tokens. On long tasks that demand deep reasoning, its advantage widens as complexity increases.
On BixBench (a real-world bioinformatics and data-analysis benchmark), GPT-5.5 led with 80.5% versus GPT-5.4's 74.0%, placing it at the top among released models.
What really drew attention was one specific case: an internal version of GPT-5.5, equipped with a custom tool framework, assisted in discovering a new mathematical proof about Ramsey numbers, verified in the formal proof assistant Lean. Ramsey numbers are a core object of study in combinatorics, and results in this field are rare and technically demanding. This is not AI supplying code or explanations; it is AI contributing a mathematical argument.
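OpenAI has not published the proof itself, but a toy example shows what a machine-checkable Ramsey statement looks like: the classical fact R(3,3) = 6, verified here by brute force over edge 2-colorings (the helper functions are my own illustration, far simpler than anything a frontier model would prove):

```python
from itertools import combinations, product

def has_mono_triangle(n: int, coloring: dict) -> bool:
    """coloring maps each edge (i, j) with i < j of K_n to color 0 or 1."""
    return any(
        coloring[(a, b)] == coloring[(b, c)] == coloring[(a, c)]
        for a, b, c in combinations(range(n), 3)
    )

def every_coloring_has_mono_triangle(n: int) -> bool:
    """Check all 2^C(n,2) edge 2-colorings of the complete graph K_n."""
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(n, dict(zip(edges, colors)))
        for colors in product((0, 1), repeat=len(edges))
    )

# K5 admits a triangle-free 2-coloring (the pentagon/pentagram split),
# so R(3,3) > 5; every 2-coloring of K6 forces a monochromatic triangle,
# so R(3,3) <= 6. Together: R(3,3) = 6.
print(every_coloring_has_mono_triangle(5))  # False
print(every_coloring_has_mono_triangle(6))  # True
```

A proof assistant like Lean plays the same role at research scale: the argument is only accepted once it checks mechanically, which is why the Ramsey result being "verified in Lean" matters.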
The practical applications are convincing too. Derya Unutmaz, a professor of immunology at the Jackson Laboratory, used GPT-5.5 Pro to analyze a gene-expression dataset of 62 samples and nearly 28,000 genes, generating a detailed research report with key findings and follow-up research questions. Work of this kind, he said, usually takes a team several months.
Bartosz Naskręcki, an assistant professor in the Department of Mathematics at Adam Mickiewicz University in Poznań, used GPT-5.5 in Codex to build an algebraic-geometry application from a single prompt in 11 minutes. It visualizes the intersection curve of two quadric surfaces and transforms the resulting curve into a Weierstrass model; the equation coefficients displayed in real time on the right can feed directly into subsequent mathematical research. The model completed the whole process, from prompt to runnable research tool, on its own.
Illustration: Screenshot of the algebraic geometry application built by Professor Bartosz Naskręcki - the visualization of the intersection of quadratic surfaces and the real-time calculation interface of the Weierstrass equation
Brandon White, the co-founder of Axiom Bio, was more direct in his evaluation: "If OpenAI maintains this momentum, the foundation of drug discovery will change by the end of the year."
05 Reasoning efficiency: For the first time, AI optimized its own infrastructure
There is a detail in this release that is easy to overlook but may be the most noteworthy progress at the technical level.
GPT-