GPT-5.5 is here, topping the leaderboards and crushing Opus 4.7. Tonight, OpenAI avenges its humiliation.
Silicon Valley won't sleep tonight!
Just now, GPT-5.5 made a stunning debut — the most powerful and versatile new-generation flagship model by OpenAI to date.
It represents a whole new level of intelligence, fully evolving into the 'native brain' of the Agent era.
Yes, this is the long-anticipated 'Spud', finally unveiled today.
What's most remarkable is that GPT-5.5 ranked first across the board in various benchmark tests!
Whether in programming, reasoning, mathematics, or agent tasks, GPT-5.5 comprehensively outperformed Claude Opus 4.7 and Gemini 3.1 Pro.
Compared with the previous generation, GPT-5.5 Thinking delivers a blow from a higher dimension, widening the generational gap.
In the AAI test, at the same output-token budget, GPT-5.5's intelligence index is the highest in the world; on ARC-AGI-2, it also set a new SOTA.
Altman couldn't help but praise, "GPT-5.5 is both smart and fast."
It matches GPT-5.4's per-token speed while using significantly fewer tokens per task.
It grasps what it needs to do almost instantly!
OpenAI President Greg Brockman said excitedly, "This is a step toward a whole new way of working with computers."
As of today, GPT-5.5 is officially launched on ChatGPT and Codex.
The new programming king has arrived, and Opus 4.7 has fallen from grace
Let's start with the core battleground: programming. GPT-5.5 has staged a remarkable comeback!
In OpenAI's words, it is the most powerful agent programming model to date.
The Terminal-Bench 2.0 test assesses the full-link Agent engineering capabilities.
The test gives the model a terminal environment and an open-ended goal, asking it to plan its approach, choose tools, write scripts, handle errors, and iterate on its own.
Here, GPT-5.5 scored 82.7%, GPT-5.4 scored 75.1%, and Claude Opus 4.7 only scored 69.4%. A 13-percentage-point gap, a crushing victory.
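The plan-act-observe loop that Terminal-Bench exercises can be sketched roughly like this. This is a minimal illustration, not the benchmark's actual harness: the `run` helper and the `plan_next_command` callback (standing in for a model call) are hypothetical.

```python
import subprocess

def run(cmd: str) -> tuple[int, str]:
    """Execute a shell command and return (exit code, combined output)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def agent_loop(goal: str, plan_next_command, max_steps: int = 20) -> bool:
    """Repeatedly ask the model for the next command, run it, and feed
    the result back in, until the model declares the goal done."""
    history = []
    for _ in range(max_steps):
        cmd = plan_next_command(goal, history)  # model call (stubbed here)
        if cmd == "DONE":
            return True
        code, output = run(cmd)
        history.append((cmd, code, output))  # errors feed the next iteration
    return False
```

The point of the benchmark is that errors are not terminal: a failed command's output lands in `history`, and the model must adjust its next step accordingly.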
In OpenAI's internal Expert-SWE evaluation, which specifically tests long-cycle programming tasks with an estimated median completion time of 20 hours by humans, GPT-5.5 scored 73.1%, also higher than GPT-5.4's 68.5%.
In the SWE-Bench Pro evaluation, widely seen in the industry as the most reflective of the ability to solve real GitHub issues, GPT-5.5 scored 58.6%, behind Claude Opus 4.7's 64.3%.
However, OpenAI marked an asterisk next to this figure, noting that "Anthropic reported signs of overfitting (memorization) on some subsets of problems."
In other words: Opus 4.7's score may be good, but OpenAI suspects it has memorized the answers.
A Codex researcher said bluntly: SWE-Bench can no longer measure top-level programming capabilities.
Crucially, in all three evaluations GPT-5.5 used fewer tokens yet still comprehensively outperformed GPT-5.4.
This edge is even more evident in Codex.
It can complete 'end-to-end' programming tasks, including implementation, refactoring, debugging, testing, and verification processes.
For example, let GPT-5.5 create a visualization application for the Artemis II space mission.
First, throw a screenshot of the mission to GPT-5.5, and then ask it to implement an interactive 3D orbit simulator using WebGL and Vite. The trajectory data must come from the real vector data of NASA/JPL Horizons, and it must also have realistic orbital mechanics.
Behold, GPT-5.5 built it from scratch. You can drag the mouse to rotate, and the relative positions of the Orion spacecraft, the moon, and the sun are all correct.
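Pulling real trajectory data out of NASA/JPL Horizons, as the prompt demands, might look like the sketch below. The Horizons API endpoint and parameter names are real, but this is an assumed approach, not GPT-5.5's actual code; '301' (the Moon) is used as a placeholder body ID, since a mission-specific spacecraft would have its own Horizons ID.

```python
from urllib.parse import urlencode

HORIZONS_API = "https://ssd.jpl.nasa.gov/api/horizons.api"

def horizons_vectors_url(command: str, start: str, stop: str,
                         step: str = "1 h", center: str = "500@399") -> str:
    """Build a JPL Horizons query URL for Cartesian state vectors.

    `command` is the Horizons body ID ('301' = Moon as a placeholder).
    `center` defaults to the geocenter ('500@399').
    """
    params = {
        "format": "json",
        "COMMAND": f"'{command}'",
        "EPHEM_TYPE": "VECTORS",
        "CENTER": f"'{center}'",
        "START_TIME": f"'{start}'",
        "STOP_TIME": f"'{stop}'",
        "STEP_SIZE": f"'{step}'",
    }
    return f"{HORIZONS_API}?{urlencode(params)}"
```

Fetching that URL returns JSON whose `result` field contains position and velocity vectors at each time step, which a WebGL front end can then interpolate into an orbit path.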
Let's try another example: a tank shooting at flying saucers.
The prompt asks for a UFO shooting game in Three.js: the player controls a tank and shoots down flying saucers overhead; it should be 'low-poly but good-looking'; first output the complete file structure and the list of files to modify, then write all the code, and 'don't stop until it's done'.
GPT-5.5 executed all the requirements. From the file structure to Three.js rendering to shooting judgment, it delivered a playable 3D game in one go.
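The 'shooting judgment' such a game needs boils down to ray-sphere intersection: does the shot fired from the tank's barrel intersect a saucer's bounding sphere? A minimal, engine-independent sketch (in Python rather than the Three.js the prompt calls for):

```python
import math

def ray_hits_sphere(origin, direction, center, radius) -> bool:
    """Return True if a ray from `origin` along `direction` intersects a
    sphere at `center` with `radius`: the standard quadratic discriminant
    test, restricted to hits in front of the shooter (t >= 0)."""
    # Offset of the ray origin relative to the sphere center.
    ox, oy, oz = (origin[i] - center[i] for i in range(3))
    # Normalize the direction so the quadratic's leading coefficient is 1.
    dlen = math.sqrt(sum(d * d for d in direction))
    dx, dy, dz = (d / dlen for d in direction)
    b = 2 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4 * c
    if disc < 0:
        return False  # ray line misses the sphere entirely
    root = math.sqrt(disc)
    # At least one intersection must lie ahead of the origin.
    return (-b - root) / 2 >= 0 or (-b + root) / 2 >= 0
```

Three.js ships the same logic as `Raycaster`, but the math above is what any 'hit or miss' call reduces to.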
In the 3D dungeon arena, Codex took care of the game architecture, TypeScript/Three.js implementation, combat system, enemy encounters, and HUD feedback.
GPT generated the environment textures, the OpenAI API generated the character dialogue, and the character models, textures, and animations came from third-party asset tools. Several AIs each handled their own piece and stitched together a game where you can fight monsters.
Early testers said bluntly that GPT-5.5 has a stronger grasp of a system's overall shape.
It is better at determining where a problem lies, where to make the fix, and which other parts of the codebase will be affected.
85% of OpenAI employees are going crazy for it: this is AI that actually works
Beyond programming, GPT-5.5's data in 'knowledge-based work' is also impressive.
After all, OpenAI calls it 'a new type of intelligence for real work'.
It grasps what you are trying to do more quickly and switches between tools until the task is complete.
In the GDPval evaluation, which assesses the level of AI in completing standardized knowledge work in 44 occupations, GPT-5.5 scored 84.9%, Opus 4.7 scored 80.3%, and Gemini 3.1 Pro only scored 67.3%.
In the OSWorld-Verified test, which checks whether the model can independently operate in a real computer environment, GPT-5.5 scored 78.7%, almost tying with Opus 4.7's 78.0%.
In the Tau2-bench test, which assesses the model's ability to handle multi-turn conversations, query systems, and perform operations in a complex customer-service workflow, GPT-5.5 reached 98.0% without any prompt tuning.
Interestingly, let's see how OpenAI uses it. According to the official blog, more than 85% of the company's employees use Codex across departments every week.
The public relations department used GPT-5.5 to analyze six months of speech invitation data, built a scoring and risk framework, and let low-risk requests be automatically handled by the Slack AI agent.
The finance department reviewed 24,771 K-1 tax forms, a total of 71,637 pages, and completed the task two weeks earlier than last year.
The marketing team achieved automatic generation of weekly business reports, saving 5 to 10 hours per week.
Now, in Codex, GPT-5.5 can interact with web applications directly: testing flows, clicking through pages, taking screenshots, and iterating on what it sees until the task is done.
Here is an example of testing the onboarding process.