In the early morning, GPT-5.5 recaptured lost ground, and Anthropic took urgent action.
According to a report by Zhidx on April 24th, early this morning, OpenAI officially released its intelligent agent programming model, GPT-5.5!
GPT-5.5 can understand users' goals more quickly and is proficient in writing and debugging code, conducting online research, analyzing data, creating documents and spreadsheets, and collaborating across multiple tools.
▲ OpenAI's official tweet (Link: https://x.com/OpenAI/status/2047376561205325845)
The OpenAI team described it as "the smartest and most intuitive model we've ever developed, and an important step towards a new way of getting work done on computers."
Sam Altman himself commented on the model: "In my experience, it 'knows what to do'."
In terms of performance, the improvements of GPT-5.5 are particularly significant in areas such as agent programming, computer usage, knowledge work, and early scientific research – progress in these areas relies on cross-context reasoning and continuous autonomous action.
In programming ability, GPT-5.5 outperforms Gemini 3.1 Pro across the board. In professional tasks, computer use and vision, tool use, and abstract reasoning, it scores higher than Claude Opus 4.7 and Gemini 3.1 Pro on most test sets.
In academic and tool-use benchmarks, however, the gap between GPT-5.5 and Claude Opus 4.7 or Gemini 3.1 Pro is not significant.
In terms of speed, GPT-5.5 maintains per-token latency similar to GPT-5.4 in production while reaching a higher level of intelligence. On the same Codex tasks, GPT-5.5 uses significantly fewer tokens, making it more efficient.
As soon as the model was released, many netizens who participated in the internal testing shared their experiences.
Pietro Schirano, the creator of the open-source project Claude Engineer and the CEO of the AI design assistant MagicPath, shared that GPT-5.5 took only about 20 minutes to automatically compare the code differences between two versions of his project, create a new branch based on the official version, and perfectly merge all the changes from other branches.
He also used GPT-5.5 to generate a playable 3D shooting game in one go. The game plays smoothly, and every graphic was generated from scratch with Three.js.
In addition, Pietro Schirano used GPT-5.5 to build applications for his Flipper Zero and successfully push them to the device over USB.
Pietro Schirano marveled: "GPT-5.5 is the most powerful tool I've ever used. For the first time, I feel that I'm no longer limited by the model's capabilities, but only by my imagination. Training workflows, impossible optimizations, and hardware experiments via USB. The era of vibe hardware has begun."
AI engineer Peter Gostev spent extended time with GPT-5.5 and shared several examples of working with it. He said users can give GPT-5.5 step-by-step prompts and it will work through the task accordingly; in his own testing, it ran autonomously and stably for at least 7 hours.
Peter Gostev asked GPT-5.5 to create a toy railway set in London with landmarks and seasonal changes, and the model completed the task excellently in one go. He found that compared with GPT-5.4, GPT-5.5's output is more ambitious in concept, more coherent in logic, and contains fewer errors.
Bartosz Naskręcki, an assistant professor in the Department of Mathematics at the Adam Mickiewicz University in Poznań, Poland, used GPT-5.5 in Codex to build an algebraic geometry application in 11 minutes with just one prompt. This application can visualize the intersection lines of quadratic surfaces and convert the resulting curves into the Weierstrass model.
Subsequently, he expanded the application by adding a more stable singularity visualization function and precise coefficients that can be reused in subsequent work.
Well-known AI evaluation influencer Matthew Berman said he has been testing GPT-5.5 for the past two weeks. He felt OpenAI has improved the model's personality, which he believes is aimed at capturing more of the personal-agent market (products such as OpenClaw). "Its responses are more concise, more human-like, and less formal. It really has its own personality."
In terms of price, GPT-5.5 is priced at $5 per million input tokens and $30 per million output tokens, with a one-million-token context window, double the price of GPT-5.4.
GPT-5.5 Pro is priced at $30 per million input tokens and $180 per million output tokens.
Compared with Anthropic's pricing, GPT-5.5 costs almost the same as Claude Opus 4.7, at $5 more per million output tokens.
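Taking the listed rates at face value, a request's cost is simply tokens × per-million rate. A minimal sketch (prices from the announcement; the token counts are hypothetical):

```python
# Per-million-token prices quoted in the announcement (USD).
PRICES = {
    "gpt-5.5":     {"input": 5.0,  "output": 30.0},
    "gpt-5.5-pro": {"input": 30.0, "output": 180.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the quoted per-million-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 50k input tokens, 10k output tokens.
print(request_cost("gpt-5.5", 50_000, 10_000))      # 0.55
print(request_cost("gpt-5.5-pro", 50_000, 10_000))  # 3.3
```

At these rates, output tokens dominate the bill, which is why the token-efficiency gains over GPT-5.4 mentioned above matter for cost as well as speed.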
Today, GPT-5.5 is being gradually rolled out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, while GPT-5.5 Pro is being rolled out to Pro, Business, and Enterprise users in ChatGPT.
In ChatGPT, GPT-5.5 Thinking is available to Plus, Pro, Business, and Enterprise users. For API developers, gpt-5.5 will soon be available in the Responses API and Chat Completions API.
GPT-5.5 arrived just as Claude Code was fielding complaints about deteriorating performance. Perhaps feeling the pressure, Anthropic published a long post today announcing that it has fixed the intelligence-degradation issue and reset usage limits for all subscribed users as of today.
01 Ranking First in Agentic Coding, with Half the Cost of Competitors
The OpenAI team said that GPT-5.5 is the most powerful Agentic Coding model OpenAI has ever developed.
The Artificial Analysis Intelligence Index is a weighted average of 10 evaluations run by a third-party organization, including: AA-LCR, AA-Omniscience, CritPt, GDPval-AA, GPQA Diamond, Humanity’s Last Exam, IFBench, SciCode, Terminal-Bench Hard, and τ²-Bench Telecom.
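As a weighted average, the index combines the per-benchmark scores using per-benchmark weights. The actual weights are not public; the sketch below is illustrative only, with made-up scores on three of the ten benchmarks and equal weights:

```python
# Illustrative only: the real Artificial Analysis weights are not public.
def weighted_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores, normalized by total weight."""
    total_weight = sum(weights.values())
    return sum(scores[b] * weights[b] for b in scores) / total_weight

# Hypothetical scores on three of the ten benchmarks, weighted equally.
scores  = {"Terminal-Bench Hard": 60.0, "GPQA Diamond": 85.0, "SciCode": 45.0}
weights = {"Terminal-Bench Hard": 1.0,  "GPQA Diamond": 1.0,  "SciCode": 1.0}
print(weighted_index(scores, weights))  # ≈ 63.33
```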
The official post of Artificial Analysis stated that OpenAI's GPT-5.5 (xhigh) leads in Terminal-Bench Hard, GDPval-AA, and APEX-Agents-AA. The model only lags behind other OpenAI models in CritPt and AA-LCR and ranks second only to Gemini 3.1 Pro Preview in three other evaluations, with an overall first-place ranking.
According to the Artificial Analysis Intelligence Index, while GPT-5.5 has the highest score, its cost is only half that of similar cutting-edge coding models.
In the complex-execution test Terminal-Bench 2.0, GPT-5.5 scored 82.7%. In the real-world problem-solving test SWE-Bench Pro, it scored 58.6%, solving more tasks end-to-end in a single pass than previous models. In the internal long-horizon task test Expert-SWE, GPT-5.5 also outperformed GPT-5.4.
In all three of these evaluations, GPT-5.5 used fewer tokens than GPT-5.4 while achieving higher scores.
The model's advantages in programming are particularly prominent in Codex. In Codex, GPT-5.5 can handle a series of engineering tasks, from implementation and refactoring to debugging, testing, and verification.
Early tests show that GPT-5.5 is better at tasks such as maintaining context understanding in large systems, reasoning about ambiguous faults, verifying hypotheses through tools, and synchronizing changes across the entire relevant codebase.
For example, GPT-5.5 can use vector data for the Orion spacecraft, the Moon, and the Sun from NASA/JPL Horizons to render their trajectories, with zoomable display.
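A demo like this would draw on the public JPL Horizons API, which serves Cartesian state vectors for solar-system bodies. A minimal sketch of the query side (the endpoint and parameter names are the real Horizons API; the body ID and time range are illustrative, and actually fetching the URL assumes network access):

```python
from urllib.parse import urlencode

# Public JPL Horizons API endpoint; parameter values below are illustrative.
HORIZONS = "https://ssd.jpl.nasa.gov/api/horizons.api"

def horizons_vectors_url(command: str, start: str, stop: str,
                         step: str = "1 h") -> str:
    """Build a Horizons query URL for Cartesian state vectors of one body.

    `command` is the Horizons body ID, e.g. "301" for the Moon, "10" for the Sun.
    """
    params = {
        "format": "json",
        "COMMAND": f"'{command}'",
        "EPHEM_TYPE": "VECTORS",   # Cartesian position/velocity vectors
        "CENTER": "'500@399'",     # geocentric origin
        "START_TIME": f"'{start}'",
        "STOP_TIME": f"'{stop}'",
        "STEP_SIZE": f"'{step}'",
    }
    return f"{HORIZONS}?{urlencode(params)}"

url = horizons_vectors_url("301", "2025-12-01", "2025-12-02")
print(url)
```

Fetching the resulting URL (e.g. with `urllib.request`) returns JSON whose `result` field contains the vector table; a front end such as Three.js would then plot the X/Y/Z positions over time.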
GPT-5.5 can also build a tracking website that dynamically displays earthquake frequency, locations, and other information.
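An earthquake tracker like this would typically poll the USGS live GeoJSON feeds. A minimal sketch of the parsing step (the feed URL and field layout are the real USGS format; the sample event below is made up):

```python
import json

# USGS publishes live GeoJSON feeds, e.g.:
# https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson
# The event below is made up, in that feed's format.
SAMPLE = """{
  "features": [
    {
      "properties": {"mag": 4.2, "place": "10 km S of Example Town",
                     "time": 1700000000000},
      "geometry": {"type": "Point", "coordinates": [-122.5, 37.8, 8.0]}
    }
  ]
}"""

def parse_quakes(feed_json: str) -> list[dict]:
    """Extract magnitude, place name, and lon/lat from a USGS GeoJSON feed."""
    events = []
    for feature in json.loads(feed_json)["features"]:
        lon, lat = feature["geometry"]["coordinates"][:2]
        events.append({
            "mag": feature["properties"]["mag"],
            "place": feature["properties"]["place"],
            "lon": lon,
            "lat": lat,
        })
    return events

print(parse_quakes(SAMPLE))
```

A site like the one described would fetch the feed on a timer and plot each event's lon/lat and magnitude on a map.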
With Codex, users can use GPT-5.5 to create playable 3D games.
02 Achieving a 98% Score in Customer Service Tests, Capable of Autonomous Interface Browsing and Tool Operation
Because GPT-5.5 better understands users' intentions, it can more naturally complete the full loop of knowledge work: finding information, grasping the key points, using tools, checking outputs, and turning raw material into useful results.
In ChatGPT, GPT-5.5 Thinking performs excellently in professional tasks such as programming, research, information synthesis and analysis, and document-heavy work.
On benchmarks, GPT-5.5 scored 84.9% on the standardized knowledge-work test GDPval, 78.7% on the real computer-operation test OSWorld-Verified, and 98.0% on the customer-service test Tau2-bench Telecom without prompt optimization.