
Just now, Altman interrupted the event to announce GPT-5.4. Netizens: Saying "Hi" costs $80.

爱范儿 (ifanr) 2026-03-06 08:54
One model, all tasks covered.

Every time you open an AI tool, you probably need to think for a second: which model should I use for this task? One for writing code, another for researching information, and if you want the AI to operate your computer, you have to open another window.

After today, there's finally an answer to this fragmented experience.

Just now, OpenAI officially released GPT-5.4, integrating programming, reasoning, computer operation, web search, and a context of one million tokens into the same model, without sacrificing the capabilities of any single aspect for the sake of integration.

OpenAI CEO Sam Altman also posted a short tweet on the X platform, highlighting five directions: stronger knowledge work capabilities, better web search, native computer operation, support for a one-million-token context, and the ability to intervene at any time during the response process.

In just a few words, these exactly correspond to the five most prominent pain points in the implementation of AI applications over the past two years.

Knowledge Work: Eight out of ten times, AI outperforms professionals

To understand the progress of GPT-5.4 in knowledge work, you need to first understand the design logic of the GDPval benchmark.

It covers 9 industries and 44 occupations that contribute the most to the US GDP. The tasks are real-world work that happens every day in the workplace: writing financial models for investment banks, scheduling emergency shifts for hospitals, and creating presentations for sales teams.

After the tasks are completed, the outputs are blindly evaluated by real practitioners in each industry, to see in what fraction of comparisons the AI's work matches or beats that of a human counterpart.

The result for GPT-5.4 is 83.0%, meaning that in more than eight out of ten comparisons, industry professionals judged the AI's output to meet or exceed the level of a human counterpart. The previous version, GPT-5.2, stood at 70.9%, a gap of roughly 12 percentage points.

The progress is most evident in spreadsheet modeling. On tasks simulating the modeling work of junior investment-bank analysts, GPT-5.4 averages 87.3%, versus 68.4% for GPT-5.2 and 79.3% for GPT-5.3-Codex, putting it nearly 19 percentage points ahead of GPT-5.2.

Results on BigLaw Bench, run by the legal platform Harvey, are also impressive: GPT-5.4 scores 91%, and it also takes first place on APEX-Agents, the benchmark from the professional-services evaluation platform Mercor.

Accuracy is also worth noting. Hallucination has long been the biggest obstacle to AI entering professional settings; every percentage point of reduction means it can be used with more confidence in more scenarios.

Data shows that compared with GPT-5.2, GPT-5.4's per-statement error rate is 33% lower, and the probability that a complete response contains an error is 18% lower.

Programming: One model to write and test code

GPT-5.4 integrates the programming capabilities of GPT-5.3-Codex into the mainline. For developers, this means you no longer need to open a separate model just for writing code, and the programming capabilities themselves are not compromised.

SWE-Bench Pro tests real software-engineering tasks. GPT-5.4 scores 57.7%, GPT-5.3-Codex 56.8%, and GPT-5.2 55.6%. After the merge, the programming score actually rises rather than falls, and the model gains a set of general capabilities such as computer operation along the way, with almost no obvious weaknesses.

Well-known AI evaluation blogger Dan Shipper wrote after trying it: "This is the most outstanding planning ability we've seen from OpenAI recently. The code review is also very strong, and the cost is about half of Opus."

He pointed out two specific dimensions. First, planning ability is key to succeeding at long-horizon tasks, and GPT-5.4 is noticeably more organized in decomposing work and making steady progress. Second, at roughly half the cost of Claude Opus, the difference will show up clearly on the bill for developers making large-scale API calls.

With /fast mode enabled in Codex, GPT-5.4's token generation speed increases by up to 1.5×, keeping coding, iteration, and debugging feeling fluid.

Meanwhile, the newly launched experimental feature, Playwright Interactive, takes the programming experience of GPT-5.4 a step further.

When building a Web or Electron application, GPT-5.4 can perform real-time debugging through a visual browser. The model can write code and test the application it is building simultaneously, taking on the roles of both a developer and a tester.

OpenAI demonstrated a typical case: from a single lightweight prompt, GPT-5.4 generated a complete isometric theme-park simulation game, including a tile-based system for laying paths and building attractions, AI pathfinding and queuing behavior for visitors, and a live score panel tracking four indicators: funds, visitor count, satisfaction, and cleanliness.

Playwright Interactive carried out multiple rounds of automated testing throughout the process, verifying the correctness of path-laying, camera navigation, tourist responses, and UI indicators. The model completed the entire process from writing code to testing and acceptance by itself.

Blogger Angel also used GPT-5.4 to write a Minecraft clone. The model took about 24 minutes, and the game ran smoothly without glitches. He wrote in his tweet, "Minecraft is basically conquered. I need to find a new test now."

Wharton School professor Ethan Mollick also got early access. He used the same prompt to let GPT-5.4 Pro generate a three-dimensional space scene inspired by "Piranesi". There were no errors throughout the process, and he only added an additional instruction, "Make it better." He then placed the result side by side with the version generated by GPT-4 two years ago, and the difference was obvious at a glance.

It can now operate the computer better than you

This is the most notable change in the GPT-5.4 release. Previously, OpenAI's computer-operation capability lived in a separate module, clearly split off from the model's language understanding and code generation.

The two systems operated independently, and information had to be transferred back and forth, which naturally reduced efficiency. Now, this separation is gone. When GPT-5.4 operates the computer, it uses the model's own reasoning ability without any detours.

This is also OpenAI's first product to natively integrate computer use ability into a general model. In the future, when talking about AI Agents, this will surely be a new starting point.

On OSWorld-Verified, a desktop-navigation benchmark in which real operating-system tasks are completed through screenshots and mouse-and-keyboard interaction, GPT-5.4 reaches a 75.0% success rate; the human baseline is 72.4%, and GPT-5.2 scored 47.3%.

In short, it not only catches up with but also surpasses humans.

In the Online-Mind2Web benchmark, which tests browser operation from screenshots alone, GPT-5.4 reaches 92.8%, while the comparison point, ChatGPT Atlas's Agent Mode, scores 70.9%.

Real deployment cases are more telling. Mainstay uses GPT-5.4 for automatic form filling across about 30,000 property-tax portals. The first-attempt success rate reaches 95%, and the success rate within three attempts is 100%, where comparable earlier models sat between 73% and 79%. Sessions complete about three times faster, and token consumption drops by about 70%.

This improvement is inseparable from the enhancement of visual perception ability. Operating a computer ultimately requires "seeing clearly" - seeing what's on the interface, where the buttons are, and whether the clicks are accurate.

GPT-5.4 makes targeted enhancements here, introducing an original-image input mode that supports high-fidelity inputs of up to 10.24 megapixels or a maximum side length of 6,000 pixels. The ceiling of the existing high-definition mode has also been raised to 2.56 megapixels or a maximum side length of 2,048 pixels.
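To make the two pixel budgets concrete, here is a small illustrative fit check. The limits come from the article; the function itself is a sketch, not any official API:

```python
def fits_mode(width: int, height: int, max_pixels: int, max_side: int) -> bool:
    """Check whether an image fits within a mode's pixel-count and side-length caps."""
    return width * height <= max_pixels and max(width, height) <= max_side

ORIGINAL = (10_240_000, 6000)  # original-image mode: 10.24 MP, 6000 px max side
HIGH_DEF = (2_560_000, 2048)   # high-definition mode: 2.56 MP, 2048 px max side

# A 3200x3200 screenshot is exactly 10.24 MP, so it fits original-image mode
# but exceeds both caps of high-definition mode:
fits_mode(3200, 3200, *ORIGINAL)  # True
fits_mode(3200, 3200, *HIGH_DEF)  # False
```

Note that the two caps are independent: a 6,200 × 1,000 panorama is well under 10.24 MP but still fails the 6,000-pixel side limit.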

Tool Invocation and Web Search: Persistence is the core competitiveness

Behind a complex AI Agent system there may be dozens of MCP tools. The old approach was to stuff every tool description into the model at the start of each conversation, whether needed or not, paying the token cost up front.

GPT-5.4 takes a different approach, introducing a tool-search mechanism: the model starts with a simple list of tool names, retrieves a tool's detailed description only when that tool is actually needed, and caches tools it has already used so they never have to be retrieved again.

In a test of 250 tasks with a full configuration of 36 MCP servers, tool-search mode cut total token consumption by 47% with accuracy completely unchanged: nearly half the cost saved, with no loss in precision.
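The pattern is simple enough to sketch. Below is a minimal illustrative registry, names chosen for the example; real MCP servers expose richer schemas, but the lazy-fetch-plus-cache idea is the same:

```python
class ToolRegistry:
    """Serve tool names cheaply; fetch full descriptions lazily and cache them."""

    def __init__(self, descriptions: dict[str, str]):
        self._descriptions = descriptions   # full specs held server-side
        self._cache: dict[str, str] = {}    # specs already sent to the model
        self.fetches = 0                    # how many detail retrievals occurred

    def list_tools(self) -> list[str]:
        # Cheap upfront listing: names only, none of the token-heavy specs.
        return sorted(self._descriptions)

    def describe(self, name: str) -> str:
        # Retrieve a tool's full description on demand; repeats hit the cache.
        if name not in self._cache:
            self.fetches += 1
            self._cache[name] = self._descriptions[name]
        return self._cache[name]

registry = ToolRegistry({
    "search_web": "search_web(query) -> results ...",
    "read_file": "read_file(path) -> contents ...",
})
registry.describe("read_file")
registry.describe("read_file")  # second call served from cache: fetches stays at 1
```

The token saving comes from the asymmetry: dozens of unused specs never cross the wire, while the few tools a task actually needs are paid for once.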

In terms of web search, GPT-5.4 scores 82.7% on the BrowseComp benchmark, 17 percentage points higher than GPT-5.2's 65.8%. The Pro version even reaches 89.3%, setting the highest score in the industry. The CEO of Zapier commented that GPT-5.4 will continue to search where other models give up, making it the most persistent model they've tested.

Million-Token Context: Extremely Long

GPT-5.4