
The world is still in a frenzy over "lobsters," while the war for the "AI operating system" has quietly begun.

锦缎 (Jinduan) · 2026-03-09 08:23
OpenAI starts to build an "AI operating system", and the world is still unaware.

GPT-5.4, which OpenAI had been hyping both explicitly and implicitly for a long time, finally made its official debut last Friday. Its capabilities have, needless to say, improved. Interestingly, the newly released version is closely tied to the currently popular application OpenClaw, and the core of it all is a key capability repeatedly emphasized in OpenAI's official introduction: "Computer Use".

Before we proceed, here is the core idea, and the main point this article aims to convey: through GPT-5.4, it is clear that OpenAI is no longer building just a smarter chat model, but a brand-new "AI Operating System" (AI OS).

From long-context handling and tool invocation to native computer control, everything is paving the way for this "operating system". While the world is still cheering for OpenClaw's popularity and getting excited about the concept of agents, OpenAI has already built the core agent capability (Computer Use) into the model's underlying layer.

The world may not be aware of it yet, but we are standing at the starting point of a new era: AI is about to transform from a "product application" to an "operating platform".

01 The Core of the "Operating System": Reasoning + Coding + Workflow

Compared with Google's Gemini, which excels at world knowledge, OpenAI's ChatGPT series is often characterized as the "science student".

Although its ability to provide emotional value has slipped somewhat since the GPT-5 upgrade, its programming and mathematical abilities remain outstanding.

This time, to let the model's abundant capabilities land successfully in the agent era, GPT-5.4 delivers a core technical breakthrough:

Integrating the capabilities of reasoning, coding, and agent workflow into a single model architecture.

In simple terms, GPT-5.4 is more versatile and has stronger capabilities in specific fields. It is no longer a single-function tool, but an "operating system kernel" with general capabilities.

At the reasoning level, to ground the model in real applications and enable it to carry out complex tasks, OpenAI has specifically strengthened GPT-5.4's context understanding.

Facing complex tasks of up to 1 million tokens (equivalent to being able to process an entire set of project documents or long-term financial records at once), the model can integrate massive amounts of data and correctly deduplicate information. The error rate for single-fact claims has decreased by 33% compared to GPT-5.2, and the output in highly professional scenarios is more reliable.

In addition, GPT-5.4 now supports a 1M-token context window in Codex, but users need to set it manually in config.toml; otherwise the default remains 256k.
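For reference, an entry of this kind in config.toml might look like the sketch below. The key names here are illustrative assumptions, not confirmed fields; check the Codex configuration documentation for the exact syntax.

```toml
# ~/.codex/config.toml -- opting into the larger context window.
# NOTE: key names are assumptions for illustration only.
model = "gpt-5.4"
model_context_window = 1000000  # default would otherwise remain 256k
```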

Specifically in knowledge work, in the GDPval benchmark test for 44 occupations, GPT-5.4 can reach or exceed the level of industry experts in more than 83% of scenarios.

Compared with GPT-5.2's 70.9%, this is a significant improvement. Somewhat puzzlingly, though, the Pro version of GPT-5.4 scores slightly lower than GPT-5.4 itself. (The official explanation is that the Pro version prioritizes stability on extremely complex tasks over average scores in general scenarios.)

To better integrate GPT-5.4 into people's actual work, OpenAI's official introduction directly demonstrates the new model's professional performance in three scenarios: spreadsheets, documents, and slides.

In addition, GPT-5.4's significant progress also plays a crucial role in highly professional fields such as finance and law.

Feedback from multiple international institutions shows that while the new model improves accuracy in financial modeling, contract analysis, and long-horizon task execution, the frequency of user interaction with the AI has dropped markedly, substantially shortening task completion times.

In the coding scenarios developers care most about, GPT-5.4 generates code of the same quality as GPT-5.3-Codex, without significant improvement. However, the newly added "/fast" mode delivers roughly 1.5x the token generation speed.

On the agent side, tool invocation is the core capability an agent needs to complete tasks. The newly introduced "Tool Search" mechanism lets the model invoke capabilities on demand across an ecosystem of tens of thousands of tools; with accuracy unchanged, token consumption drops by an astonishing 47%.
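To make the idea concrete, here is a minimal sketch of how an on-demand tool-search mechanism could work: instead of placing every tool schema into the model's context, the runtime searches a catalog and injects only the few relevant tools. All names and the naive keyword scoring are illustrative assumptions, not OpenAI's actual API.

```python
# Hypothetical "Tool Search" sketch: rank a tool catalog against the
# user request and surface only the top matches, so only those few
# schemas consume context tokens.

TOOL_CATALOG = {
    "get_weather": "Fetch the current weather for a city.",
    "send_email": "Send an email to a recipient.",
    "query_database": "Run a read-only SQL query.",
    "create_invoice": "Create a billing invoice for a customer.",
}

def tool_search(request: str, catalog: dict, top_k: int = 2) -> list[str]:
    """Rank tools by naive keyword overlap with the request."""
    words = set(request.lower().split())
    scored = [
        (len(words & set(desc.lower().replace(".", "").split())), name)
        for name, desc in catalog.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

# Only the matching schemas would be injected into context, which is
# where the token savings come from.
print(tool_search("please send an email about the weather", TOOL_CATALOG))
```

A production system would use embedding similarity rather than keyword overlap, but the context-budget argument is the same: the prompt carries two tool schemas instead of tens of thousands.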

This is exactly how the "operating system" schedules underlying resources, efficiently and accurately.

02 Native Computer Operation: From Understanding to Execution, This is the Interface of the "Operating System"

The form of AI has evolved from large language models to agents. To achieve product commercialization, AI must be able to truly help people do things.

Therefore, AI companies around the world have turned their attention to controlling users' PCs.

However, since the release of various desktop agents, download and retention rates have actually been unimpressive. Even among ChatGPT's 956 million monthly active users, many are reluctant to install a separate desktop agent.

People are used to chatting with AI (large language models), but not yet used to letting AI (agents) take over the computer.

So OpenAI came up with a clever idea: let the large model users already use every day control the computer, with no separate download or installation required.

Thus GPT-5.4 naturally became the first general-purpose model with native computer-operation capability.

The principle is not complicated: the model issues mouse and keyboard instructions based on screenshots of the screen, and it can also write code through libraries such as Playwright to operate software directly.
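The screenshot-in, action-out loop described above can be sketched roughly as follows. The model and executor are stubbed here; a real implementation would send actual screen captures and drive the mouse and keyboard (for example via Playwright or OS input APIs). Every name in this sketch is a hypothetical stand-in, not OpenAI's actual interface.

```python
# Minimal sketch of a "Computer Use" loop: capture a screenshot, ask
# the model for the next action, execute it, repeat until done.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str                 # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def stub_model(screenshot: bytes, step: int) -> Action:
    """Stand-in for the model: click a button, type a query, finish."""
    script = [
        Action("click", x=400, y=300),
        Action("type", text="quarterly report"),
        Action("done"),
    ]
    return script[step]

def run_computer_use(max_steps: int = 10) -> list[str]:
    log = []
    for step in range(max_steps):
        screenshot = b"<pixels>"          # placeholder for a real capture
        action = stub_model(screenshot, step)
        if action.kind == "done":
            log.append("done")
            break
        elif action.kind == "click":
            log.append(f"click({action.x},{action.y})")
        elif action.kind == "type":
            log.append(f"type({action.text!r})")
    return log

print(run_computer_use())
```

The point of the sketch is the control flow: the model never touches the machine directly; it only emits structured actions that a thin executor replays against the real screen.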

Unlike PC-side agent assistants that require dedicated training to use, GPT-5.4 builds computer control directly into its general architecture. Developers can switch seamlessly between reasoning, coding, and task execution in the same model, just as an operating system natively ships with drivers for the underlying hardware (keyboard, mouse, screen).

When it comes to computer control, security issues cannot be avoided.

GPT-5.4's behavior can be finely tuned through developer configuration to meet the needs of different application scenarios.

To ensure security, developers can configure custom security confirmation strategies and set different operation confirmation mechanisms according to the risk level of tasks.

Low-risk tasks such as data queries and code writing can be set to execute automatically, while fund transfers and file deletion or modification must be confirmed manually. This preserves system security while keeping workflows efficient.
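A risk-tiered confirmation policy of this kind can be sketched in a few lines. The tier names and the function signature are illustrative assumptions, not OpenAI's actual configuration API.

```python
# Sketch of a risk-tiered confirmation policy: low-risk actions run
# automatically, high-risk actions (payments, file deletion) require
# an explicit confirmation callback before executing.

RISK_POLICY = {
    "query_data": "auto",
    "write_code": "auto",
    "transfer_funds": "confirm",
    "delete_file": "confirm",
}

def execute(action: str, confirm_fn) -> str:
    """Run `action` directly, or ask `confirm_fn` first for risky ones."""
    policy = RISK_POLICY.get(action, "confirm")  # unknown actions take the safe path
    if policy == "confirm" and not confirm_fn(action):
        return f"{action}: blocked (user declined)"
    return f"{action}: executed"

# Low-risk runs without asking; high-risk goes through the callback.
always_no = lambda action: False
print(execute("query_data", always_no))
print(execute("delete_file", always_no))
```

Defaulting unknown actions to "confirm" is the conservative design choice: anything the policy has not explicitly whitelisted must go through a human.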

In the OSWorld-Verified benchmark, GPT-5.4 achieved a 75% success rate, surpassing the human baseline (72.4%) and well ahead of the previous generation GPT-5.2's 47.3%, which is enough to demonstrate the new model's practicality and reliability in PC-side tasks.

In browser automation, GPT-5.4 obtained success rates of 67.3% and 92.8% in the WebArena-Verified and Online-Mind2Web tests respectively, relying mainly on screenshots.

This means that even without access to a page's underlying structure, the model can complete complex web interactions from visual information alone, thanks mainly to a systematic improvement in underlying visual perception.

The improvement in the traditional multimodal field is relatively small. In the MMMU-Pro visual understanding and reasoning test, the model's accuracy increased from 79.5% to 81.2%; however, the recognition ability for structured information has significantly improved. The average error rate of the model in the OmniDocBench document parsing benchmark test decreased from 0.140 to 0.109.

In other words, the model is better at handling the complex file types common in work environments, such as PDFs and scanned documents, and no longer stumbles over tables and illustrations as it used to.

To handle dense interfaces and fine-grained operations, GPT-5.4's newly added "original" image-input level supports full-fidelity perception of up to 10.24 million pixels.

According to user feedback, when the model processes complex interfaces such as enterprise-level ERP systems, financial statements, or engineering design software, the positioning accuracy of interface elements and the success rate of click operations in high-resolution mode have significantly improved.
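As a back-of-the-envelope check on that budget: a common 2560x1440 desktop is well under 10.24 megapixels, while a 5K capture is not and would need downscaling. The budget figure comes from the article; the helper below is purely illustrative.

```python
# Check whether a screenshot fits the 10.24-megapixel full-fidelity
# budget, and compute a uniform downscale when it does not.

PIXEL_BUDGET = 10_240_000  # 10.24 million pixels

def fit_to_budget(width: int, height: int) -> tuple[int, int]:
    """Return (width, height) scaled uniformly to fit the pixel budget."""
    pixels = width * height
    if pixels <= PIXEL_BUDGET:
        return width, height
    scale = (PIXEL_BUDGET / pixels) ** 0.5
    return int(width * scale), int(height * scale)

print(fit_to_budget(2560, 1440))   # ~3.7 MP: fits, returned unchanged
print(fit_to_budget(5120, 2880))   # ~14.7 MP: downscaled to fit
```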

03 Actual Test: The Battle of Operating Systems Starts with an Expensive Ticket

In this official introduction, OpenAI tries to prove the model's powerful capabilities with the scores of a large number of benchmark tests and the professional evaluations of well-known institutions.

Although people generally distrust benchmark scores, several hands-on tests suggest that OpenAI's claims hold up.

First, on the Artificial Analysis leaderboard, it topped the rankings for intelligence, coding ability, and agentic ability simultaneously, as expected.

If this is not convincing enough, you can also take a look at a comprehensive test on the X platform:

Original video link: https://x.com/angaisb_/status/2029635731585372598?s=46&t=E5aK_KpbsE6EAIfDJWZvzQ

This is a Minecraft-style game that X user @Angaisb_ wrote with GPT-5.4. Whether in the first-person action logic (running, jumping, building) or the texture and look of the in-game blocks, it is nearly impeccable.

What the demo shows is almost on par with early versions of Minecraft.

It can be seen that the functions of GPT-5.4 are truly powerful and indeed have considerable practical value.

But as the saying goes, you get what you pay for. Such powerful functions naturally mean extremely high costs.

Compared with GPT-5.2, the price increase is striking. Some users even reported that within a few hours of the model's release, a greeting and a single question burned through hundreds of dollars.

Such powerful capabilities at such a high price seem somewhat at odds with OpenAI's own framing of "overflowing capabilities".

Now, OpenClaw has driven the popularity of domestic large models with extremely low token costs, and GPT series products have fallen out of the top ten in the usage ranking. Why does OpenAI still dare to set such a high price for GPT-5.4?

The shortage of computing power resources goes without saying, but the deeper answer may lie in the subtle shift of OpenAI's recent commercialization strategy.

Reportedly, OpenAI is reducing direct purchase options inside the ChatGPT app; rather than making the chat interface the core scenario for closed-loop transactions, it is prioritizing external applications to handle purchases.

This indicates that OpenAI is shifting from "direct monetization to consumers" to "indirect profit through the ecosystem".

OpenAI positions GPT-5.4 as professional infrastructure, filtering for high-value customers through a capability premium. Ordinary users' monetization needs are taken over by third-party ecosystems such as Notion and Cursor that integrate ChatGPT's capabilities; users experience the model indirectly through partners' products without directly bearing the high API costs.

Friends who are familiar with desktop intelligent agents may notice