
Sam Altman just unveiled GPT-5.4 and brought the house down. Netizens: a simple "Hi" costs 80 US dollars.

爱范儿 (ifanr) 2026-03-06 08:54
One model that does it all.

Every time you open an AI tool, you probably have to think for a second: Which model is best suited for this task? For writing code, it's one model; for research, it's another. And if you want to ask the AI to operate your computer, you have to open yet another window.

As of today, there is finally a solution to this fragmentation problem.

Just now, OpenAI officially introduced GPT-5.4, which integrates programming, logical thinking, computer control, web search, and a one-million-token context into a single model without sacrificing performance in any of these areas.

OpenAI CEO Sam Altman also posted a short tweet on X, highlighting five key points: stronger performance in knowledge work, better web search, native computer control, support for a one-million-token context, and the ability to intervene while a response is being generated.

These few words map almost one-to-one onto the five problems that have come up most often in deploying AI applications over the past two years.

Knowledge Work: In eight out of ten cases, the AI beats professionals

To understand GPT-5.4's improvements in knowledge work, one first has to know the design logic of the GDPval benchmark.

It covers nine industries and 44 occupations that contribute the most to the US GDP. The tasks are those that occur daily in the workplace: Writing financial models for investment banks, planning emergency shifts in hospitals, and creating presentations for sales teams.

Once the tasks are completed, the results are blind-graded by practicing industry professionals to determine in what percentage of head-to-head comparisons the AI's work beats a human colleague's.

GPT-5.4 scores 83.0%, meaning that in more than eight out of ten comparisons, professionals judged the AI's output to match or exceed that of human colleagues. Its predecessor, GPT-5.2, scored 70.9%, a gap of just over 12 percentage points.
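The grading scheme can be sketched in a few lines (the verdict data and function name here are invented for illustration; GDPval's actual pipeline is more involved):

```python
# Minimal sketch of a blind pairwise win-rate computation: expert graders
# compare the AI deliverable against a human professional's for each task
# and record a verdict. "win" and "tie" both count as the AI matching or
# exceeding the human's work.

from collections import Counter

def win_rate(verdicts):
    """Fraction of comparisons where the AI's work was judged at least
    as good as the human expert's ('win' or 'tie')."""
    counts = Counter(verdicts)
    favorable = counts["win"] + counts["tie"]
    return favorable / len(verdicts)

# Toy example: 10 blind comparisons (6 wins, 2 ties, 2 losses).
verdicts = ["win", "win", "tie", "loss", "win",
            "win", "tie", "win", "loss", "win"]
print(f"{win_rate(verdicts):.1%}")  # 80.0%
```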

The improvement is most obvious in building spreadsheet models. When simulating the tasks of an entry-level investment-bank analyst, GPT-5.4 averages 87.3%, against 68.4% for GPT-5.2 and 79.3% for GPT-5.3-Codex, putting it almost 19 percentage points ahead of GPT-5.2.

The results on the legal platform Harvey's BigLaw Bench are also impressive: GPT-5.4 reaches 91%, and it also comes out on top on the APEX-Agents benchmark from the professional-services evaluation platform Mercor.

The accuracy gains are also remarkable. Hallucinations have been the biggest obstacle to integrating AI into professional workflows; every one-percentage-point reduction means the AI can be safely used in more scenarios.

The data shows that GPT-5.4 is 33% less likely than GPT-5.2 to make an error in any single statement, and 18% less likely to produce a complete response containing an error.
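One detail worth spelling out: "33% lower" is a relative reduction, not a drop of 33 percentage points. With a hypothetical baseline rate (not OpenAI's published figure), the arithmetic looks like this:

```python
# Illustrative arithmetic only: a "33% lower" error rate scales the old
# rate by (1 - 0.33); it does not subtract 33 percentage points.
old_rate = 0.12                      # hypothetical per-statement error rate
new_rate = old_rate * (1 - 0.33)     # 33% relative reduction
print(round(new_rate, 4))            # 0.0804
```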

Programming: One model for writing and testing code

GPT-5.4 folds the programming capabilities of GPT-5.3-Codex into the main model. For developers, this means no longer switching to a separate model to write code, with no compromise in programming ability.

On SWE-Bench Pro, which tests real software-development projects, GPT-5.4 reaches 57.7%, against 56.8% for GPT-5.3-Codex and 55.6% for GPT-5.2. After the integration, programming performance actually improves, and the model gains a set of general capabilities such as computer control, with no obvious weaknesses.

The well-known AI test blogger Dan Shipper wrote after a test: "This is the best planning ability we've seen from OpenAI recently. The code review is also strong, and the costs are about half of Opus."

He highlighted two specific aspects. First, planning ability is the key to success on long-horizon tasks, and GPT-5.4 is much better organized at dividing up and resuming work. Second, costing roughly half as much as Claude Opus makes a significant difference on the bill for developers who need a large number of API calls.

After activating /fast mode in Codex, GPT-5.4's token-generation speed increases by up to 1.5×, letting users keep a smooth workflow while coding, iterating, and debugging.

In addition, the newly introduced experimental feature Playwright Interactive enhances the programming experience with GPT-5.4.

When building web or Electron applications, GPT-5.4 can debug in real time through a visible browser. The model can write code and test the application it creates at the same time, acting as both developer and tester.

OpenAI showed a typical example: from a single simple prompt, GPT-5.4 built a complete isometric theme-park simulation game, including a tile-based system for laying paths and placing attractions, AI for visitor pathfinding and queueing behavior, and an overall score with four indicators (money, visitor count, satisfaction, cleanliness), all updated in real time.
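For a sense of what such a four-indicator score model involves, here is a minimal sketch in Python (all names, starting values, and update rules are invented for illustration, not taken from the demo):

```python
from dataclasses import dataclass

# Hypothetical game-state model with the four indicators described above.
@dataclass
class ParkScore:
    money: float = 1000.0
    visitors: int = 0
    satisfaction: float = 50.0   # 0-100 scale
    cleanliness: float = 100.0   # 0-100 scale

    def tick(self, arrivals: int, spend_per_visitor: float) -> None:
        """One simulation step: visitors arrive, spend money,
        and wear the park down."""
        self.visitors += arrivals
        self.money += arrivals * spend_per_visitor
        self.cleanliness = max(0.0, self.cleanliness - 0.1 * arrivals)
        # Satisfaction drifts toward cleanliness (a simple smoothing rule).
        self.satisfaction += 0.1 * (self.cleanliness - self.satisfaction)

park = ParkScore()
park.tick(arrivals=20, spend_per_visitor=5.0)
print(park.money, park.visitors)  # 1100.0 20
```

A real implementation would update these indicators every frame; the point is only that the whole scoring loop fits in a small, testable state object.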

Playwright Interactive ran several automated tests throughout the process, checking path-laying correctness, camera navigation, visitor reactions, and the UI indicators. The model completed the entire loop from writing code to verifying it by test, entirely on its own.

The blogger Angel also built a Minecraft clone with GPT-5.4. The model took about 24 minutes, and the game runs smoothly without glitches. He tweeted: "Minecraft is basically solved. Now I have to come up with a new test task."

Wharton School professor Ethan Mollick also had early access to GPT-5.4. Using the same prompt, he had GPT-5.4 Pro create a three-dimensional spatial scene inspired by "Piranesi" without a single error, giving only one follow-up instruction: "Make it better." He then compared the result with the version GPT-4 produced two years ago, and the difference is obvious.

Computer Control: It can do better than you

This is the most remarkable change in the GPT-5.4 release. Until now, OpenAI's computer-control ability lived in an independent module, separate from the model's language understanding and code generation.

The two systems worked independently of each other, and the information transfer was inefficient. Now this separation is removed, and GPT - 5.4 uses its own logical thinking ability to control the computer without any detours.

This is also OpenAI's first product to integrate the computer control ability into the general model. In the future, this will probably be a new starting point for the development of AI agents.

On the OSWorld-Verified benchmark, which tests desktop-navigation ability, GPT-5.4 achieves a 75.0% success rate on real operating-system tasks using screenshots plus mouse-and-keyboard interaction. The human baseline is 72.4%, and GPT-5.2 managed only 47.3%.

In short, it has reached and even surpassed human performance.

On the online Mind2Web benchmark, which tests browser control from screenshots alone, GPT-5.4 reaches 92.8%, while the Agent Mode of ChatGPT Atlas reaches only 70.9%.

Real-world deployments illustrate this even better. Mainstay used GPT-5.4 to fill in forms automatically across roughly 30,000 property-tax portals. The first-attempt success rate is 95%, and the success rate within three attempts is 100%, up from 73%-79% with previous models. Case-handling speed roughly tripled, and token consumption dropped by about 70%.
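As a sanity check, those two success rates are consistent with simple retry arithmetic: if each attempt were to succeed independently with probability 0.95, the chance of succeeding within three attempts would be about 99.99%. (The independence assumption and the helper below are illustrative, not Mainstay's actual setup.)

```python
# Under an independence assumption (each attempt succeeds with p = 0.95),
# the probability of at least one success within n attempts is 1 - (1 - p)^n.
def success_within(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(round(success_within(0.95, 3), 6))  # 0.999875
```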

This is closely tied to improved visual perception. Computer control ultimately requires "seeing the interface correctly": knowing what is on the screen, where the buttons are, and whether a click lands in the right place.

GPT-5.4 has made targeted improvements here and introduces an original-image input mode, supporting high-resolution images of up to 10.24 million pixels or a maximum edge length of 6,000 pixels. The previous one...