
GPT-5.4 mini and nano launch a surprise attack: at one third of the price, they fully power the "Lobster". OpenAI has gone completely crazy.

Xinzhiyuan · 2026-03-18 08:18
Cutting the "shrimp-raising fee" to a fraction of what it was.

Late at night, OpenAI unleashed the twin models GPT-5.4 mini and GPT-5.4 nano. Their capabilities approach the full-fledged version, with top-notch speed and cost-effectiveness. They are genuinely excellent for coding and for serving as the main models behind "Lobster"!

OpenAI quietly dropped another bombshell.

Today, GPT-5.4 mini and GPT-5.4 nano were officially released.

There was no teaser, no countdown; they simply went live.

The problem these two models aim to solve is clear: in a real production environment, how can AI work quickly, accurately, and cheaply?

They inherit the core strengths of GPT-5.4 at maximum speed and lower cost, the pinnacle of lightweight models.

Let's start with the most astonishing figures:

Coding (SWE-Bench Pro): GPT-5.4 mini achieved 54.4%, while the full-fledged GPT-5.4 scored 57.7%;

Computer use (OSWorld-Verified): GPT-5.4 mini scored 72.1%, close to GPT-5.4's 75.0%.

In reasoning and tool-calling tasks as well, the mini version's capabilities approach those of GPT-5.4.

Moreover, GPT-5.4 mini runs a full 2x faster than the previous-generation GPT-5 mini!

Netizens say outright that the mini and nano versions can serve as the main models behind "Lobster"!

GPT-5.4 mini has a large 400K context window. Input costs $0.75 per million tokens and output $4.50 per million tokens;

GPT-5.4 nano costs $0.20 per million input tokens and $1.25 per million output tokens.

Compared with GPT-5.4, the mini's output price is one third; the nano's is only one twelfth.
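The per-request economics are easy to work out from those prices. A minimal sketch: the per-million-token prices are the ones quoted above, but the 200K-input / 10K-output request size is made up for illustration:

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float) -> float:
    """Compute request cost; prices are quoted in USD per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Prices from the article; a hypothetical 200K-in / 10K-out coding request.
mini_cost = cost_usd(200_000, 10_000, 0.75, 4.50)   # GPT-5.4 mini
nano_cost = cost_usd(200_000, 10_000, 0.20, 1.25)   # GPT-5.4 nano
print(f"mini: ${mini_cost:.4f}, nano: ${nano_cost:.4f}")
```

At these rates, the sketch request costs about 19.5 cents on mini and about 5 cents on nano, which is the whole point of the release.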

Fast, strong, and affordable: all three at once.

Half a year ago, this was completely impossible.

Some who tried it exclaimed that it's simply amazing: not only fast, but also nine times cheaper than Claude 4.6 Opus.

A Terrifying Evolution in Coding

The mini version catches up with the full-fledged model, and the nano version outperforms the previous generation

Let's first look at coding.

SWE-Bench Pro is one of the most rigorous benchmarks of a large model's real coding ability. Instead of fill-in-the-blank questions, it asks the model to directly fix real software bugs from GitHub.

GPT-5.4 mini achieved 54.4%, only 3.3 percentage points behind the full-fledged GPT-5.4 (57.7%).

This means a small model optimized for speed and cost now comes within striking distance of the flagship on real engineering problems.

The previous-generation GPT-5 mini scored only 45.7%, so the mini line jumped nearly 9 percentage points in a single generation.

The gap on Terminal-Bench 2.0 is even starker: GPT-5.4 mini scored 60.0% against GPT-5 mini's 38.2%, a relative improvement of over 57%.

Even the smallest nano version scored 52.4% on SWE-Bench Pro, nearly 7 percentage points above the previous-generation mini.

An ultra-lightweight model positioned for "classification and data extraction" now beats the previous generation's mid-weight model at coding. That is how fast distilled models have evolved over the past few months.

For developers, the practical meaning of these numbers is straightforward:

Coding tasks that don't need the flagship's full-power thinking, such as targeted code edits, front-end page generation, debugging loops, and code-base search, can now all be handed to the mini version. It's twice as fast, far cheaper, and nearly as good.
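In practice that division of labor can be as simple as a routing function. A sketch, assuming hypothetical model identifiers (`gpt-5.4` and `gpt-5.4-mini`) and made-up task labels:

```python
# Tasks light enough for the mini model, per the list above.
MINI_TASKS = {"targeted_edit", "frontend_generation", "debug_loop", "code_search"}

def pick_model(task_type: str) -> str:
    """Route cheap, latency-sensitive coding tasks to the mini model;
    everything else goes to the flagship."""
    return "gpt-5.4-mini" if task_type in MINI_TASKS else "gpt-5.4"

print(pick_model("debug_loop"))           # gpt-5.4-mini
print(pick_model("architecture_review"))  # gpt-5.4
```

Real systems would classify tasks with a model rather than a fixed set, but the economics are the same: only escalate to the flagship when the task demands it.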

PhD-level Reasoning and Complex Tool Calling: A Double Win

Coding is only one dimension. Reasoning and tool-calling abilities decide whether a model can truly do work.

GPQA Diamond is a PhD-level scientific reasoning benchmark. GPT-5.4 mini scored 88%, only 5 percentage points behind GPT-5.4.

Even more noteworthy is tool calling.

Toolathlon tests performance on complex tool chains: not just calling an API once, but correctly combining, sequencing, and using multiple tools across multi-step tasks.

Here GPT-5.4 mini scored 42.9%, completely crushing GPT-5 mini (26.9%).

On the telecom-specific benchmark τ2-bench, the mini scored an ultra-high 93.4%, nearly matching the full-fledged version's 98.9% and leaving GPT-5 mini (74.1%) far behind.

On another tool-calling benchmark, MCP Atlas, GPT-5.4 mini scored 57.7% to GPT-5 mini's 47.6%.

These numbers add up to one thing: GPT-5.4 mini is not merely a shrunken smart model; it is a real executor that can independently complete complex task chains in production.
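What "correctly combining and sequencing tools" means mechanically can be sketched with a toy executor. The tool names and the plan below are invented for illustration; a real agent would have the model produce the plan:

```python
def run_chain(plan, tools):
    """Execute (tool_name, kwargs) steps in order, threading each result
    into the next step as `prev` — the essence of a multi-step tool chain."""
    prev = None
    for name, kwargs in plan:
        prev = tools[name](prev, **kwargs)
    return prev

# Stub tools standing in for real APIs.
tools = {
    "fetch":   lambda prev, url: f"page({url})",
    "extract": lambda prev, field: f"{field} of {prev}",
}
plan = [("fetch", {"url": "example.com"}), ("extract", {"field": "title"})]
print(run_chain(plan, tools))  # title of page(example.com)
```

Benchmarks like Toolathlon score whether the model picks the right tools, in the right order, with each step's output correctly wired into the next, which is exactly what this loop makes explicit.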

The Main Model for "Lobster"

Small Models Can Also "Work by Looking at the Screen"

What really surprises people about GPT-5.4 mini is its computer-use performance.

How do people use a computer? They look at the UI elements on the screen with their eyes, use their brains to decide where to click, and use their hands to operate the mouse and keyboard.

If an AI is to truly become your "cyber assistant", it needs to learn the same thing: quickly parse an information-dense screenshot, locate buttons, input boxes, and data lists, and then perform the correct operations.

OSWorld-Verified measures exactly this combined ability of visual understanding, reasoning, and operation.

On this list, GPT-5.4 mini scored 72.1%, while the flagship GPT-5.4 scored 75.0%. The gap is less than 3 percentage points.

By contrast, GPT-5 mini scored only 42.0%; computer-use ability has nearly doubled in a single generation.

However, the nano version scored only 39.0% here, slightly below even the previous-generation GPT-5 mini's 42.0%.

This shows that computer-use tasks place high demands on visual reasoning. Shrinking the model is not enough, and there is a clear capability gap between the mini and nano versions.

On MMMU-Pro (with Python tools), the mini scored 78.0% to the flagship's 81.5%, again a very small gap.

This benchmark covers a large number of complex questions that require reasoning by combining visual information and mathematical/code tools.

These results matter most for one direction: AI agents.

When a small model can quickly parse an information-dense UI screenshot and make correct operation decisions at low latency, it becomes an ideal engine for real-time computer-use agents: low cost, fast response, sufficient capability.
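The loop behind such agents is simple in shape, even if the model inside is not. A stubbed sketch, with no real model or screenshots and an invented action format:

```python
def run_agent(observe, decide, act, max_steps=10):
    """Observe → decide → act until the model declares the task done."""
    for step in range(max_steps):
        screenshot = observe()       # capture the current screen
        action = decide(screenshot)  # model picks one UI action
        if action["type"] == "done":
            return step
        act(action)                  # harness clicks/types on its behalf
    raise RuntimeError("step budget exhausted")

# Stubs: a fake screen, a scripted "model", and a no-op actuator.
script = iter([{"type": "click", "x": 40, "y": 12}, {"type": "done"}])
steps = run_agent(lambda: b"fake-png", lambda s: next(script), lambda a: None)
print(steps)  # 1: the scripted agent finished on its second decision
```

Latency compounds here: every screen action costs a full model round-trip, which is why a fast mini-class model changes what is practical.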

In a recent TBPN interview, Altman spelled out the next step:

OpenAI will launch an evolved version of Codex. The new version will no longer be limited to programming but will grow into a powerful tool for controlling computers.

In his vision, people can start and manage new tasks entirely from their phones. The real ultimate experience is a personal AI built on a unified backend.

It can access all of one's data, ideas, materials, and memories, and seamlessly execute tasks across devices.

Sub - Agent Paradigm

Large Model for Decision-Making, Small Model for Execution

In this release, OpenAI spent a lot of time elaborating one idea: the best AI system doesn't necessarily need the largest model to handle everything.

The architecture they proposed is very clear:

The flagship GPT-5.4 handles planning, coordination, and final decision-making, then dispatches concrete tasks to GPT-5.4 mini sub-agents for parallel execution.

Searching code bases, reviewing large files, processing support documents: tasks that need to be finished fast rather than thought through deeply are all handed to the mini version.
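The planner/worker split described above can be sketched in a few lines. The model calls are stubbed with plain functions, and the three-way split is arbitrary; only the fan-out shape is the point:

```python
from concurrent.futures import ThreadPoolExecutor

def plan(task: str) -> list:
    """Stand-in for the GPT-5.4 planner: split work into subtasks."""
    return [f"{task} / part {i}" for i in range(3)]

def sub_agent(subtask: str) -> str:
    """Stand-in for a GPT-5.4 mini worker executing one subtask."""
    return f"done: {subtask}"

def run(task: str) -> list:
    subtasks = plan(task)                  # flagship decides the split
    with ThreadPoolExecutor(max_workers=8) as pool:
        # mini workers execute in parallel; map preserves subtask order
        return list(pool.map(sub_agent, subtasks))

print(run("review repo"))
```

Because the workers are cheap and fast, the fan-out costs little; the flagship only pays its premium price on the final synthesis step.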