
Nano Banana: OpenAI, you just can't learn it.

直面AI · 2025-11-24 17:12
OpenAI cannot rebuild the Nano Banana Pro, but it can have GPT-5o.

Altman sent an internal letter to all OpenAI employees, openly admitting that although OpenAI is still in the lead, Google is closing the gap. He also acknowledged that Google's recent product releases have put considerable pressure on OpenAI.

In fact, as Altman said, Google has this time brought out not only the much-watched Gemini 3 Pro but also the Nano Banana Pro, which has shaken the entire AIGC scene. Until now, the underlying logic of image-generation models was to imitate the world: they searched huge databases for the most suitable images and pieced them together.

With the arrival of the Nano Banana Pro, this rule has been broken completely. It does not "draw"; it "simulates the physical world". The biggest breakthrough is that it introduces a Chain-of-Thought reasoning mechanism that makes the model think first and draw afterwards.

Before the first pixel is placed, the model runs through a logical sequence of steps in latent space: it counts the objects, determines the light angles, and plans the spatial relationships. It no longer relies on text as an intermediate step; the reasoning guides pixel generation directly, in the form of high-dimensional vectors.

Now the question is: Why can't OpenAI develop the Nano Banana Pro?

01

Before we answer that question, let's first look at how the Nano Banana Pro differs from GPT-4o, the model OpenAI currently relies on for most of its image generation.

Take the task of generating "three apples" as an example. The prompt reads: "The left apple has a bite mark, the middle apple has water droplets on it, and the right apple is rotten." Given this instruction, GPT-4o typically produces a colorful, well-composed image very quickly.

On closer inspection, however, the weaknesses of probabilistic generation show up: the water droplets on the middle apple are arranged in ways that defy physics, and the rot on the right apple looks too deliberate.

By contrast, the image generated by the Nano Banana Pro contains exactly the right number of objects, and each object's attributes match precisely: the bite mark on the left, the refracted light in the droplets in the middle, and the oxidized texture of the rot on the right are all reproduced accurately.

Behind these visible differences are two completely different technical approaches.

GPT-4o's generation mechanism is essentially based on statistical correlation. It searches its huge training data for the visual features of "apple + bite mark" and assembles and fuses them according to probability distributions. It does not really understand the number "three", and it has no physical model of "rotten"; it simply performs an approximate match based on feature distances in high-dimensional space.
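
To make "matching by feature distance" concrete, here is a minimal Python sketch of the idea only, not OpenAI's actual pipeline; the toy embeddings and the cosine_similarity helper are illustrative assumptions:

# Minimal sketch: "generation by statistical correlation" reduced to its core idea.
# A prompt is embedded as a vector, and whatever learned features lie closest to it
# win out - with no symbolic notion of "three" or physical model of "rotten".
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for learned visual features (values are made up).
features = {
    "apple + bite mark": np.array([0.9, 0.1, 0.3]),
    "apple + water droplets": np.array([0.2, 0.8, 0.4]),
    "apple + rot": np.array([0.1, 0.3, 0.9]),
}
prompt_vec = np.array([0.85, 0.2, 0.35])  # embedding of the user's request

# The "nearest" feature dominates, regardless of counts or correct attribute binding.
best = max(features, key=lambda k: cosine_similarity(prompt_vec, features[k]))
print(best)  # -> "apple + bite mark"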

The Nano Banana Pro, on the other hand, introduces a Chain-of-Thought (CoT) mechanism that upgrades image generation from simple "pixel prediction" to a "logical sequence of steps". Before the first pixel is placed, the model has already completed a symbolic plan: first the entities (objects 1, 2, 3) are determined, then spatial coordinates are assigned, and finally physical attributes are bound to each entity.

For the "bite mark", the geometric change is derived; for the "water droplets", the optics of reflection and refraction are calculated; for the "rot", the change in material properties is simulated. It is a closed loop from semantic understanding through logical planning to the execution of generation.
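
Google has not published this internal representation, but the planning step described above can be sketched roughly as follows (hypothetical Python data structures, assuming a plan of entities, coordinates, and bound attributes is produced before any pixel is generated):

# Hypothetical sketch of "think first, then draw": a symbolic plan is built
# before generation and would then condition the renderer.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                      # e.g. "apple_1"
    position: tuple[float, float]  # planned spatial coordinates (normalized)
    attributes: dict = field(default_factory=dict)  # bound physical properties

def plan_scene() -> list[Entity]:
    # Step 1: fix the entities, so the count is exact (exactly three apples).
    # Step 2: assign spatial coordinates (left / middle / right).
    # Step 3: bind each physical attribute to the correct entity.
    return [
        Entity("apple_1", (0.2, 0.5), {"bite_mark": True}),
        Entity("apple_2", (0.5, 0.5), {"water_droplets": True, "refraction": True}),
        Entity("apple_3", (0.8, 0.5), {"rotten": True, "oxidation": True}),
    ]

scene_plan = plan_scene()
# In the described architecture this plan conditions pixel generation directly,
# instead of being flattened back into a text prompt.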

This mechanism shows its strength especially in complex scenes governed by physical laws.

The prompt is: "A half-full glass of water on the windowsill, with sunlight shining in from the left."

The image generated by GPT-4o looks plausible, but the lighting is physically inconsistent: the left side of the windowsill should show sunlight reflected by the glass, yet the image only shows light bouncing off the right side.

The Nano Banana Pro, on the other hand, first computes the light vector, then derives the shadow direction and the refraction through the liquid medium. This reasoning grounded in physical knowledge makes the result no longer a pile of visual elements but a digital simulation of the physical world.
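
The physics being invoked here is ordinary vector optics. A minimal sketch of the quantities involved, assuming unit vectors and the standard mirror and Snell's-law formulas (the scene values are illustrative only):

# Minimal sketch of the optics alluded to above: given an incoming light vector,
# compute the reflected direction and the refracted direction inside the water.
import numpy as np

def reflect(light: np.ndarray, normal: np.ndarray) -> np.ndarray:
    # Mirror reflection: r = d - 2 (d . n) n
    return light - 2.0 * np.dot(light, normal) * normal

def refract(light: np.ndarray, normal: np.ndarray, n1: float, n2: float) -> np.ndarray:
    # Snell's law in vector form (assumes unit vectors, ignores total internal reflection).
    r = n1 / n2
    cos_i = -np.dot(normal, light)
    sin_t2 = r * r * (1.0 - cos_i * cos_i)
    cos_t = np.sqrt(1.0 - sin_t2)
    return r * light + (r * cos_i - cos_t) * normal

sunlight = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)  # coming in from the left, angled down
surface_normal = np.array([0.0, 1.0, 0.0])          # water surface facing up

print(reflect(sunlight, surface_normal))             # direction of the glare on the sill
print(refract(sunlight, surface_normal, 1.0, 1.33))  # bent ray inside the water (n of water is about 1.33)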

An even more fundamental structural difference is that OpenAI's current system suffers from a pronounced "text information bottleneck". When you invoke image generation in ChatGPT, your short instruction is usually rewritten by GPT into a detailed prompt and only then forwarded to the image-generation model.

This step appears to enrich the details, but in practice it introduces noise. Text, as a one-dimensional, linear carrier of information, is inherently weak at describing three-dimensional spatial relationships, topological structure, and complex object-attribute bindings. The rewriting can easily bury the key constraints of the original intent under decorative wording, causing information loss.
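
The contrast being drawn here can be summarized as two stages versus one. The sketch below is purely schematic; every function in it is a hypothetical stand-in, not a real OpenAI or Google API:

# Schematic contrast between the two architectures described in this article.
# Every function here is an invented placeholder.

def language_model_rewrite(prompt: str) -> str:
    # Stand-in for the rewriting step: pads the request with decorative wording,
    # which is where key conditions can get buried.
    return f"A beautiful, highly detailed photo. {prompt} Cinematic lighting, 8k."

def image_model_generate(prompt: str) -> str:
    return f"<image rendered only from text: {prompt!r}>"

def modular_pipeline(user_prompt: str) -> str:
    # Two stages: short text -> longer text -> image. The image model sees words only.
    return image_model_generate(language_model_rewrite(user_prompt))

def native_multimodal(user_prompt: str) -> str:
    # One stage: the request maps straight to a joint latent that (per the article)
    # carries semantic, spatial, and physical attributes together.
    latent = {"semantics": user_prompt, "space": "...", "physics": "..."}
    return f"<image rendered from joint latent: {latent}>"

print(modular_pipeline("three apples: bitten, wet, rotten"))
print(native_multimodal("three apples: bitten, wet, rotten"))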

Moreover, Chinese text is a nightmare for large image-generation models. GPT-4o has long been a "chaos generator" when it comes to rendering text: even if you ask it to write "OpenAI", it may produce "OpanAl" or a string of strange symbols.

I asked GPT-4o to generate a sign for Letter List based on the Letter List logo.

The Nano Banana Pro, on the other hand, offers precise control over text. Given the same prompt, it extracts the Letter List lettering at the top, the A and Z on either side, and the arc at the bottom, and places these elements on separate layers with different materials.

The Nano Banana Pro uses a natively multimodal framework: a single, unified model.

The user's input is mapped directly, inside the model, to a high-dimensional vector that carries semantic, spatial, and physical attributes; there is no "text-to-image" translation step in between. This end-to-end mapping is like an architect building straight from the blueprint instead of relying on an interpreter's verbal relay, and it eliminates the entropy added at the intermediate stage.

However, this also creates another problem: the bar for writing prompts gets higher. Let's go back to the original three-apples prompt.

This is the prompt for GPT-4o. It is easy to read and simply describes the composition of the image.

And this is the prompt for the Nano Banana Pro. It looks like Python code: the generated image is controlled through functions and parentheses.
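
The prompt itself is not reproduced here, but a code-style prompt of the kind described might look something like this (a purely invented illustration; none of these function names are documented Nano Banana Pro syntax):

# Invented illustration of a "code-like" prompt for the three-apples scene.
scene = create_scene(layout="horizontal_row", count=3)

scene.add_object("apple_1", position="left",
                 attributes={"bite_mark": {"depth": "shallow", "side": "right"}})
scene.add_object("apple_2", position="center",
                 attributes={"water_droplets": {"density": "sparse", "refraction": True}})
scene.add_object("apple_3", position="right",
                 attributes={"rot": {"stage": "early", "oxidation": True}})

scene.set_lighting(direction="top_left", type="soft_daylight")
render(scene)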

When it comes to precisely controlled tasks such as counting, spatial arrangement, and attribute binding across multiple objects, the Nano Banana Pro excels. It keeps the attribute associations of different objects cleanly separated and avoids the "attribute leakage" common in diffusion models (for example, the red of one cup bleeding onto a blue cup).

Naturally, GPT-4o still has its niche. Its strengths are inference speed and the aesthetic intuition it acquired through RLHF (Reinforcement Learning from Human Feedback) optimization.

Because it skips the elaborate logical-reasoning steps, it generates images faster and better matches the mass-market taste for high saturation and dramatic lighting. For everyday scenarios where the goal is visual impact rather than logical rigor, GPT-4o remains an efficient choice.

But once the requirement shifts from "beautiful" to "accurate" and from "correlation" to "causality", the "think first, then act" pattern that the Nano Banana Pro represents has the edge. It trades some generation speed and flattering, filter-like polish for a faithful reproduction of physical logic.

02

An orange that grows south of the Huai River is an orange; planted north of the river, it becomes a bitter trifoliate orange. The Nano Banana Pro and GPT-4o differ because their developers, Google and OpenAI, have chosen two completely different development paths in AI.

Google has chosen the path of "native multimodality".

From the first day of training, text, images, video, and audio are all thrown into the same neural network to learn from. From Gemini's perspective, there is essentially no difference between them; it is all data. It never has to translate an image into text first and then understand the text.

It is like a person who has spoken Chinese, English, and French since childhood: all three languages coexist in their mind, and they never need to translate English into Chinese in order to think.

OpenAI, on the other hand, takes the path of "modular composition".

The logic is that specialists handle specific tasks: GPT-5 is responsible for language understanding and logical reasoning, GPT-4o for image generation, and Whisper for speech processing.

Each module does its own job well, and the modules are wired together via APIs. It is like a team of copywriters, designers, and programmers: each has their own task, and they coordinate through meetings and documents.
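
In code, this modular division of labor is essentially routing. The sketch below is hypothetical; the handler names are invented, and only the division of responsibilities comes from the description above:

# Hypothetical sketch of "modular composition": specialist models behind one router.

def handle_text(request: str) -> str:      # stands in for GPT-5: language and reasoning
    return f"[reasoning model] {request}"

def handle_image(request: str) -> str:     # stands in for GPT-4o image generation
    return f"[image model] {request}"

def handle_speech(request: str) -> str:    # stands in for Whisper: speech processing
    return f"[speech model] {request}"

ROUTES = {"text": handle_text, "image": handle_image, "speech": handle_speech}

def route(task_type: str, request: str) -> str:
    # Each expert works alone; coordination happens at the seams (API calls),
    # which is exactly where the text bottleneck described earlier appears.
    return ROUTES[task_type](request)

print(route("image", "three apples on a table"))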

There is no absolute right or wrong between these two paths, but they lead to completely different results.

Google's greatest advantage comes from YouTube, the world's largest video library with billions of hours of content. These videos are not static images but dynamic data containing time series, causal relationships, and physical change. Gemini has, in effect, grown up watching these videos.

In other words, Gemini has understood the basic workings of the physical world from birth: a glass that falls to the floor shatters; water poured into a glass settles into a level surface. It did not learn these things from text descriptions; it worked them out on its own by watching videos of the real world.

So if you ask the Nano Banana Pro to draw "the moment a glass falls from a table", it will not draw a rigid glass floating in mid-air. It will draw the tilt of the glass as it falls, the shape of the water splashing inside it, even the disturbance of the air around the glass just before impact. It has seen so many such scenes that it understands how the real world works.