Nano Banana: the one thing OpenAI can't learn.
Altman sent an internal memo to all OpenAI employees, candidly stating that although OpenAI still holds the lead, Google is closing the gap. He also admitted that it is Google's recent string of product releases that has put considerable pressure on OpenAI.
As Altman said, this time Google has brought not only the well-received Gemini 3 Pro, but also Nano Banana Pro, which has shocked the entire AIGC community. Before this, the underlying logic of every image-generation model was to imitate the world: trawl a massive database, find the images closest to the description, and stitch them together for you.
The emergence of Nano Banana Pro has completely broken this rule. It is not "drawing pictures" but "simulating the physical world". Its greatest breakthrough lies in the introduction of the Chain of Thought reasoning mechanism, which allows the model to think first and then draw pictures.
Before placing the first pixel, the model first performs logical deduction in latent space: it counts the objects, works out the projection angle of light and shadow, and plans the spatial nesting relationships. It no longer relies on text as an intermediary; the reasoning results guide pixel generation directly, in the form of high-dimensional vectors.
Then the question arises: Why can't OpenAI develop Nano Banana Pro?
01
Before answering that question, let's take a closer look at Nano Banana Pro and see how it differs from GPT-4o, the model OpenAI currently relies on for image generation.
Take the task of generating "three apples" as an example. The prompt is: "The left apple has a bite mark, the middle apple has water droplets on it, and the right apple is rotten." Faced with this instruction, GPT-4o quickly produces a brightly colored, perfectly composed image.
Check the details, however, and the flaws of probabilistic generation show through. The water droplets on the middle apple are arranged in ways that defy physical law, and the rot on the right apple looks too deliberate.
By contrast, Nano Banana Pro's output not only gets the object count exactly right but also binds the correct attribute to each object: the bite on the left, the light refracting through the droplets in the middle, and the oxidized texture on the right are all rendered faithfully.
Behind this apparent difference are two completely different technical paths.
GPT-4o's generation mechanism is essentially built on statistical correlation. It retrieves the visual features of "apple + bite mark" from a massive amount of training data and splices and blends them according to probability distributions. It does not truly understand the numerical concept of "three", nor does it build a physical model of "rot"; it only performs approximate matching based on feature distances in a high-dimensional space.
Nano Banana Pro introduces the Chain-of-Thought (CoT) mechanism, upgrading the image generation process from simple "pixel prediction" to "logical deduction". Before placing the first pixel, the model has already completed a round of symbolic planning internally: first establish the entity objects (Object 1, 2, 3), then assign spatial coordinates, and finally bind physical attributes.
For the "bite mark", it deduces the change in geometric form; for the "water droplets", it calculates the physical laws of optical reflection and refraction; for the "rot", it simulates the evolution of material properties. This is a full - link closed - loop from semantic understanding to logical planning and then to execution and generation.
This mechanism shows its advantages especially when dealing with complex scenarios involving physical laws.
The prompt is "A half - glass of water on the windowsill, with sunlight shining in from the left."
The picture GPT-4o generates is visually plausible but physically self-contradictory. With light coming in from the left, the glass should cast reflected sunlight onto the left side of the windowsill, yet the picture shows only light refracted off to the right.
Nano Banana Pro first works out the light-source vector, then deduces the direction of the cast shadow and how light refracts through the liquid medium. This kind of reasoning grounded in physical common sense makes the result not a pile of visual elements but a digital simulation of the physical world.
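As a concrete illustration of the kind of light-source reasoning described here, the sketch below takes an incoming sunlight direction and computes a mirror-reflection direction plus a Snell's-law refraction direction. It is a textbook calculation standing in for whatever the model does internally, not a description of Nano Banana Pro's actual code.

```python
import numpy as np

def reflect(incident: np.ndarray, normal: np.ndarray) -> np.ndarray:
    # Mirror reflection: r = d - 2 (d . n) n, with n a unit surface normal.
    return incident - 2.0 * np.dot(incident, normal) * normal

def refract(incident: np.ndarray, normal: np.ndarray, n1: float, n2: float) -> np.ndarray:
    # Snell's law in vector form; assumes unit-length incident vector and normal.
    eta = n1 / n2
    cos_i = -np.dot(normal, incident)
    sin_t2 = eta ** 2 * (1.0 - cos_i ** 2)
    cos_t = np.sqrt(1.0 - sin_t2)              # total internal reflection not handled here
    return eta * incident + (eta * cos_i - cos_t) * normal

sunlight = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)   # coming from the left, angled downward
water_surface_normal = np.array([0.0, 1.0, 0.0])     # flat water surface facing up

print(reflect(sunlight, water_surface_normal))               # direction of the specular highlight
print(refract(sunlight, water_surface_normal, 1.0, 1.33))    # ray bending as it enters the water
```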
A more profound architectural difference is that OpenAI's current system has a significant "Text Information Bottleneck". When you use the drawing function inside ChatGPT, your short instruction is usually rewritten by GPT into a long, detailed prompt, which is then handed to the image-generation model.
This process appears to enrich the details, but in practice it introduces noise. Text is a one-dimensional, linear carrier of information, and it has inherently low bandwidth for describing three-dimensional spatial relationships, topological structure, and the binding of attributes to multiple objects. The rewriting step can easily bury the key constraints of the original intent under decorative language, so information is transmitted lossily.
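A minimal sketch of the two-stage pipeline being described, with hypothetical function names (rewrite_prompt, text_to_image) standing in for the real components; the point is only that the interface between the stages is a flat string.

```python
def rewrite_prompt(user_request: str) -> str:
    # Stage 1: a language model pads the short request into a long, decorative prompt.
    # Anything it fails to restate explicitly (exact counts, left/right layout,
    # which attribute belongs to which object) is silently lost at this interface.
    return f"A beautiful, highly detailed photo of {user_request}, cinematic lighting, 8k"

def text_to_image(detailed_prompt: str) -> bytes:
    # Stage 2: a separate image model sees only the rewritten string, never the
    # user's original words, so it inherits whatever the text bottleneck dropped.
    return b"<image bytes>"   # placeholder for the generated image

image = text_to_image(rewrite_prompt(
    "three apples: left one bitten, middle one covered in droplets, right one rotten"))
```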
On top of that, Chinese characters are a nightmare for large image-generation models. For a long time, GPT-4o was a generator of garbled text: ask it to write "OpenAI" and it might produce "OpanAl" or a string of strange symbols.
I asked GPT-4o to generate a signboard for Alphabet List, using the Alphabet List logo as a reference.
Nano Banana Pro, by contrast, achieves precise control over text. Given the same prompt, it picks out the "Alphabet List" wordmark at the top, the A and Z on the left and right, and the arc at the bottom, and places these elements on separate layers with different materials.
Nano Banana Pro adopts a Native Multimodal architecture, which is a unified model solution.
The user's input is mapped directly into high-dimensional vectors inside the model that carry semantic, spatial, and physical attributes together, with no "text-image" translation intermediary. This end-to-end mapping is like an architect building straight from the blueprint instead of from a translator's verbal relay, which eliminates the entropy added by the intermediate step.
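A rough sketch of that end-to-end idea, purely illustrative: the request is encoded straight into a single structured representation that conditions pixel generation, with no natural-language rewrite in between. The class and function names are made up for this example and do not describe Gemini's actual architecture.

```python
from dataclasses import dataclass

@dataclass
class SceneVector:
    semantics: list[float]   # what the objects are
    layout: list[float]      # where they sit relative to each other
    physics: list[float]     # lighting, materials, state of matter

def encode_request(user_request: str) -> SceneVector:
    # One shared encoder produces the whole conditioning vector; there is no
    # intermediate natural-language rewrite for constraints to leak through.
    dim = 512
    return SceneVector([0.0] * dim, [0.0] * dim, [0.0] * dim)   # placeholder values

def generate_pixels(scene: SceneVector) -> bytes:
    # The decoder is conditioned on the structured vector itself.
    return b"<image bytes>"

image = generate_pixels(encode_request("a half-full glass of water, sunlight from the left"))
```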
But this also creates another problem: the bar for writing prompts goes up. Go back to the three-apples prompt from the beginning.
This is the prompt given to GPT-4o. It is simple and easy to understand, just a description of the picture's composition.
And this is the prompt for Nano Banana Pro. It reads like Python code, controlling the generated picture through function calls and parentheses.
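The original prompt isn't reproduced here, but a structured, code-like prompt in that style might look roughly like the following. This is a hypothetical illustration of the format, not the exact prompt used.

```python
# Hypothetical illustration of the code-like prompt style described above;
# the function-style syntax is part of the prompt text, not executable code.
prompt = """
create_scene(
    objects=[
        apple(position="left",   state="bite mark, exposed flesh"),
        apple(position="middle", state="water droplets on the skin"),
        apple(position="right",  state="rotten, oxidized patches"),
    ],
    lighting=light(type="sunlight", direction="left"),
    style=photo(realism="high", depth_of_field="shallow"),
)
"""
```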
In tasks demanding precise control, such as counting, spatial layout, and multi-object attribute binding, Nano Banana Pro excels. It can clearly keep track of which attribute belongs to which object, avoiding the "attribute leakage" common in diffusion models (such as rendering a red cup's color onto a blue cup).
Of course, GPT-4o still has its niche. Its advantages are inference speed and an aesthetic intuition tuned through RLHF (Reinforcement Learning from Human Feedback).
Because it skips the heavy logical-reasoning step, it generates faster and better matches the public's taste for high saturation and dramatic light and shadow. For everyday scenarios that prize visual impact over logical rigor, GPT-4o remains an efficient choice.
But when the demand shifts from "good-looking" to "accurate", from "correlation" to "causality", the "think first, then execute" mode that Nano Banana Pro represents becomes a game-changer. It trades some generation speed and eye-pleasing polish for a faithful restoration of physical logic.
02
As the old saying goes, an orange grown south of the Huai River is an orange; grown north of it, it becomes a bitter trifoliate orange. The gap between Nano Banana Pro and GPT-4o exists because their developers, Google and OpenAI, have chosen two completely different directions on the road to AI.
Google has chosen the path of "Native Multimodality".
From the first day of training, text, images, video, and audio are mixed together and fed into the same neural network to learn from. In Gemini's eyes these things are essentially the same: they are all just data. It does not need to translate a picture into text first and then understand the text.
This is like a person who can speak Chinese, English, and French from childhood. These three languages exist simultaneously in his mind, and he doesn't need to translate English into Chinese before thinking.
OpenAI, on the other hand, has taken the path of "modular splicing".
Its logic is to let specialists do specialist work: GPT-5 handles language understanding and logical reasoning, GPT-4o handles image generation, and Whisper handles speech.
Each module is polished on its own, and they are then wired together through APIs. It is like a team of copywriters, designers, and programmers, each doing their own job and collaborating through meetings and documents.
There is no absolute right or wrong in these two paths, but they will lead to completely different results.
Google's greatest advantage comes from YouTube. This is the world's largest video library, containing billions of hours of video content. These videos are not static pictures but dynamic data containing time series, causal relationships, and physical changes. Gemini has "grown up watching these videos" from the very beginning.
In other words, Gemini has understood the basic operating logic of the physical world since birth: a cup shatters when it hits the floor, water settles into a level surface when poured into a glass. It did not learn these things from text descriptions; it worked them out for itself by watching footage of the real world.
So when you ask Nano Banana Pro to draw "the moment a cup falls off the table", it won't draw a cup floating in the air with a rigid posture. It will draw the tilt angle of the cup during the fall, the shape of the splashing water in the cup, and even the disturbance of the air around the cup when it is about to touch the ground. Because it has seen too many such scenarios and knows how the real world works.
Beyond YouTube, Google has another moat: OCR. Google has been doing optical character recognition for decades, and from Google Books to Google Lens it has accumulated the world's largest corpus of aligned image-text pairs. This translates directly into Gemini's overwhelming advantage in text rendering.
It knows what Chinese characters should look like inside an image and how text should appear across different fonts, sizes, and arrangements. This is why Nano Banana Pro can render Chinese characters accurately.
OpenAI, by contrast, started from text. From GPT-1 to GPT-3 and on to GPT-5, its language models have advanced rapidly and are genuinely world-class. Its visual ability, however, was bolted on later.
DALL-E was developed independently in the early days, and its training data came mainly from static images crawled from the internet, in datasets like Common Crawl. The quality of those images is uneven, and they are all static: no time dimension, no physical process, no causal relationships.
So what DALL-E learned is mostly "what this thing usually looks like", not "why it looks this way" or "how it will change". It can draw a very beautiful cat, but it does not understand the cat's skeleton, how its muscles move, or what posture it takes when it jumps. It has simply seen a great many cat photos and concluded that "a cat looks like this".
More importantly, there is a difference in the training methods.
Because OpenAI took the RLHF route, it hired large numbers of human annotators to score generated pictures: "Is this one better looking?" "Does this one better meet the requirements?" When choosing, annotators naturally gravitate toward pictures with bright colors, flawless composition, smooth skin, and dramatic light and shadow.
The result is that GPT-4o has been trained into a people-pleasing painter. It has learned how to draw eye-catching pictures, how to use high contrast and saturated colors to grab attention, and how to make skin as smooth as porcelain. The price is physical realism.
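As a concrete illustration of the preference-scoring step described above, here is a minimal sketch of how a reward model is typically trained from pairwise annotator choices (a Bradley-Terry style loss). It shows the standard recipe in general, not OpenAI's actual training code; whatever the annotators systematically favor is exactly what the generator is later optimized toward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps image features to a scalar 'how much annotators would like this' score."""
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.head(image_features).squeeze(-1)

def preference_loss(model: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Push the image the annotator preferred to score higher than the one they rejected.
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

model = RewardModel()
chosen = torch.randn(8, 512)    # features of the preferred images (stand-in data)
rejected = torch.randn(8, 512)  # features of the rejected images (stand-in data)
loss = preference_loss(model, chosen, rejected)
loss.backward()                 # the generator is later tuned to maximize this learned reward
```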