The era of Visual GPT has arrived. DeepMind uses Vision Banana to prove that "generation equals understanding," with Kaiming He and Saining Xie both involved.
A couple of days ago, OpenAI's ChatGPT Images 2.0 amazed the world. Its performance in actual tests has generally exceeded that of the previous SOTA, Nano Banana Pro.
While people were still in awe of the excellent capabilities of AI image generation, Google DeepMind released a heavyweight paper titled "Image Generators are Generalist Vision Learners", which systematically proved the intuition that many people had before: Image generators are powerful generalist vision learners.
Why rely on dedicated models to understand the physical world?
Paper title: Image Generators are Generalist Vision Learners
Paper address: https://arxiv.org/abs/2604.20329v1
Project address: https://vision-banana.github.io/
Google DeepMind's research found that, similar to how generative pre-training enables LLMs to develop language understanding and reasoning abilities, image generation training can enable models to learn powerful and general visual representations, thus achieving SOTA performance in various visual tasks.
Based on this discovery, they also built a generalist model, Vision Banana, on top of Nano Banana Pro and achieved remarkable results, matching or even surpassing zero-shot domain-expert models such as Segment Anything Model 3 for segmentation tasks and the Depth Anything series for depth estimation.
A tweet shared by the author, Shangbang Long
This research is of great significance as it shows that image generation can serve as a unified general interface for visual tasks. DeepMind also stated in the paper: "We may be witnessing a major paradigm shift in the field of computer vision, where generative visual pre-training plays a central role in building fundamental visual models that support both generation and understanding."
This paper was jointly completed by multiple core authors and contributors, among whom we can see familiar names such as Saining Xie and Kaiming He. Saining Xie posted several tweets emphasizing the rise and superiority of generalist models: a single multimodal generalist model like Vision Banana has, for the first time, beaten top domain-specific models such as SAM 3 and Depth Anything V3 on low-level perception tasks like image segmentation and edge detection. Perception tasks that were previously treated as separate problems can now be handled within a unified system using simple prompts.
Now let's take a detailed look at this heavyweight research result.
Research background: The conjecture that generation implies understanding has a long history
In the field of AI research, there has long been an intuition that a model capable of creating visual content should also be able to understand visual content. After all, if a model cannot deeply understand the shape, semantics, and spatial relationships of objects, how can it generate such high-fidelity and semantically accurate images?
However, reality has long deviated from this intuition. For a long time, the mainstream methods in visual representation learning have not come from the generative-modeling family; instead, supervised discriminative learning, contrastive learning, bootstrapping, and auto-encoding methods have been dominant. Although early explorations of generative visual pre-training showed promising scalability, their results consistently lagged behind non-generative models.
In the field of natural language processing, this situation has long been broken.
The GPT series of models have proven that generative pre-training (i.e., making the model predict the next token) can enable LLMs to develop powerful language understanding and reasoning abilities. After instruction fine-tuning, the models can achieve SOTA performance in various tasks.
The researchers at DeepMind couldn't help but wonder: Can image generation play a role similar to text generation? Are image generators generalist vision learners?
Core method: "Disguise" all visual understanding tasks as drawing tasks
The Vision Banana proposed in this paper is based on the image generation model named Nano Banana Pro (NBP).
The research team did not add any complex network structures for visual understanding (such as detection or segmentation) to this generation model, nor did they modify the underlying architecture. Their method is ingeniously simple: parameterize the output space of all visual perception tasks into the RGB image format. Specifically, they mixed a small amount of visual-task data into the original image generation training data for lightweight instruction fine-tuning.
To teach the model to understand instructions and directly "draw" the results of visual tasks, Vision Banana implements image-based output decoding. For example, in semantic segmentation, the prompt might specify "draw the skateboard in pure yellow <255, 255, 0>", and the model directly generates an RGB image with a color mask. Simply extracting the pixels of the corresponding color then recovers the segmentation result.
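As an illustration of this decoding step, here is a minimal sketch (not the paper's actual code) of how a color-coded mask could be recovered from a generated image; the per-channel tolerance is an illustrative assumption to absorb generation noise:

```python
import numpy as np

def decode_color_mask(rgb_image, target_color=(255, 255, 0), tol=30):
    """Recover a binary segmentation mask from a generated RGB image by
    selecting pixels close to the color the prompt asked for.

    tol is a per-channel tolerance (an illustrative value), since the
    generator's pixels are rarely an exact color match.
    """
    diff = np.abs(rgb_image.astype(np.int16) -
                  np.asarray(target_color, dtype=np.int16))
    return np.all(diff <= tol, axis=-1)  # H x W boolean mask

# Toy 2x2 "image": two pixels near pure yellow <255, 255, 0>, two not.
img = np.array([[[250, 252, 8], [0, 0, 0]],
                [[255, 255, 0], [10, 200, 30]]], dtype=np.uint8)
mask = decode_color_mask(img)
```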
When performing 3D depth estimation, they designed a strictly invertible mathematical mapping (using a power-law transformation) to map metric depth, ranging from 0 to infinity in the physical world, onto the edges of the RGB color cube. The model outputs a gradient "pseudo-color map" that can be decoded directly into accurate physical depth distances.
Through this draw-to-solve approach, a single unified Vision Banana model has beaten or tied a number of today's top specialist models across 2D and 3D visual understanding tasks.
Exquisite color mapping for depth estimation
Among all visualization schemes, the RGB encoding design for depth estimation is the most elaborate and worthy of separate discussion.
The range of depth values is [0, ∞), while the range of RGB values is bounded at [0, 1]^3. Establishing a bijection between the two is the core challenge in engineering design.
The researchers used a power transformation to "bend" the depth values, mapping raw depth to a normalized distance in the [0, 1) interval, and then performed linear interpolation along the edges of the RGB cube: this path resembles the first iteration of a three-dimensional Hilbert curve, traversing the cube's edges from black to white. Since both the power transformation and the linear interpolation are strictly invertible, the entire mapping forms a perfect bijection from metric depth to RGB space, and the color image produced at inference time can be losslessly decoded back to exact metric depth values.
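A minimal sketch of such a bijection follows. The specific corner ordering along the cube-edge path, the power exponent, and the depth cap are illustrative assumptions; the paper's exact parameterization may differ.

```python
import numpy as np

# 7-edge Hamiltonian path over the RGB cube's corners, black to white
# (an assumed ordering; any black-to-white edge path works the same way).
CORNERS = np.array([
    [0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
    [0, 1, 1], [0, 0, 1], [1, 0, 1], [1, 1, 1],
], dtype=float)
GAMMA = 0.5  # exponent < 1 spends more of the color range on nearby depths

def depth_to_rgb(d, d_max=100.0):
    """Squash depth into [0, 1] with a power law, then interpolate along
    the cube-edge path to get an RGB triplet in [0, 1]^3."""
    t = (min(max(d, 0.0), d_max) / d_max) ** GAMMA
    s = t * 7.0                 # position along the 7 edges
    i = min(int(s), 6)          # which edge we are on
    f = s - i                   # fraction along that edge
    return (1 - f) * CORNERS[i] + f * CORNERS[i + 1]

def rgb_to_depth(rgb, d_max=100.0):
    """Exact inverse: find the edge the point lies on, recover the edge
    fraction, then invert the power law."""
    rgb = np.asarray(rgb, dtype=float)
    for i in range(7):
        a, b = CORNERS[i], CORNERS[i + 1]
        axis = int(np.argmax(np.abs(b - a)))   # the single varying channel
        f = (rgb[axis] - a[axis]) / (b[axis] - a[axis])
        fixed = [k for k in range(3) if k != axis]
        if 0.0 <= f <= 1.0 and np.allclose(rgb[fixed], a[fixed]):
            t = (i + f) / 7.0
            return (t ** (1.0 / GAMMA)) * d_max
    raise ValueError("RGB point does not lie on the encoding path")
```

Because every step is invertible, round-tripping a depth through color space recovers it exactly, and with GAMMA < 1 nearby depths occupy a disproportionately large share of the path, mirroring the near-field emphasis described below.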
In addition, the research team deliberately gave near-field objects higher color resolution, because for applications such as robotic manipulation and depth sensing, accurate measurement of nearby objects is often more critical than distant views.
Surface normal estimation
Compared with depth, the visualization scheme for surface normals is much more natural. A surface normal consists of three components (x, y, z) in the range [-1.0, 1.0], which aligns naturally with the three RGB color channels. The researchers used a right-handed coordinate system (+x to the right, +y upward, +z outward) and mapped the three components directly to the R, G, and B channels: surfaces facing left appear pinkish, those facing upward appear light green, and those facing the camera appear light blue/purple.
This inherent alignment makes normal vector estimation almost require no additional design and can directly use the native capabilities of the generative model.
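In code, this component-to-channel mapping might look like the following sketch. The affine map and its sign conventions are assumptions on my part, since normal-map visualization conventions vary between tools:

```python
import numpy as np

def normal_to_rgb(n):
    """Map a unit normal with components in [-1, 1] to RGB in [0, 1] via
    c = (n + 1) / 2, one channel per axis (an illustrative convention)."""
    return (np.asarray(n, dtype=float) + 1.0) / 2.0

def rgb_to_normal(rgb):
    """Inverse map; re-normalize to absorb quantization error in the image."""
    n = np.asarray(rgb, dtype=float) * 2.0 - 1.0
    return n / np.linalg.norm(n)

# Under this convention, a camera-facing surface (+z outward) maps to a
# light blue (0.5, 0.5, 1.0).
camera_facing = normal_to_rgb([0.0, 0.0, 1.0])
```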
Experimental results: Comprehensive surpassing of zero - shot expert models
2D understanding: Segmentation tasks
In terms of semantic segmentation, Vision Banana outperformed SAM 3 on the Cityscapes dataset (19 classes of urban scenes) with an mIoU of 0.699 versus 0.652, leading all zero-shot transfer methods and further narrowing the gap with closed-set dedicated models (such as SegMan-L).
In terms of instance segmentation, Vision Banana adopts a "class-by-class inference" strategy to handle the unknown number of instances: each inference pass targets a single class, letting the model dynamically assign colors to different instances; each instance mask is then decoded via color clustering. On the SA-Co/Gold dataset, Vision Banana's pmF1 is 0.540, essentially on par with DINO-X (0.552) and far above methods such as Gemini 2.5 (0.461) and OWLv2 (0.420).
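A hedged sketch of how such color-clustering decoding could work; the coarse color quantization and the minimum-pixel threshold are illustrative choices, not the paper's actual pipeline:

```python
import numpy as np

def decode_instances(rgb_image, background=(0, 0, 0), min_pixels=5):
    """Recover per-instance binary masks from a color-coded RGB image by
    grouping pixels with (near-)identical colors.

    Quantizing each channel into 16-level bins (an illustrative choice)
    absorbs small color noise in the generated image.
    """
    quant = rgb_image // 16
    colors, inverse = np.unique(quant.reshape(-1, 3), axis=0,
                                return_inverse=True)
    inverse = np.asarray(inverse).reshape(-1)
    bg = np.asarray(background) // 16
    masks = {}
    for idx, color in enumerate(colors):
        if np.array_equal(color, bg):
            continue  # skip the background color
        mask = (inverse == idx).reshape(rgb_image.shape[:2])
        if mask.sum() >= min_pixels:  # drop tiny speckle clusters
            masks[tuple(int(v) * 16 for v in color)] = mask
    return masks

# Toy image: two solid-colored instances on a black background.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2] = (200, 0, 0)   # instance A: 8 reddish pixels
img[:2, 2:] = (0, 200, 0)  # instance B: 4 greenish pixels
masks = decode_instances(img, min_pixels=3)
```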
Referring expression segmentation is the task that best reflects the deep integration of language and vision: the model must understand free-form natural-language queries and accurately segment the corresponding targets.
Vision Banana performed particularly well in this task: it achieved a cIoU of 0.738 on the RefCOCOg dataset (UMD validation set) and a gIoU of 0.793 on the ReasonSeg validation set, both surpassing SAM 3 Agent (0.734 / 0.770). Even more surprisingly, when combined with Gemini 2.5 Pro, Vision Banana can surpass some non-zero-shot methods on ReasonSeg that were fully trained on its training set. The researchers observed that the multimodal intelligence Vision Banana inherits from generative pre-training lets it reason more effectively about "what to segment," an advantage that discriminative models can hardly match.
3D understanding: Depth and normal vector estimation
Monocular metric depth estimation is a well-known hard problem in 3D understanding: 2D projection irreversibly discards three-dimensional geometric information, and the monocular setting, lacking multi-view disparity cues, is harder still. Existing SOTA methods (such as Depth Anything V3, UniK3D, and MoGe-2) usually need to introduce camera intrinsics during training or inference to resolve inherent ambiguities, and they rely on specially designed architectures and loss functions.
Vision Banana's strategy is completely different: it uses no camera parameters at all, during either training or inference, relying purely on the geometric priors about object size and distance relationships that the base model learned during large-scale image generation pre-training to infer absolute scale. More notably, all training data comes from synthetic rendering engines, no real-world depth data is used, and all real training data from the evaluation benchmarks is excluded.
Across six public benchmarks, Vision Banana's average δ_1 accuracy reached 0.882. On the four datasets directly comparable to Depth Anything V3 (NYU, ETH3D, DIODE-indoor, KITTI), its average δ_1 was 0.929, exceeding Depth Anything V3's 0.918. It leads UniK3D by nearly 6 percentage points, and its absolute relative error (AbsRel) is about 20% lower than MoGe-2's.
The researchers also ran a rather convincing vibe test: one of the paper's authors took a photo near Kinkaku-ji with an ordinary smartphone. Vision Banana estimated the depth of the marked point in the photo at 13.71 meters, while the actual distance measured on Google Maps was 12.87 meters, an absolute relative error of only about 0.065.
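The reported error is easy to verify with a small illustrative calculation (this is not the paper's evaluation code); the snippet also checks the prediction against the standard δ_1 criterion, which counts a depth as correct when max(pred/gt, gt/pred) < 1.25:

```python
# Check the Kinkaku-ji vibe-test arithmetic from the article.
pred, gt = 13.71, 12.87            # predicted vs. measured distance (meters)
abs_rel = abs(pred - gt) / gt      # absolute relative error, ~0.065

# Standard delta_1 test used in monocular depth benchmarks.
delta_1_ok = max(pred / gt, gt / pred) < 1.25
```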
In terms of surface normal estimation, Vision Banana achieved the lowest mean and median angular errors averaged over four public indoor-scene benchmarks and was comparable to Lotus-2 on outdoor scenes. Qualitative comparisons show that Vision Banana's normal maps have noticeably better visual fidelity and detail than Lotus-2's; even on the outdoor dataset (Virtual KITTI 2), where its quantitative metrics are slightly lower, the visual quality remains superior.
Verification of generation ability
Will lightweight instruction fine-tuning damage Nano Banana Pro's original image generation ability?
The research team conducted a human preference evaluation on two benchmarks, GenAI-Bench (text-to-image generation) and ImgEdit (image editing). Vision Banana's win rate against Nano Banana Pro was 53.5% and 47.8%, respectively (see Figure 1).
This result clearly shows that after instruction fine-tuning, Vision Banana's generation ability remains essentially on par with the base model: proficient in understanding, without forgetting how to generate.
A paradigm shift is taking place
The significance of this research lies not only in a set of remarkable benchmark numbers but also in the fact that it proposes and systematically verifies two profound assertions.
First, image generators are generalist vision learners. By analogy with generative pre-training in the LLM field, the visual priors that models learn through image generation training serve not only generation tasks but have also been internalized into general visual understanding abilities. These generative priors can even surpass the dedicated architectures and training paradigms carefully designed for specific tasks.
Second, image generation is a general interface for visual tasks. Just as text generation unifies tasks as varied as language understanding, reasoning, mathematics, code, and agents, image generation can serve as a single output interface through which diverse visual tasks are expressed and solved.