
Google DeepMind has launched Vision Banana: image generators are generalist visual learners. Kaiming He and Saining Xie are among the authors.

Published 2026-04-24 19:54 (author account since deleted)
Vision Banana: Generation is Understanding

For a long time, the mainstream representation-learning methods in computer vision, such as supervised discrimination, contrastive learning, self-bootstrapping, and auto-encoding, have had almost nothing to do with generative modeling. Although early generative visual pre-training improved with scale, its overall performance has consistently lagged behind non-generative methods.

Meanwhile, image and video generation models have demonstrated remarkable synthesis capabilities over the past year and have occasionally shown signs of zero-shot visual understanding. A long-standing conjecture has therefore attracted renewed attention: can a model that can "create" visual content also "understand" it? Previous attempts either struggled to make the generative model output quantifiable results on instruction, or required adding special modules and full fine-tuning, sacrificing generality.

To answer this question, the Google DeepMind team launched Vision Banana, a general visual model built on Nano Banana Pro (NBP) through lightweight instruction fine-tuning. Notably, researchers such as Kaiming He and Saining Xie are among the authors. The work represents, to some extent, the team's latest judgment on the direction of general visual foundation models.

Paper link: https://arxiv.org/pdf/2604.20329

The core conclusion is straightforward: by mixing a very small proportion of visual-task data into NBP's original training data and uniformly re-parameterizing the outputs of all visual tasks as RGB images, the model can match or exceed dedicated models such as SAM 3, Depth Anything 3, and Lotus-2 on multiple 2D and 3D visual-understanding benchmarks, while retaining its original image generation ability.

Vision Banana: Generation is Understanding

The method of Vision Banana is inspired by the training paradigm of large language models (LLMs). In natural language processing, generative pre-training produces a "foundation model", and instruction fine-tuning teaches the model to generate text in the format a given instruction specifies. The research team applied this idea to vision: an image generation model serves as the foundation model, and instruction fine-tuning teaches it to produce visual outputs in the format the prompt specifies.

Figure | The research team revealed the potential visual understanding ability of the image generator through instruction fine-tuning of Nano Banana Pro. The instruction-tuned model Vision Banana can generate visual results in an accurate format, thus supporting evaluation on mainstream benchmarks.

1. Reconstruct Visual Tasks as Image Generation

This is the core innovation of the entire method. Whether the output is a segmentation mask, a depth map, or surface normals, visual-task outputs are uniformly parameterized as RGB images. The specific approach is to design a "decodable visualization scheme": the generated results are recognizable to the human eye and can be decoded back into physical quantities or semantic labels through explicit rules.

Taking semantic segmentation as an example, the prompt given to the model is "Segment the skateboard category in pure yellow <255, 255, 0>". During evaluation, all pixels close to <255, 255, 0> are simply clustered to obtain the skateboard mask.
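The decoding step just described can be sketched as a nearest-color threshold. This is an illustrative sketch only; the paper's exact clustering rule is not specified, and the threshold value here is an assumption:

```python
import numpy as np

def decode_color_mask(rgb_image: np.ndarray,
                      target_rgb=(255, 255, 0),
                      threshold=60.0) -> np.ndarray:
    """Return a boolean mask of pixels whose color lies within `threshold`
    (Euclidean distance in RGB space) of `target_rgb`."""
    diff = rgb_image.astype(np.float32) - np.asarray(target_rgb, np.float32)
    dist = np.linalg.norm(diff, axis=-1)
    return dist < threshold

# Toy example: a 2x2 "generated" image where one pixel is nearly pure yellow.
img = np.array([[[250, 250, 5], [0, 0, 0]],
                [[10, 10, 10], [255, 0, 0]]], dtype=np.uint8)
mask = decode_color_mask(img)  # only the top-left pixel is selected
```

In practice the evaluation pipeline would run this per queried category, with one distinct target color per category in the prompt.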

This strategy brings three key advantages: a single model supports multiple tasks, since only the prompt needs to change, not the weights; the demand for new training data is extremely low, because instruction fine-tuning mainly teaches the model how to format visual results as RGB outputs; and the original image generation ability is preserved, because the output is still, in essence, an RGB image.

2. Lightweight Instruction Fine-Tuning Strategy

The research team mixed visual-task data into Nano Banana Pro's original training data at a very low proportion for joint training. The low mixing ratio ensures that aligning the model to visual tasks does not disrupt its existing generative priors.

The 2D task suite includes referring-expression segmentation, semantic segmentation, and instance segmentation; the 3D tasks focus on monocular metric depth estimation and surface normal estimation. For training data, the 2D tasks use annotations produced by an internal model on web images, and the 3D tasks use synthetic data produced by a rendering engine.

Crucially, the training data corresponding to all evaluation benchmarks is excluded from the instruction-tuning mixture, so the results more faithfully reflect the model's general zero-shot generalization ability.

3. A Bijective Mapping from Depth Values to RGB

Depth estimation is the most technically detailed part of the paper. Depth values lie in [0, ∞), while RGB values lie in [0, 1]^3; the core problem is how to establish a reversible mapping between the two.

The research team first applies a power transform to the depth values, increasing resolution at short distances while compressing it at long distances, which matches the intuition that nearby objects matter more in tasks such as robotic grasping. The normalized values are then linearly interpolated piecewise along the edges of the RGB cube, similar to the first iteration of a 3D Hilbert curve.

Since both transformations are strictly invertible, their composition forms a bijective mapping from [0, ∞) into [0, 1]^3. During training, ground-truth depth is mapped to RGB as the supervision target; during inference, the output is decoded in reverse to recover metric depth.
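The depth-to-RGB pipeline can be sketched end to end. Everything below is an illustrative reconstruction under stated assumptions: the exponent `GAMMA`, the exact form of the power transform, and the corner ordering of the cube path are all assumptions, not values from the paper:

```python
import numpy as np

# Corner order for the first iteration of a 3D Hilbert curve
# (one of several valid orderings; the paper's exact choice is not given).
CORNERS = np.array([
    [0, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 0],
    [1, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 1],
], dtype=np.float64)

GAMMA = 0.5  # power-transform exponent (assumed value)

def depth_to_unit(d):
    """Map metric depth in [0, inf) to [0, 1), spending more of the unit
    interval on near depths (assumed functional form)."""
    t = np.asarray(d, dtype=np.float64) ** GAMMA
    return t / (t + 1.0)

def unit_to_depth(u):
    """Exact inverse of depth_to_unit."""
    u = np.asarray(u, dtype=np.float64)
    return (u / (1.0 - u)) ** (1.0 / GAMMA)

def unit_to_rgb(u):
    """Piecewise-linear interpolation along the 7 cube edges."""
    s = np.clip(np.asarray(u), 0.0, 1.0 - 1e-9) * 7.0  # position on the path
    i = np.floor(s).astype(int)                        # edge index, 0..6
    f = s - i                                          # fraction along edge
    return (1.0 - f)[..., None] * CORNERS[i] + f[..., None] * CORNERS[i + 1]

def rgb_to_unit(rgb):
    """Invert one RGB point by projecting onto the nearest cube edge.
    Each edge has unit length, so the dot product is the edge fraction."""
    rgb = np.asarray(rgb, dtype=np.float64)
    a, b = CORNERS[:-1], CORNERS[1:]
    ab = b - a
    f = np.clip(((rgb[None, :] - a) * ab).sum(-1), 0.0, 1.0)
    pts = a + f[:, None] * ab
    i = np.argmin(((pts - rgb) ** 2).sum(-1))  # nearest edge wins
    return (i + f[i]) / 7.0

# Round trip on a few sample depths: depth -> RGB -> depth.
depths = np.array([0.5, 2.0, 10.0])
rgb = unit_to_rgb(depth_to_unit(depths))      # shape (3, 3), values in [0, 1]
recovered = unit_to_depth(np.array([rgb_to_unit(c) for c in rgb]))
```

The nearest-edge projection in `rgb_to_unit` also gives some robustness: a generated pixel that drifts slightly off the path still decodes to the closest valid depth.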

To improve robustness, the training data also includes various alternative color maps such as Plasma, Inferno, Viridis, and grayscale for enhancement. It is worth noting that this depth model is completely trained based on synthetic data without using any real - world depth data, and neither the training nor the inference process depends on the camera's internal and external parameters.

How Well Does It Work?

The research team comprehensively evaluated Vision Banana against expert models in each field on three types of tasks: 2D segmentation, 3D depth estimation, and surface normal estimation. The results are as follows:

Figure | The performance of Vision Banana in visual generation and understanding tasks after instruction fine-tuning.

2D Segmentation: On Cityscapes semantic segmentation, Vision Banana reaches 0.699 mIoU, 4.7 points above SAM 3's 0.652, making it the strongest open-vocabulary model. On RefCOCOg referring segmentation, it reaches 0.738 cIoU, exceeding SAM 3 Agent's 0.734. On ReasonSeg reasoning segmentation, combined with Google's Gemini 2.5 Pro, it reaches 0.793 gIoU, higher than SAM 3 Agent's 0.770 and above X-SAM and LISA, which were trained on the training set. Instance segmentation is the only slightly weaker item: 0.540 pmF1 on SA-Co/Gold, slightly below DINO-X's 0.552.
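The mIoU, cIoU, and gIoU numbers above are all variants of intersection-over-union aggregated in different ways; the per-mask quantity underneath them is the same. A minimal sketch of that base quantity:

```python
import numpy as np

def mask_iou(pred, gt) -> float:
    """Intersection-over-union between two boolean masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: conventionally a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

# Two 2x2 masks sharing one pixel: intersection 1, union 3.
iou = mask_iou([[1, 1], [0, 0]],
               [[1, 0], [0, 1]])  # -> 1/3
```

mIoU averages this over categories, cIoU pools intersections and unions across a dataset before dividing, and gIoU in ReasonSeg averages per-image IoUs.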

Table | The comparison results between Vision Banana and the SOTA methods on each segmentation dataset.

3D Depth Estimation: The average δ1 accuracy across 6 mainstream benchmarks reaches 0.882, nearly 6 points above UniK3D, and AbsRel is about 20% lower than MoGe-2's. On the four datasets used in the Depth Anything 3 evaluation (NYU, ETH3D, DIODE, KITTI), Vision Banana's average δ1 is 0.929, better than Depth Anything 3's 0.918.
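δ1 and AbsRel are the standard monocular-depth metrics: δ1 is the fraction of pixels whose prediction is within a factor of 1.25 of ground truth, and AbsRel is the mean relative error. A minimal sketch of the usual definitions:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics: (delta_1 accuracy, AbsRel)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > 0                      # ignore pixels with no ground truth
    pred, gt = pred[valid], gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = float((ratio < 1.25).mean())            # within 25% of GT
    abs_rel = float((np.abs(pred - gt) / gt).mean()) # mean relative error
    return delta1, abs_rel

# Three pixels: exact, 25% off (just fails delta_1), and 2x off.
d1, rel = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.5, 2.0])
```

Higher δ1 and lower AbsRel are better, which is why the article reports δ1 going up and AbsRel going down.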

Table | Monocular metric depth estimation results under the zero-shot transfer setting. Vision Banana achieves better results on public datasets without using camera intrinsics during either training or inference.

Surface Normal Estimation: On three indoor datasets, Vision Banana achieves the lowest average angular error, with a mean of 15.549 and a median of 9.300, better than Lotus-2's mean of 16.558. On the outdoor VKitti scenario its performance is on par with Lotus-2. Notably, Lotus-2 was trained on Virtual KITTI 2, whereas Vision Banana strictly maintains the zero-shot setting.
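The mean and median angular errors reported for normals are aggregates of the per-pixel angle between the predicted and ground-truth unit vectors. A minimal sketch of that per-pixel quantity:

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Per-pixel angle (degrees) between predicted and GT normal vectors."""
    pred = np.asarray(pred, float)
    gt = np.asarray(gt, float)
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip((pred * gt).sum(-1), -1.0, 1.0)  # clip guards arccos domain
    return np.degrees(np.arccos(cos))

# One perfect prediction and one off by 90 degrees.
errs = angular_error_deg([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                         [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

Benchmarks then report the mean and median of `errs` over all valid pixels, which is why both statistics appear in the table.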

Table | Surface normal estimation results. Vision Banana achieves the lowest mean and median angular errors on indoor datasets on average and is on par with the previous SOTA methods in outdoor scenarios.

Retention of Generation Ability: On the GenAI-Bench text-to-image comparison, Vision Banana's win rate against the base model Nano Banana Pro is 53.5%; on the ImgEdit image editing task, its win rate is 47.8%. This indicates that the model's generation ability remains stable after lightweight instruction tuning.

What Else Needs to be Done?

The research team acknowledges that Vision Banana is not perfect and leaves room for improvement in future work.

For example, Vision Banana's instance segmentation still lags behind SAM 3 on the SA-Co/Gold dataset. The paper notes that this is partly because Vision Banana's training data does not include SA-Co, on which SAM 3 was trained; the task itself also challenges the class-based reasoning strategy.

Computational overhead is another current limitation. Using an image generator at NBP's scale for visual understanding carries a higher inference cost than lightweight dedicated models; deploying a generative visual framework at scale will require further gains in speed and reductions in cost.

The current evaluation is limited to monocular image input; it could be extended to multi-view and video input. Exploring whether video generators learn richer time-aware representations is also considered a worthwhile direction. Expanding the diversity of instruction-tuning tasks may unlock stronger cross-task generalization, as with LLMs. In addition, integrating the visual foundation model with large language models to enhance cross-modal reasoning is an important direction for the next stage.

From a broader perspective, this work attempts to bring the LLM-era paradigm of "pre-training produces a general base, instruction tuning aligns it to specific tasks" into the visual field. If image generation can become a general interface for vision, the two largely independent research lines of "generation" and "understanding" may eventually converge into a single visual foundation model.

This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao). Author: Academic Headlines. Republished by 36Kr with permission.