
The visual ability of the most powerful model is inferior to that of a 6-year-old child.

量子位 (QbitAI) | 2026-01-22 21:05
Visual reasoning relying solely on language won't work.

Who would have thought?

In the field of visual reasoning, large models are still as inexperienced as a three-year-old child.

The latest research from institutions including UniPat AI, xbench, Alibaba, Moonshot AI, and StepFun shows that:

On the BabyVision visual reasoning benchmark, the currently strongest performer, Gemini 3 Pro Preview, only barely beats three-year-old children, and still has a roughly 20% gap to six-year-old children.

Compared with the adult level of 94.1, the gap is a world apart.

More importantly, Gemini 3 Pro Preview is already the "ceiling" among current models.

Other cutting-edge models, including GPT-5.2, Claude 4.5 Opus, and Grok-4, perform even worse overall than three-year-old children.

This sobering conclusion pours cold water on current embodied intelligence built on VLA(M) models.

After all, it is hard to expect an AI whose visual ability has not reached the level of a three-year-old child to assist humans stably and safely in the real physical world.

It is precisely in this sense that BabyVision also provides another perspective:

To truly advance multimodal intelligence, future models must reconstruct visual abilities from the ground up, rather than continuing to rely on translating visual problems into language to "circumvent" them.

The Linguistic Bottleneck of Visual Reasoning

In the comprehensive evaluation, the research compared the performance of open-source and closed-source models:

Among the closed-source models, Gemini 3 Pro Preview leads with a score of 49.7%, followed by GPT-5.2 (34.4%) and Doubao-Seed-1.8 (30.2%).

The performance of the remaining models is less satisfactory: Qwen3-VL-Plus 19.2%, Grok-4 16.2%, Claude 4.5 Opus 14.2%.

Among the open-source models, the best performer is Qwen3-VL-235B-Thinking, with an overall score of 22.2%.

Notably, the Thinking version of Qwen3-VL outperforms the Instruct version, which suggests that explicit reasoning can reduce visual uncertainty.

In addition, even the largest open-source models still cannot match the top-tier closed-source systems.

So, the question arises.

Why do large models that display a doctorate-level "IQ" on high-difficulty benchmarks such as HLE, and can even solve IMO-level math problems, frequently fail at seemingly simple "spot the difference" tasks?

The conclusion first: current multimodal large models typically convert visual inputs into language representations before reasoning over them.

This approach makes full use of the powerful reasoning ability of large language models, but it also introduces a fundamental limitation:

Any visual information that cannot be accurately expressed in language will be lost in this process.

For example, "a red car" in an image can be easily transcribed into text; but more fine - grained geometric information, such as the exact curvature of the boundary, the specific position of the intersection point, and the subtle changes in the relative spatial relationship, is difficult to be faithfully described in language.

It is precisely these "indescribable" visual features that constitute the core difficulty of the BabyVision tasks and are where current top-tier multimodal models generally fail.
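To make this "lossy translation" concrete, here is a minimal, purely illustrative sketch (plain NumPy, not from the paper): two candidate puzzle pieces that a coarse text-style summary cannot tell apart, even though a direct pixel comparison can.

```python
import numpy as np

# Two candidate pieces: both "an L-shape covering 12 of 16 cells",
# but the notch sits in a different corner.
piece_a = np.ones((4, 4), dtype=int); piece_a[0:2, 2:4] = 0   # notch at top-right
piece_b = np.ones((4, 4), dtype=int); piece_b[2:4, 2:4] = 0   # notch at bottom-right

def coarse_caption(mask):
    """A deliberately lossy 'verbal' summary: area and bounding box only,
    roughly what survives once a shape is transcribed into text."""
    ys, xs = np.nonzero(mask)
    return {"area": int(mask.sum()),
            "bbox": (int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max()))}

# The text-level summaries collide ...
print(coarse_caption(piece_a) == coarse_caption(piece_b))   # True
# ... while a direct pixel comparison still tells the pieces apart.
print(np.array_equal(piece_a, piece_b))                      # False
```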

Specifically, BabyVision breaks down visual reasoning into four core ability dimensions:

Fine-grained Discrimination: Detecting subtle visual differences

Visual Tracking: Tracking paths, lines, and motion trajectories

Spatial Perception: Understanding three-dimensional structures and spatial relationships

Visual Pattern Recognition: Recognizing logical and geometric patterns in vision

Based on these ability dimensions, the research identifies four core visual challenges currently facing MLLMs:

The Lack of Non-Verbal Fine Details

First is the lack of non-verbal fine detail, which is often difficult to describe accurately in language.

For example, when faced with a small offset, a specific boundary curve, or just a one-pixel difference, multimodal large models (MLLMs) often treat these completely different options as similar.

Taking the best-performing Gemini 3 Pro Preview as an example: in the puzzle-matching task below, it incorrectly chose Option D.

(Correct answer: B)

In its reasoning trace, Gemini first converts the shapes into text descriptions, simplifies them into coarse features (such as counts and topological structure), and then compares the candidate options in language space.

In contrast, humans complete the task instantly through direct shape matching. The human brain translates and rotates each candidate option, checks whether the boundaries are aligned, and the whole process is directly driven by geometry without going through text.

So the bottleneck here is not logical difficulty, but the lack of high-fidelity perception.
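As a hedged illustration of what "geometry-driven matching" could look like in code (a toy sketch under simple assumptions, not the benchmark's method): rotate each candidate mask through the four 90° orientations and score its pixel overlap against the target, then pick the best-overlapping option.

```python
import numpy as np

def match_score(target: np.ndarray, candidate: np.ndarray) -> float:
    """Best intersection-over-union of the candidate against the target
    over the four 90-degree rotations (a crude stand-in for mental rotation)."""
    best = 0.0
    for k in range(4):
        rot = np.rot90(candidate, k)
        if rot.shape != target.shape:
            continue
        inter = np.logical_and(target, rot).sum()
        union = np.logical_or(target, rot).sum()
        best = max(best, inter / union if union else 0.0)
    return best

def pick_option(target: np.ndarray, options: dict) -> str:
    """Choose the option whose shape overlaps the target best, purely in pixel space."""
    scores = {name: match_score(target, mask) for name, mask in options.items()}
    return max(scores, key=scores.get)
```

Translation could be normalized the same way (e.g. cropping each mask to its bounding box before comparing); the point is that no step of the comparison passes through text.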

The Loss of Manifold Identity

In addition, the research found that multimodal large models struggle to maintain reliable perceptual consistency over long spatial distances.

For example, in the following line-connecting task, Gemini 3 Pro Preview failed again, incorrectly connecting the plastic bottle to the green trash can and the apple core to the blue trash can.

(Correct answer: Plastic bottle - blue, Test paper - yellow, Apple core - green)

The research found that when solving the problem, Gemini usually breaks a continuous curve into a series of simple instructions, such as left, right, up, and down.

The problem is that once the lines cross, this decomposition makes the path ambiguous and error-prone.

Since the model does not "really remember" what the curve looks like in its mind, it may accidentally switch to another line after passing through the intersection point.

To humans, this kind of error is obvious at a glance, but it is hard to detect once the information has been compressed into text.

In contrast, humans simply follow a line to its end, an ability that develops naturally in infancy.
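For intuition, here is a hypothetical sketch (not the paper's method) of "following the line to the end" on a pixel grid: at a crossing, the tracer keeps whichever unvisited neighbour best continues the incoming direction, instead of re-deciding from a textual left/right/up/down summary.

```python
import numpy as np

def trace_curve(mask: np.ndarray, start: tuple, first_step: tuple) -> list:
    """Follow a one-pixel-wide curve on a binary mask from `start`.
    At crossings, prefer the unvisited neighbour most aligned with the
    incoming direction, i.e. keep going 'straight through' the junction."""
    pos = np.array(start)
    direction = np.array(first_step, dtype=float)
    path, visited = [tuple(pos)], {tuple(pos)}
    while True:
        candidates = []
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                ny, nx = pos[0] + dy, pos[1] + dx
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and (ny, nx) not in visited):
                    step = np.array([dy, dx], dtype=float)
                    # cosine similarity with the incoming direction = continuity
                    score = float(step @ direction) / (np.linalg.norm(step) * np.linalg.norm(direction))
                    candidates.append((score, (ny, nx), step))
        if not candidates:
            return path                      # reached the end of the line
        _, best, step = max(candidates, key=lambda c: c[0])
        direction, pos = step, np.array(best)
        visited.add(best)
        path.append(best)
```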

Spatial Imagination

The third common challenge identified by the research is "spatial imagination": constructing a stable three-dimensional internal representation from a two-dimensional image and mentally transforming it while keeping the structure intact -

Such as switching perspectives, projecting outlines, or inferring occluded volumes.

For example: Given a view, imagine what it should look like from the side.

In this task, Gemini 3 Pro Preview still chose the incorrect Option C.

(Correct answer: A)

In Gemini's reasoning process, the model first converts the visual scene into a language summary, describes the objects in words, and then "guesses" the two-dimensional features based on these words.

But here lies the problem: a textual narration cannot faithfully represent the spatial state.

Once the precise image is compressed into a vague text summary, the model is likely to make predictable mistakes: missing occluded building blocks, miscounting the number of layers, or getting the three-dimensional projection wrong.

In contrast, humans can directly "rotate" the object in their mind, view it from the specified direction, and compare, with hardly any involvement of language.
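A minimal sketch of what such a transformation looks like when carried out in an explicit spatial representation rather than in text (a toy NumPy example, not the benchmark's protocol): build a small voxel grid of building blocks and collapse it along an axis to get a side-view silhouette.

```python
import numpy as np

# A tiny 3x3x3 occupancy grid of building blocks, indexed (z, y, x):
# a two-block tower at the front-left plus one block hidden behind it.
blocks = np.zeros((3, 3, 3), dtype=bool)
blocks[0, 0, 0] = True   # bottom block of the tower
blocks[1, 0, 0] = True   # block stacked on top
blocks[0, 1, 0] = True   # block behind the tower (occluded from the front)

# Orthographic silhouettes: collapse the grid along one axis.
front_view = blocks.any(axis=1)   # looking along y: the occluded block disappears
side_view  = blocks.any(axis=2)   # looking along x: the hidden block shows up

print(front_view.astype(int))
print(side_view.astype(int))
```

A text summary of the front view alone would never mention the hidden block; the 3D representation keeps it and recovers it the moment the viewpoint changes.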

Visual Pattern Induction

The fourth challenge is visual pattern induction: inferring general transformation rules from a small number of visual examples and applying them to new inputs.

In the following pattern-finding problem, Qwen3-VL-Plus chose the incorrect Option B.

(Correct answer: C)

On this kind of task, the model's usual approach is not to understand "what changed" but to count attributes.

For example, how many colors there are, how many shapes there are, and whether the elements are similar. It describes the source image and the target image, and then tries to "match" the two at the text level.

In contrast, humans usually compare the before-and-after visual examples directly and form a simple "causal diagram" in their minds:

Which shape contains which? Which is the frame and which is the content? How are these roles reassigned from input to output?

It is this ability to reason abstractly about visual relationships, rather than merely to recognize, that remains a threshold current model architectures struggle to cross.
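To make "forming a causal diagram of roles" concrete, here is a hypothetical symbolic sketch (not from the paper): each figure is summarized as an (outer, inner) pair, and a candidate rule is kept only if it explains every training pair before being applied to the test input.

```python
# Each figure is summarized as (frame, content); the induced rule must
# explain every training pair before it is applied to the test input.
train_pairs = [
    (("circle", "square"), ("square", "circle")),
    (("triangle", "star"), ("star", "triangle")),
]
test_input = ("hexagon", "cross")

def swap_roles(fig):
    frame, content = fig
    return (content, frame)

candidate_rules = {"identity": lambda fig: fig, "swap_roles": swap_roles}

# Keep only the rules consistent with all observed example pairs.
consistent = {name: rule for name, rule in candidate_rules.items()
              if all(rule(src) == dst for src, dst in train_pairs)}

print(list(consistent))                          # ['swap_roles']
print(consistent["swap_roles"](test_input))      # ('cross', 'hexagon')
```

Attribute counting ("two shapes, two colors") never surfaces the role swap; reasoning over the relation between the shapes does.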

Visual Reasoning Based on RLVR and Generative Modeling

So, given that text-mediated visual reasoning (as in VLMs) has inherent limitations, is there a way to do better?

In response, the research provides two directions: Reinforcement Learning with Verifiable Rewards (RLVR) and visual reasoning based on generative models.

First, let's look at RLVR.

Specifically, the research uses Qwen3-VL-8B-Thinking as the base model and conducts RLVR fine-tuning on it.

The experiments show that after RLVR fine-tuning, the model's overall accuracy increases by about 4.8 percentage points, and most task sub-categories show improvements of varying degrees.

This is consistent with the insight from the Qwen reasoning models: once the visual signal has been extracted, explicit intermediate reasoning can partially offset visual uncertainty.
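For readers unfamiliar with RLVR, its core ingredient is a programmatically checkable reward. A minimal, hypothetical sketch for multiple-choice visual questions (an illustration of the general recipe, not the paper's training code) could look like this:

```python
import re

def verifiable_reward(model_output: str, gold_option: str) -> float:
    """Binary reward: 1.0 if the final answer letter the model commits to
    matches the ground-truth option, else 0.0. No learned judge is needed,
    which is what makes the reward 'verifiable'."""
    match = re.search(r"[Aa]nswer\s*[:：]?\s*([A-D])", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == gold_option.upper() else 0.0

# The scalar reward is then fed to a policy-gradient loop (e.g. GRPO/PPO)
# over sampled reasoning traces; only the reward function is shown here.
print(verifiable_reward("...so the piece fits. Answer: B", "B"))   # 1.0
print(verifiable_reward("...thus Answer: D", "B"))                 # 0.0
```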

Next is the generative model method.

Since carrying visual reasoning in language inevitably distorts information, can a model imitate humans and complete its reasoning through "visual reconstruction" -

That is, computing directly in pixel space (for example, drawing lines or completing patterns)?

Based on this idea, the research introduces BabyVision-Gen and evaluates three cutting-edge visual generative models on it: NanoBanana-Pro, GPT-Image-1.5, and Qwen-Image-Edit.

(Note: BabyVision-Gen selects 280 questions suitable for generative interaction from the full benchmark and requires the model to express its problem-solving process by directly outputting images or video streams.)

The experimental results show that NanoBanana-Pro performs best, with an accuracy of 18.3%, while GPT-Image-1.5 and Qwen-Image-Edit score 9.8% and 4.8% respectively.
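How such image outputs are graded is not specified in this excerpt; as one plausible, purely illustrative approach (an assumption, not BabyVision-Gen's actual protocol), a generated answer region could be compared against a reference solution by intersection-over-union with a pass threshold:

```python
import numpy as np

def grade_generated_answer(pred: np.ndarray, ref: np.ndarray, thresh: float = 0.8) -> bool:
    """Illustrative grader: binarize the predicted and reference answer regions
    (e.g. the drawn connecting line) and pass if their IoU clears a threshold.
    This is an assumed evaluation scheme, not the benchmark's documented one."""
    pred_mask = pred > 0.5
    ref_mask = ref > 0.5
    union = np.logical_or(pred_mask, ref_mask).sum()
    if union == 0:
        return False
    iou = np.logical_and(pred_mask, ref_mask).sum() / union
    return iou >= thresh
```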

Although the success rate is still not high, the research believes that models such as NanoBanana-Pro and Sora-2 have shown explicit visual thinking and can generate