
Are video models really reasoning, or just "pretending" to? Researchers from The Chinese University of Hong Kong and other institutions ask: Is Chain-of-Frame real?

机器之心 (Machine Intelligence) · 2025-11-19 07:59
This research provides a clear and systematic empirical analysis and evaluation framework for the academic community.

In recent years, video generation models such as Veo and Sora have demonstrated astonishing synthesis capabilities, producing highly realistic and temporally coherent video. Their progress in visual content generation suggests that they may implicitly capture the structure and laws of the world. More notably, recent research from Google points out that models such as Veo 3 are gradually showing "emergent properties" beyond simple synthesis, including higher-level abilities such as perception, modeling, and reasoning.

This has given rise to a new concept analogous to "Chain-of-Thought (CoT)" in language models: "Chain-of-Frame (CoF)". Its core idea is that the model solves problems step by step through coherent visual deduction, generating a video frame by frame. A key question, however, remains unanswered: do these models truly possess zero-shot reasoning ability, or are they merely imitating surface patterns from their training data?

To explore this question, a research team from The Chinese University of Hong Kong, Peking University, and Northeastern University conducted a systematic study, evaluating in depth the zero-shot reasoning potential of models such as Veo 3, and proposed a comprehensive benchmark, MME-CoF, covering 12 reasoning dimensions including space, geometry, physics, and time.

Paper Title: Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Paper Link: https://arxiv.org/pdf/2510.26802v1

Project Homepage: https://video-cof.github.io/

What is Chain-of-Frame (CoF) reasoning?

"Chain-of-Frame reasoning" can be regarded as the visual analogue of "Chain-of-Thought (CoT)" in language:

CoT demonstrates the reasoning path by generating text step by step.

CoF reflects the deduction process by generating frames one by one, making the scene evolve visually.
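To make the analogy concrete, here is a toy sketch (all names hypothetical, not from the paper): both CoT and CoF are reduced to moving a marker toward a goal one step at a time, where each list entry stands in for one "frame" of the deduction.

```python
# Toy illustration of the CoT-vs-CoF analogy (hypothetical names, not the
# paper's code). A "frame" is reduced to a single integer state; a coherent
# Chain-of-Frame trace changes the scene only incrementally per frame.

def chain_of_frame(start: int, goal: int, max_frames: int = 32) -> list[int]:
    """Emit one 'frame' per deduction step, nudging the state toward the goal."""
    frames = [start]
    while frames[-1] != goal and len(frames) < max_frames:
        step = 1 if goal > frames[-1] else -1
        frames.append(frames[-1] + step)  # each frame is one visible step
    return frames

def is_coherent(frames: list[int]) -> bool:
    """A trace is coherent if consecutive frames differ by at most one step."""
    return all(abs(b - a) <= 1 for a, b in zip(frames, frames[1:]))

if __name__ == "__main__":
    trace = chain_of_frame(0, 5)
    print(trace)               # [0, 1, 2, 3, 4, 5]
    print(is_coherent(trace))  # True
```

The point of the analogy is the same in both cases: the intermediate steps (text tokens for CoT, frames for CoF) are supposed to carry the reasoning, not just decorate the final answer.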

In-depth analysis: 12 reasoning challenges

To comprehensively probe the reasoning potential of video models, the research team designed test tasks across 12 dimensions and conducted a systematic empirical analysis of Veo 3. Three representative dimensions are highlighted below (see the original paper for the rest).

1. Real-World Spatial Reasoning

Task: Evaluate the model's ability to maintain spatial consistency in multi-perspective natural scenes, including perspective changes, orientation alignment, and reference-frame stability.

Findings: It handles spatial layout and perspective switching well in simple scenes, maintaining reasonable spatial relationships and directional consistency locally.

Limitations: It is unstable on tasks involving complex perspective changes or deeper scene understanding. Spatial misalignment, perspective drift, or directional confusion often occur, and it struggles to maintain global coordinate consistency.

2. 3D Geometry Reasoning

Task: Evaluate the model's structural understanding and continuity in three-dimensional geometric transformation tasks, such as object folding, rotation, and 3D reconstruction.

Findings: It can generate structurally complete and visually coherent results for single-step and simple geometric transformations, showing a preliminary understanding of three-dimensional shapes.

Limitations: Structural misalignment, self-intersection, or collapse often occur in multi-step or composite transformations. It cannot maintain geometric consistency and physical plausibility, and its overall 3D reasoning remains fragile.

3. 2D Geometry Reasoning

Task: Evaluate the model's accuracy and constraint-keeping ability in planar geometric construction and shape-manipulation tasks, such as connecting points, moving shapes, and understanding construction order.

Findings: It can identify and correctly draw basic relationships in simple geometric connection tasks, showing a preliminary geometric construction ability.

Limitations: It tends to generate visually appealing figures rather than strictly geometrically correct ones. Errors in connection order, shape deformation, or drawing beyond the task scope often occur, and it lacks a stable sense of geometric constraints.

Overview of the other nine reasoning dimensions

In addition to the three dimensions above, the other nine also reveal Veo 3's limitations:

Visual Detail Reasoning: The recognition of occluded or tiny targets is unstable, and the generated content tends to deviate from the task requirements.

Visual Trace Reasoning: Long-term temporal dependencies and rule-driven action chains are prone to breaking, and causal consistency is insufficient.

Physics-Based Reasoning: It fails to accurately follow physical laws such as energy and mechanics, showing only a "simulation" at the visual level.

Rotation Reasoning: Small-angle rotations can be approximately achieved, but the structure collapses at large angles.

Table & Chart Reasoning: It can imitate local visual patterns but lacks a real understanding of numerical relationships.

Object Counting Reasoning: It performs well in static scenes but often misses or double - counts in dynamic environments.

GUI Reasoning: It can generate click or drag actions but lacks awareness of the operation purpose and logic.

Embodied Reasoning: It can recognize object positions and actions but does not follow environmental rules and occasionally generates "cheating" results.

Medical Reasoning: It shows superficial competence when zooming in on or inspecting local details, but it cannot maintain the logical consistency of the image, and structural errors often occur.

MME-CoF: The first video reasoning benchmark

The research team distilled the above empirical study into the MME-CoF benchmark, which evaluates the reasoning potential of video models in a standardized way. Its main features include:

The first framework for systematically quantifying the reasoning ability of video models;

Covering 12 dimensions and 59 carefully designed tasks;

Innovative prompt-based design: abstract reasoning tasks (such as physics, geometry, and counting) are recast as visual video-generation challenges, forcing the model to exhibit procedural thinking through Chain-of-Frame reasoning.

The table below shows the evaluation results of various video generation models on the MME-CoF benchmark. Scoring was performed by Gemini-2.5-Pro on a 0-4 scale across five evaluation dimensions. Overall, the average scores of all models fall below 2 points.
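The aggregation behind such a leaderboard can be sketched as follows. This is a hypothetical harness, not the paper's pipeline: assume the judge model has already assigned each generated video an integer score of 0-4 per evaluation dimension, and a model's leaderboard number is the mean over all its dimension scores.

```python
from statistics import mean

# Hypothetical sketch of judge-based score aggregation (not MME-CoF's actual
# code). Each video receives one integer 0-4 score per evaluation dimension
# from a judge model; a model's overall score is the mean over everything.

DIMENSIONS = 5  # the benchmark scores along five dimensions

def validate(scores: list[int]) -> list[int]:
    """Check one video's score vector against the 0-4, five-dimension scheme."""
    assert len(scores) == DIMENSIONS, "expected one score per dimension"
    assert all(0 <= s <= 4 for s in scores), "scores lie on a 0-4 scale"
    return scores

def model_average(per_video_scores: list[list[int]]) -> float:
    """Average over all dimension scores of all videos for one model."""
    flat = [s for video in per_video_scores for s in validate(video)]
    return mean(flat)

if __name__ == "__main__":
    # Two illustrative, made-up score vectors for one model:
    videos = [[2, 1, 2, 1, 2], [1, 2, 1, 1, 2]]
    print(model_average(videos))  # 1.5 -- below the 2-point mark
```

Averaging per dimension first (then across dimensions) would give the same result here since every video is scored on all five dimensions; the flat mean is simply the shortest way to write it.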

Conclusion: Reasoning or performing?

Based on the empirical analysis of Veo 3 and the quantitative evaluation results of many video models, the researchers drew the following conclusions:

1. It does not yet have independent zero-shot reasoning ability: the model relies mainly on data patterns rather than logical deduction.

2. Strong generation ≠ strong reasoning: its performance stems mainly from pattern memory and visual consistency rather than conceptual understanding.

3. It focuses on appearance rather than causality: generated results often "look correct" but are not logically valid.

4. It still holds future potential: it could serve as a powerful complementary module for visual reasoning systems, cooperating with logical models to build a more complete multimodal intelligent system.

Overall, this research provides the academic community with a clear, systematic empirical analysis and evaluation framework, revealing the key gap that video generation models must still cross on the way from "generation" to "reasoning" and toward a truly general visual model.

This article is from the WeChat public account 机器之心 (Machine Intelligence) and is republished by 36Kr with authorization.