DeepMind is first to propose CoF: video models have their own chain of thought.
What's next for Chain of Thought (CoT)?
DeepMind has proposed Chain of Frames (CoF).
Frame-by-frame video generation parallels chain-of-thought reasoning in language models: just as Chain of Thought (CoT) enables language models to perform symbolic reasoning, Chain of Frames (CoF) enables video models to reason across time and space.
This view comes from DeepMind's newly published Veo 3 paper, which draws an analogy to CoT in language models and introduces the concept of CoF for the first time.
Moreover, through extensive testing, the team found that:
Video models represented by Veo 3 are developing general visual understanding capabilities and can solve the entire chain of visual tasks from "seeing" to "thinking" in a zero-shot manner. They are making rapid progress and are expected to become the "general foundation models" for machine vision in the future.
A more straightforward summary is that "Veo 3 is the GPT-3 moment in the field of visual reasoning."
To get a deeper sense of this new concept and its value, let's first take a look at the original paper.
DeepMind Introduces the CoF Concept for the First Time
According to the paper, CoF grew out of a question the DeepMind team was curious about:
Can video generation models, like large language models (LLMs) such as ChatGPT, handle various visual tasks without specialized training for a particular task and eventually become "general visual foundation models"?
Why pursue generality? Because machine vision today is still stuck where NLP was before large language models:
To segment objects, you reach for "Segment Anything"; to detect objects, you reach for YOLO. Every new task means fine-tuning or even retraining a model...
Since current video generation models and LLMs follow the same underlying recipe, achieving results through sheer scale of data, general vision is not a baseless claim.
To verify this hypothesis, the team used a very straightforward method: prompting only, with no task-specific training. Through Google's API, they gave the model an initial image (as the first frame) plus a text instruction and asked it to generate an 8-second, 720p video.
This exactly mirrors the LLM playbook of "prompts instead of specialized training": the goal is to probe the model's inherent general capabilities and let it complete each task on its own.
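To make the setup concrete, here is a minimal sketch of that evaluation loop in Python. The `generate_video` wrapper, the task names, and the file paths are hypothetical stand-ins for the paper's actual (non-public) harness; the point is only that one frozen model receives nothing but a first frame plus a text prompt per task.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    first_frame: str   # path to the conditioning image (becomes frame 1 of the video)
    instruction: str   # plain-text prompt; no task-specific training anywhere

def generate_video(first_frame: str, instruction: str,
                   seconds: int = 8, resolution: str = "720p") -> bytes:
    """Hypothetical stand-in for the video-generation API call.

    In the paper's setup this is a single request to Google's API:
    an initial image plus a text instruction go in, an 8-second 720p
    video comes out. No fine-tuning, no task-specific heads.
    """
    return b""  # placeholder; a real client would return encoded video

# Zero-shot evaluation: the same frozen model, only the prompt changes per task.
tasks = [
    Task("deblurring", "blurry_photo.png", "Sharpen this image."),
    Task("search", "cluttered_scene.png",
         "Highlight the blue ball and dim everything else."),
    Task("maze", "maze_5x5.png",
         "Move the red dot along the white path to the green dot "
         "without crossing any black wall."),
]

for task in tasks:
    video = generate_video(task.first_frame, task.instruction)
    # Scoring (human raters or automatic checks) runs on the generated
    # frames; the model itself is never trained or tuned per task.
```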
Through a series of tests, the team found that video models truly have general potential.
Specifically, they used Veo 3 as the experimental subject and found that it has four capabilities, each building on the last:
First, without specialized training, Veo 3 can handle many classic visual tasks and has perception capabilities.
Whether the task is basic (such as sharpening a blurry image) or complex (such as finding the blue ball among a pile of objects), it handles them with ease.
Second, just understanding is not enough. Veo 3 can also "establish the rules of the visual world" and has modeling capabilities.
This shows up in its grasp of both physics (such as knowing that stones sink in water) and abstract relationships (such as judging which objects can fit inside a backpack).
Third, based on "understanding" and "knowing the rules," Veo 3 can actively change the visual world and has manipulation capabilities.
For example, it can edit images (add a scarf to a bird and move it into a snowy scene) and handle 3D and simulation tasks (turn a knight facing forward into one kneeling on one knee).
Fourth, by integrating the above capabilities, Veo 3 can achieve cross-temporal and cross-spatial visual reasoning, which is the so-called Chain of Frames (CoF).
Give it a maze-solving problem: Let the red dot move from the starting point to the green dot along the white path.
Veo 3 can generate a video of the red dot planning its path step by step without hitting the black walls. Given 10 attempts per maze on 5x5 mazes, Veo 3 succeeded 78% of the time, while Veo 2 managed only 14%.
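For a sense of how such a benchmark can be scored automatically, here is a small sketch of a success checker. It assumes the generated frames arrive as RGB numpy arrays aligned to the maze grid; the cell size, color thresholds, and function names are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np

CELL = 32  # pixels per maze cell (illustrative; depends on the rendering)

def dot_cell(frame: np.ndarray) -> tuple[int, int]:
    """Locate the red dot in one RGB frame and map it to a grid cell."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    mask = (r > 200) & (g < 80) & (b < 80)   # "red" pixels
    ys, xs = np.nonzero(mask)                # assumes the dot stays visible
    return int(ys.mean()) // CELL, int(xs.mean()) // CELL

def solved(frames: list[np.ndarray], walls: np.ndarray,
           goal: tuple[int, int]) -> bool:
    """Success: the dot never enters a wall cell and ends on the goal."""
    path = [dot_cell(f) for f in frames]
    if any(walls[r, c] for r, c in path):
        return False
    return path[-1] == goal

# pass@10 scoring: a maze counts as solved if any of 10 generated videos
# passes this check.
```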
More reasoning tests show that, although the reasoning ability is far from perfect (it can stumble on complex rotation analogies), the early outlines of visual intelligence are already visible.
Overall, the team drew the following three core conclusions from the tests:
1. After analyzing 18,384 videos generated across 62 qualitative tasks and 7 quantitative tasks, the team found that Veo 3 can solve many tasks it was never trained or tuned for.
2. Veo 3 uses its capabilities to perceive, model, and manipulate the visual world and exhibits an early form of visual reasoning similar to the "Chain of Frames (CoF)".
3. Although task-specific models still outperform the zero-shot video model on their own turf, the team observed a significant and consistent improvement from Veo 2 to Veo 3, indicating that video model capabilities are advancing rapidly.
"Generalists Will Replace Specialists"
In addition, based on Veo 3's current performance and the prediction that costs may continue to decline, DeepMind also made a bold statement:
In the field of video models, "generalists" will replace "specialists" in the future.
Specifically, as a general-purpose video model, Veo 3 still lags behind dedicated SOTA models on specific tasks; its edge detection accuracy, for example, falls short of specially optimized algorithms.
From a development perspective, however, this gap is narrowing as the model's capabilities improve rapidly. The pattern resembles early large language models such as GPT-3: initially inferior to task-fine-tuned models, they eventually became powerful general foundation models through the evolution of architecture, data, and training methods.
For example, Veo 3 is a comprehensive upgrade over the previous generation, Veo 2, achieved in a short period. This suggests that the model's general visual and generation capabilities are in a period of rapid growth, similar to the rapid development stage of LLMs around 2020.
Second, with the multiple-attempt (pass@10) strategy, that is, generating several times for the same task and selecting the best result, Veo 3 performs significantly better than with a single generation, and performance keeps climbing as the number of attempts grows, with no obvious ceiling yet. Moreover, combined with techniques such as inference-time scaling and RLHF-style instruction tuning, Veo 3's performance can be expected to improve further.
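As a side note on the metric: pass@k is the probability that at least one of k sampled generations solves the task. Below is the standard unbiased estimator from the code-generation literature (Chen et al., 2021); with n = k = 10 it reduces to the simple "did any of the 10 attempts succeed" rule, which matches how the maze results above read. Whether the paper uses this exact estimator is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn from
    n generations of which c are correct, solves the task (Chen et al., 2021).
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n == k == 10, this reduces to "did any of the 10 attempts succeed":
print(pass_at_k(10, 0, 10))  # 0.0 -> no attempt succeeded
print(pass_at_k(10, 3, 10))  # 1.0 -> at least one attempt succeeded
print(pass_at_k(10, 3, 1))   # 0.3 -> expected single-shot success rate
```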
In addition, although video generation currently costs more than dedicated task models, Epoch AI's data shows that LLM inference costs fall by a factor of 9 to 900 per year. Early general-purpose NLP models such as GPT-3 were likewise questioned on cost grounds, yet they ultimately replaced dedicated models through the combination of general value and falling costs.
Therefore, it is highly likely that machine vision will follow the same path, and the cost problem of video models will be gradually solved in the future.
In summary, DeepMind is very confident in general video models.
As netizens put it, the newly proposed CoF concept may blaze a new trail for video models, just as CoT once did.
Paper: https://papers-pdfs.assets.alphaxiv.org/2509.20328v1.pdf
Reference Links:
[1] https://x.com/AndrewCurran_/status/1971997723261075905
[2] https://simonwillison.net/2025/Sep/27/video-models-are-zero-shot-learners-and-reasoners/
This article is from the WeChat official account "QbitAI", author: Yishui. Republished by 36Kr with permission.