Sorry, robots cannot understand the world solely through vision.
In 2026, AI is moving from the "generative model era" to the "world model era."
Recently, the team led by Turing Award laureate Yann LeCun released LeWorldModel, based on the JEPA architecture. At the end of 2025, World Labs, led by Stanford professor Fei-Fei Li, launched Marble, the first commercial 3D world model. On the industry side, almost every embodied-intelligence company now lists the world model as a core technological direction, aiming to enable robots to truly understand and predict the physical world. The world model has become an industry consensus.
From the current technological paradigm, several routes have roughly emerged around the world model:
The first is the JEPA abstract-prediction path. It does not pursue pixel-level reconstruction; instead, it focuses on learning the causal structure and physical laws of the world in a compact latent space. LeWorldModel, recently released by Yann LeCun's team, is the latest progress on this route.
The second is the 3D/simulation-driven path. It prefers to build a controllable virtual environment or 3D reconstruction so that the model can learn physical laws and interaction logic in a "computable world." World Labs, led by Fei-Fei Li, and its product Marble are representative of this path.
The third is the video-driven path. Starting from video generation models, it aims for models that not only "understand" what is happening in a video but also grasp the underlying physical laws, and then predict and generate actions on that basis, moving from "generating videos" to "understanding the world and taking actions." Companies such as Shengshu Technology and Runway are currently exploring this direction.
Amid this competition among routes, Jun Zhu, founder of Shengshu Technology and deputy dean of the Institute for Artificial Intelligence at Tsinghua University, is one of the firm advocates of the video path. Speaking at the "AI Future Forum of the Zhongguancun Forum" on the afternoon of March 29, Jun Zhu said, "The general world model is the bridge connecting the digital world and the physical world, and video is the most natural data form for recording the real world."
Jun Zhu predicts that the world model will become the core "intelligent center" of various intelligent agents in the future and will achieve rapid breakthroughs in 2026.
After the forum, Tencent Technology spoke further with Jun Zhu: Against the backdrop of multiple parallel technological routes, why is the video path likely to be the first to close the loop on world-model capabilities? How should this path be rolled out in real scenarios, and at what pace? Where are the core technical and data difficulties today, and in which application scenarios will breakthroughs come first?
The following is the essence of the exchange with Jun Zhu:
Q: Why has the "world model" become a major industry trend this year?
Jun Zhu: This is actually a gradual evolution process.
Compared with previous model paradigms, the world model makes more comprehensive demands on capability. It must not only understand language and hold conversations but also have multimodal capabilities such as image recognition and video understanding, and even other modalities like touch. At the same time, it must be able to generate actions, so the overall complexity is higher.
From the perspective of the technological development path, there is also a fairly clear sequence: first the development of language models, then the breakthrough of video models. Once video models progressed, we saw a very natural and crucial transition: a video-native model can be extended to understanding the physical world. Once action capability is introduced, the unified architecture of the world model gradually takes shape.
Therefore, to some extent, the rise of the world model is inseparable from the rapid progress of video models. At the same time, with the continuous integration of more modalities and capabilities, this direction will also continue to evolve in more new dimensions.
Q: Currently, scholars like Yann LeCun and Fei-Fei Li are also exploring the world model from different paths, such as being more inclined to 3D reconstruction or simulation environments. Why did Shengshu Technology choose the "video" as the core path? What is the essential difference between this path and other directions?
Jun Zhu: We always think about this problem from the first principles of foundation models. Building a foundation model essentially depends on two core elements: first, a sufficient scale of data; second, a model architecture that can scale up.
In terms of the model architecture, we were one of the early teams in the industry to adopt the DiT (Diffusion Transformer) architecture, and we have also verified that this path can continuously improve the model performance by increasing the parameter scale.
At the data level, we believe video is currently the most suitable and most general data form for recording the real world. It not only contains rich laws of how the world operates but also naturally carries a large amount of action and behavior information. Moreover, video data can scale continuously: as the real world keeps changing, new video data keeps being generated.
In contrast, the other type of path leans toward rendering, such as 3D environment modeling or 3D object reconstruction, focusing mainly on restoring and reconstructing the scene. Rendering is of course valuable, but it mainly serves human visual needs.
A machine, however, does not need every pixel detail to be restored. It only needs to perceive its own state, understand the next movement pattern, or execute instructions in order to complete a task.
From this perspective, training the model on video can not only continuously support large-scale training and iteration but also avoid unnecessary rendering overhead, giving it a clear advantage in efficiency.
Q: Compared with language models, the video path has a higher computational density, and the training and inference costs are also heavier. Will this cost pressure become the core bottleneck for the development of the world model with the video path?
Jun Zhu: The problems of computational cost and computational volume are inevitable for all teams working on large models, but this problem is not insoluble.
Video computation is very different from that of language models. Language processing is usually sparse, while video has a higher computational density. But in the video domain we can make full use of the parallel computing architecture of GPUs. Algorithm iteration is also very fast now; for example, the low-precision computing methods we are working on can make full use of hardware compute and significantly accelerate training and inference.
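To make the low-precision idea concrete, here is a minimal NumPy sketch of a common mixed-precision pattern (storing tensors in float16 to halve memory and bandwidth, then checking the rounding cost against a float32 reference). The shapes are arbitrary and this is only an illustration of the general technique, not Shengshu's actual method.

```python
import numpy as np

# Illustrative mixed-precision pattern (not any specific vendor's method):
# keep activations and weights in float16 to halve memory/bandwidth,
# then compare against a float32 reference to see the rounding cost.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256)).astype(np.float16)   # activations
w = rng.standard_normal((256, 128)).astype(np.float16)  # weights

print(x.nbytes)  # 32768 bytes -- half of the 65536 a float32 copy needs

y_ref = x.astype(np.float32) @ w.astype(np.float32)  # full-precision reference
y_half = (x @ w).astype(np.float32)                  # half-precision compute

# The discrepancy stays small relative to the output magnitude.
print(float(np.max(np.abs(y_ref - y_half))))
```

In production systems this same idea is usually applied via hardware-accelerated formats (FP16/BF16/FP8 tensor cores) with loss scaling, which is where the training and inference speedups come from.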
I believe that the upper limit of intelligent capabilities will be broken through first. Then, with the continuous iteration of algorithms and hardware, the computational difficulties we face today may no longer be difficulties in the future.
Q: In the process of processing and utilizing large-scale video data, where are the biggest current difficulties concentrated? What core problems does data governance need to solve?
Jun Zhu: The core challenge of data processing is that data governance must be coordinated with the model and algorithm, rather than being an independent link. Only after the model and algorithm framework are determined can we truly judge how to clean and screen the data, and even when to annotate or weakly annotate it.
In other words, the value of data is not static but dynamically matched with the model's capabilities. During the training process, we also need to continuously understand the distribution characteristics and structural attributes of the data itself and adjust the data strategy accordingly to achieve continuous improvement of the overall performance.
Especially in the video data scenario, the difficulties will be more prominent. On the one hand, the scale of video data is larger and more redundant. How to efficiently screen out "effective information" is a key problem.
On the other hand, the temporal information and action information hidden in the video are not as naturally structured as text, which also puts forward higher requirements for data processing and utilization.
Therefore, in essence, this is not only a data problem but a problem of integrated "data-model-algorithm" coordination. It requires a team to keep polishing through long-term practice and places higher demands on the system capabilities and engineering accumulation of a large-model team.
Q: Without labels, how can the model truly learn "executable capabilities" from videos?
Jun Zhu: Our core idea is to theoretically connect the two types of capabilities of "generation" and "action" through a unified world model framework.
Under this framework, we train on large-scale unlabeled video data and build a scalable general base model. The model no longer just passively understands video content; by learning temporal information and behavior patterns, it gradually establishes a closed loop of capabilities from perception to prediction, decision, and action.
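The perception-prediction-decision-action loop described above can be sketched as a toy control loop. Everything here is a stand-in stub (a 1-D "world," assumed linear dynamics, a hypothetical goal), chosen only to show how the four stages connect, not how Shengshu's model works:

```python
# Toy sketch of a perception -> prediction -> decision -> action loop.
# All components are illustrative stubs: the "world" is a 1-D position,
# the "model" predicts the next state for each candidate action, and the
# policy picks the action whose predicted state is closest to the goal.
GOAL = 10.0

def perceive(world_state):
    return world_state  # identity observation for this toy example

def predict(state, action):
    return state + action  # assumed linear dynamics

def decide(state, candidate_actions):
    # Choose the action whose predicted outcome best matches the goal.
    return min(candidate_actions, key=lambda a: abs(predict(state, a) - GOAL))

def act(world_state, action):
    return world_state + action  # apply the action to the world

state = 0.0
for _ in range(20):
    obs = perceive(state)
    action = decide(obs, [-1.0, 0.0, 1.0])
    state = act(state, action)

print(state)  # reaches the goal: 10.0
```

A world model replaces the hand-written `predict` with a learned dynamics model, which is exactly the "imagination" link the interview contrasts with perception-to-action VLA mappings.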
And we have conducted preliminary verification on various types of tasks. For example:
CAPTCHA operation task: Simulate human mouse operation through a robotic arm to achieve screen recognition and precise clicking.
Chess decision-making task: Involves long-horizon planning and multi-step reasoning, requiring the coordination of perception, prediction, and decision-making.
Flexible object manipulation: Achieve stable grasping of complex and irregular objects.
In the experiment, we observed two key phenomena:
First, the data scaling effect is significant: compared with the traditional VLA route, data utilization efficiency improves by an order of magnitude.
Second, multi-task generalization improves markedly. With a unified model, we achieve efficient generalization across more than 50 tasks, and as the number of tasks grows, performance improves rather than declines. By contrast, traditional VLA models (such as PI0.5) often degrade as the task count increases. This also indirectly suggests that integrating generation and action capabilities in one system through a unified architecture may represent a new development path.
Q: A large amount of video data often only presents the results, lacking a complete causal process. In this case, how can the model avoid only learning surface correlations? Can video data really support "causal understanding"?
Jun Zhu: Indeed, not every video can present a clear causal chain completely. But the core advantage of video data lies in its scale and diversity.
Take a simple "picking up a water cup" action as an example. In a large number of videos from different sources, there will be various grasping methods, different environments, and operation processes under different constraints.
For large models, it is precisely this large-scale, diverse distribution that lets them induce action patterns that generalize, rather than relying on single, standardized data samples. By contrast, fixed-collection data and simulation data, though more structured, are limited in coverage and diversity.
Therefore, we do not rely on any single video to teach causal relationships. Instead, through the information distributed across massive data, we let the model gradually approach a more stable "causal structure" at the statistical level.
Q: With the rapid growth of video data scale, how to further increase the proportion of "effective data"? What methods can truly improve the value of video for model training?
Jun Zhu: To improve the effectiveness of video data, we can mainly start from two directions.
On the one hand, actively construct high-quality data, for example by collecting first-person-perspective footage and introducing structured or weak annotation. This data is more costly, but its information density is higher and its impact on model capability is more direct; its share will gradually increase.
On the other hand, make full use of general video data. The Internet has accumulated a vast number of videos recording daily behaviors and the workings of the physical world. This type of data has natural advantages in scale and diversity and can serve as an important basis for model training.
In essence, these two types of data are complementary: one improves the "information density," and the other provides "scale and coverage," jointly supporting the continuous improvement of the model's capabilities.
Q: Currently, the entire industry is exploring the world model, but it seems that there is no unified technological paradigm like the "Transformer" yet. What do you think of the current stage? What are the key bottlenecks?
Jun Zhu: From the perspective of the core path of video generation, the architecture has gradually become unified. The current mainstream paradigm is the architecture based on DiT (Diffusion Transformer). We were also one of the early teams to explore in this direction and verify its scalability.
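As background on what a DiT-style backbone consumes, here is a minimal NumPy sketch of turning a video clip into a sequence of spatio-temporal patch tokens, the standard preprocessing step before the Transformer layers. All shapes and patch sizes here are hypothetical, not the configuration of any particular model:

```python
import numpy as np

# Sketch: flatten a video clip into spatio-temporal patch tokens, the
# input format a DiT-style (Diffusion Transformer) backbone operates on.
# Shapes are illustrative only.
T, H, W, C = 8, 32, 32, 3          # frames, height, width, channels
pt, ph, pw = 2, 8, 8               # patch size along time / height / width

video = np.random.rand(T, H, W, C).astype(np.float32)

# Split into (T/pt) x (H/ph) x (W/pw) patches, then flatten each patch
# into one token vector.
tokens = (video
          .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
          .transpose(0, 2, 4, 1, 3, 5, 6)
          .reshape(-1, pt * ph * pw * C))

print(tokens.shape)  # (64, 384): 64 tokens, each of dimension 384
```

Because the same tokenization applies at any resolution or clip length, the token count (and thus compute) grows with the data, which is one reason the architecture scales with parameters and data as described.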
From the perspective of industry development, currently, most commercial video models are evolving along this architecture. When the world model extends from video generation, it naturally inherits this technological route. Moreover, this architecture has good scalability and can continuously enhance the model's capabilities by increasing the parameter scale and training data.
Of course, from a broader perspective of the "world model," there are still some key challenges. For example, how to further unify multi - modal capabilities, how to form a more stable closed - loop between perception and action, and how to maintain generalization ability in more complex tasks. These issues are still being continuously explored.
Q: Currently, many robot manufacturers are also developing their own models, but the technological routes vary greatly. From the perspective of the world model, where are the real core competitive barriers mainly reflected?
Jun Zhu: Again, we approach this problem from first principles.
Although there are multiple implementation paths for the "world model" currently, in essence, it needs to have three core capabilities:
First, it should be able to observe and understand the world.
Second, it should be able to predict future states.
Third, it should be able to learn and generate actions based on this understanding and prediction.
Based on these three elements, we can judge whether a system has the complete capabilities of a world model. For example, many current VLA (Vision-Language-Action) models focus mainly on the mapping from perception to action but remain relatively weak in the intermediate prediction and "imagination" links; some more simulation-oriented paths stay mainly at the level of visual presentation and reconstruction.
From the perspective of competitive barriers, it is actually consistent with the development logic of large models: The key still lies in whether the data scale can be continuously expanded, whether the model parameters are scalable, and whether there are computing resources to support large - scale training. These three points jointly determine the upper limit of the foundation model.
Q: Currently, many manufacturers are entering the market through agent products, while you emphasize the world model as a foundational capability. What is the difference in generalization ability between these two paths?
Jun Zhu: I think these two paths are not in conflict. In essence, they are at different levels.
The agent is an application form mainly used to solve tasks in specific scenarios. Currently, most agents are built based on language models and complete specific goals through tool invocation and process orchestration. The world model is more like an underlying infrastructure. It not only focuses on language understanding but also includes the perception, prediction, and action capabilities of the physical world.
From the perspective of generalization ability, the ability boundary of an agent largely depends on the underlying foundation model. If the underlying model is mainly a language model, it will have certain limitations in understanding and acting in the physical world; while the world model attempts to build a more general ability system, enabling the model to achieve stronger generalization in different scenarios, tasks, and even environments.
Therefore, the two are more likely to converge. In the future, robots will likely be agents in the physical world, performing diverse tasks in open environments, but they will need a set of general base models to support generalization across scenarios, tasks, and even embodiments.
This is also the direction we are pursuing: building a basic capability platform centered on the world model to give upper-layer agents a stronger capability boundary.
Q: Based on the world model, which scenarios are most likely to see the first deployments in the next three to five years? What are the key driving factors for achieving breakthroughs?
Jun Zhu: Currently, we are mainly focusing on some of the most challenging general open scenarios, such as home and office environments.
This type of scenario is fundamentally different from structured environments such as factories: it is highly open and complex, and tasks cannot be completed through preset rules or processes. This places higher demands on the model's generality and generalization ability.
Precisely because of the high difficulty, once these scenarios are broken through, the value will be very significant.
From the perspective of the development rhythm, we are relatively optimistic about this direction. With the continuous accumulation of data scale, the continuous maturity of the model architecture, and the gradual improvement of computing