
Hard Krypton Exclusive Interview | WANG Zhongyuan, Dean of Beijing Academy of Artificial Intelligence: VLA Will Not Die, but World Model Is the Future

Qiu Xiaofen, 2026-06-15 09:50
The relationship between the world model and embodied intelligence is essentially the relationship between the "brain" and the "body".

Author | Qiu Xiaofen

Editor | Yuan Silai

In the past few months, the "world model" has rapidly expanded from academic jargon into a key term in the AI and robotics industries.

Behind the industry's focus lies real anxiety.

On the one hand, after two years of wild growth, embodied intelligence has exposed the current shortcomings of AI in the physical world. Robots can recognize objects but don't understand that "pushing a cup will make it fall"; they can understand instructions but can't predict "how much force is needed to unscrew a bottle cap." The world model aims to make up for this shortcoming, enabling robots to learn the laws and causality of the physical world.

In other words, the relationship between the world model and embodied intelligence is essentially the relationship between the "brain" and the "body."

On the other hand, having explored large language models, vision models, and multimodal models, large models now need to take their next step: from the virtual world into the real one.

However, when capital, technology experts, and industrial resources are all poured into this area, people have no answer as to how the world model will truly be applied.

In the view of Wang Zhongyuan, the director of the Beijing Academy of Artificial Intelligence (BAAI), the current global exploration of the world model has split into four distinct paths:

The first type is the language-centered world model, including VLM and VLA. These models predict the next word in text space and learn the world as described by language, but they cannot understand the underlying physical consequences.

The second type is the pixel-centered world model, such as video-generation models like Sora and Seedance. They learn from videos or images in visual space and so learn the world as described by pixels.

The third type is the 3D-structure-centered world model, including 3D reconstruction and the Marble model from Fei-Fei Li's World Labs. However, reconstructing a 3D space does not equal understanding the world, and geometric structures do not represent physical states.

The fourth type is the visual-representation-centered world model, such as Yann LeCun's JEPA series of models. They predict compressed visual representations, but the evolution of visual embeddings does not equal the evolution of physical laws.

Wang Zhongyuan, the director of the Beijing Academy of Artificial Intelligence (BAAI) (Source: Company)

As a non-profit research institution, BAAI is also a leading force in world-model research in China.

Unlike these four, BAAI is currently trying a fifth approach, centered on language and vision and integrated into a unified "latent space representation": all modalities are compressed into the same latent space, and different "decoders" then restore them into different output forms as needed.

As an analogy, this "latent space" is like a "universal scratch paper" for the robot's brain. Whether the input is the video it sees or the text instructions it hears, everything is first compressed on this "paper" into a "secret note" that only the AI can read. When needed, the robot works from that same note to draw the next scene, carry out actions, or calculate the position of objects and the forces on them.
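To make the scratch-paper analogy concrete, here is a minimal toy sketch in Python. It is an illustration only and not BAAI's implementation: the encoders, the averaging step in `fuse`, the three decoders, and the 4-dimensional latent are all hypothetical stand-ins chosen for readability. The only point it shows is the shape of the idea: every modality is compressed into one shared vector, and each output form is decoded from that same vector.

```python
from typing import Callable, Dict, List

Latent = List[float]
LATENT_DIM = 4  # toy size, an assumption made only for this illustration


def encode_text(instruction: str) -> Latent:
    """Toy text encoder: fold character codes into a fixed-size vector."""
    z = [0.0] * LATENT_DIM
    for i, ch in enumerate(instruction):
        z[i % LATENT_DIM] += ord(ch) / 1000.0
    return z


def encode_frame(pixels: List[float]) -> Latent:
    """Toy vision encoder: fold pixel intensities into the same vector size."""
    z = [0.0] * LATENT_DIM
    for i, p in enumerate(pixels):
        z[i % LATENT_DIM] += p
    return z


def fuse(latents: List[Latent]) -> Latent:
    """Merge per-modality latents into one shared latent by averaging."""
    return [sum(vals) / len(latents) for vals in zip(*latents)]


def decode_language(z: Latent) -> str:
    """Hypothetical language head: describe the latent in words."""
    return f"state summary with total activation {sum(z):.2f}"


def decode_action(z: Latent) -> Dict[str, float]:
    """Hypothetical action head: read a 2D motion command off the latent."""
    return {"dx": z[0] - z[1], "dy": z[2] - z[3]}


def decode_vision(z: Latent) -> List[float]:
    """Hypothetical vision head: clamp the latent into pixel range [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in z]


DECODERS: Dict[str, Callable[[Latent], object]] = {
    "language": decode_language,
    "action": decode_action,
    "vision": decode_vision,
}


def run(instruction: str, pixels: List[float]) -> Dict[str, object]:
    """Encode every input into one latent, then decode all outputs from it."""
    z = fuse([encode_text(instruction), encode_frame(pixels)])
    return {name: decoder(z) for name, decoder in DECODERS.items()}


if __name__ == "__main__":
    for name, output in run("close the door", [0.1, 0.5, 0.9, 0.3]).items():
        print(name, output)
```

In a real system the encoders and decoders would be learned networks rather than hand-written functions; the sketch only mirrors the "one note, many readers" structure described above.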

In the years leading up to the world model, BAAI's work in AI has unfolded like a serial drama, building step by step a general foundation that runs from the digital world to the physical world.

It began with the early "Wudao" large model, with which BAAI brought the large-model narrative in China from nothing into the public eye. It then steered the competition towards native, unified multimodal architectures (including Wujie·Emu3 and Wujie·Emu3.5). After that, BAAI explicitly proposed a leap towards "Next State Prediction (NSP)" and built this logic into the deployable systems Wu·Physis and Wujie·RoboBrain Orca.

During this period, Tang Jie, Yang Zhilin, Liu Zhiyuan, Wang He, and other core founders of leading companies in the industry, such as Zhipu AI, Moonshot AI, ModelBest, and Galbot, also conducted research at BAAI.

Although the world model is extremely popular, Wang Zhongyuan has stayed unusually calm amid the hype. He believes the world model is probably at a stage similar to that of deep learning around 2012: data silos were severe, the development path was undetermined, benchmarks were inconsistent, and the ChatGPT era had not yet arrived.

In his view, the next tough battle for the world model lies in a comprehensive competition in several dimensions.

First, the model must not just generate pictures that look real yet violate physical laws, such as "flying pigs." It also needs long-term consistency: not just a few seconds of video but a continuously evolving state.

Second, the world model must perform causal reasoning. It needs to understand the relationship between actions and results; for example, what happens when a cup with a lid and a cup without a lid fall at the same time. Finally, the world model needs to serve as a base model across many scenarios, not just a single demo or task.

In terms of application, he sees the value of the world model being realized in two major directions. Besides breaking through the bottlenecks of embodied intelligence and serving robotics, the world model can also be applied widely to real-world physical scenarios such as serious industries, physical simulation, and scientific research.

"We expect that the world model will become the real brain of robots in the future. The world model solves problems that VLA and VLM cannot solve and provides generalization, long-horizon and complex-task handling, and active exploration capabilities. But this will be a long-term process, which may take three years or even longer," Wang Zhongyuan said.

Recently, Wang Zhongyuan talked with media such as Hard Krypton about his views on the world model and the connection point between the world model and embodied intelligence. The following is the interview transcript (slightly edited):

Four Paths of the World Model

Hard Krypton: Why did the BAAI Conference focus on the "world model" this year? How does it relate to the earlier large-model development path?

Wang Zhongyuan: We didn't suddenly propose the concept of the world model. As early as the 2024 BAAI Conference, we made a prediction about the development path of artificial intelligence: after large language models, we will enter the native unified multimodal stage, then combine with the physical world and hardware, move further towards AI for Science in the micro-world, and finally arrive at physical AGI.

This year's BAAI Conference has two major themes: the world model and agents. Agents are very popular now, and AI coding in particular is booming; the world model is the next-generation base-model problem that we believe AI must face when moving from the digital world to the physical world.

Hard Krypton: What are the current technical paths for the world model?

Wang Zhongyuan: There are currently four mainstream paths:

The language-centered path (such as Gemini 3): it can perceive multimodal data, reason about and describe the next state through language, and has planning and decision-making capabilities.

The pixel-centered path (such as Sora): it is suitable for video generation but does not understand physical causality.

The 3D-structure-centered path (such as the Marble model from Fei-Fei Li's World Labs): it targets digital-world simulation such as the metaverse and games.

The visual-representation-centered path (such as Yann LeCun's V-JEPA series): it predicts compressed visual representations, but the evolution of visual embeddings does not equal the evolution of physical laws.

Hard Krypton: Which path does the BAAI's world model belong to?

Wang Zhongyuan: We prefer to learn world knowledge in the latent space, that is, Latent Relation. We try to truly compress world knowledge into the latent space and then output Language, Action, and Vision through different decoders.

BAAI has chosen to explore integrating the language-centered and visual-representation-centered categories. The reason is simple. The world model needs not only to "see" the physical world but also to "understand" it and "make decisions." For example, when a human sees a half-filled glass of water being knocked over, the brain automatically predicts "the direction of the water flow and the influence of the ground material on the flow rate." This ability requires a deep integration of visual signals and language reasoning, rather than just generating pictures.

I also agree with Yann LeCun's judgment on the "limitations of large language models," but I don't think language models are unimportant. Language is the carrier of human knowledge, and giving up language means giving up the physical common sense accumulated by humans.

Hard Krypton: Many companies now call video-generation models world models. What's your view?

Wang Zhongyuan: I clearly believe that video generation is not equal to the world model. The term "world model" is widely used now, largely because OpenAI used the term "World Simulator" when releasing Sora.

Describing video generation as a world simulator is relatively accurate, but a video-generation model is not in itself equivalent to a world model. The World Action Model, popular this year, combines video and actions, but it also cannot fully represent a world model in the true sense.

In my view, a real world model should be the next-generation base model for the real physical world. It is not just about generating a seemingly realistic video but about understanding state changes, action causality, and long-term sequence consistency in the real physical world, and about generalizing across it.

The core of the language-model era is Next Token Prediction, that is, predicting the next token. The core of the world-model era should be Next Physical State Prediction, that is, predicting the next physical state.

Language models can be driven by prompts, while world models need to be driven by states. Language models are more like passive observers, while world models must actively interact. Language models can be single-modal or multimodal, while world models must move towards full modality.

Hard Krypton: What capabilities does a real world base model for the physical world need to have?

Wang Zhongyuan: I think it needs at least the following capabilities.

First, it needs to be physically correct. Optical refraction, gravity, fluids, and object motion must conform to real physical laws. A video-generation model may generate a group of flying pigs, but the physical world doesn't work like that. If a robot is equipped with a "brain" that cannot distinguish between reality and fantasy, it may think it is Iron Man, which brings serious risks.

Second, it needs action-causality traceability. The model must know not only how the picture changes but also what results an action will lead to. For example, when a human sees a bottle of water about to fall, they naturally predict different consequences depending on whether it has a lid. The world model needs to learn this relationship between actions and consequences.

Third, it needs long-term sequence consistency. Many video-generation models can generate 5-second, 10-second, or even 1-minute videos, but that doesn't mean they really understand time. Suppose you are pouring water into a bottle with a clock beside it: when the camera moves away and then back, the model should know that 10 or 20 seconds have actually passed in reality, rather than randomly generating a plausible-looking picture.

Fourth, it needs to have generalization ability. The world base model must be applicable to multiple downstream scenarios, just like large language models can be used for many tasks. It cannot be just a tool for a specific scenario but should be able to serve various tasks such as embodied intelligence, physical simulation, and scientific prediction.

The World Model May Become the Real Brain of Robots

Hard Krypton: What is the biggest problem with current embodied intelligence?

Wang Zhongyuan: I think embodied intelligence is still at a very early stage. Most current embodied models are single-scenario, passive task executors. They can work in specific factories, on specific tasks, with specific data. For example, when they see a package, they perform tasks such as grabbing, sorting, and placing. In such scenarios, VLA or simpler models may be effective.

However, the problem is that they generalize poorly. The real physical world is complex, with time, space, physical laws, and the many tools and environmental changes created by humans. If robots only passively execute instructions and solve problems one scenario at a time, it will be difficult to deploy them at real scale.

I think the world model and embodied intelligence complement each other. Embodied intelligence has exposed the current shortcomings of models in the physical world, and the world model aims to make up for these shortcomings.

Hard Krypton: There are also some views saying that "VLA is dead." Is the world model a necessary path for embodied intelligence? What is the relationship between them?

Wang Zhongyuan: My judgment is that VLA is the present, and the world model is the future.

VLA is certainly useful and has great value. It can promote the deployment of robots in specific scenarios. In some specific scenarios, a more complex world model may not be needed. As long as the robot sees a package and performs actions and collects specific data, it can complete the task.

However, VLA has limitations. For example, its generalization ability is insufficient, as is its ability to handle long-horizon tasks and complex scenarios and to understand spatial and physical laws. Moreover, VLA models are often large, so they respond slowly and with high latency once deployed, which may not meet the requirements of high-frequency action execution in the real physical world. The world model aims to solve more fundamental problems.

Hard Krypton: Can you give a specific example of how the world model helps robots perform tasks?

Wang Zhongyuan: The world model should not only generate data or videos. Its more important ability is to predict the possible future states based on the current context and state and make the optimal decision at the moment.

You can use an analogy to understand it. It's a bit like Doctor Strange seeing different futures and then choosing the optimal result. When a robot faces a real environment, it also needs a similar ability. It needs to understand the current environment, predict the consequences of different actions, and then choose the most appropriate action.

For example, in a hotel or home environment, when a robot sees a door, a hand gesture, a voice command, and the state of a room, it needs to make a judgment based on historical memory and the current context: should it close the kitchen door, the room door, or take other actions? This is not a simple image recognition or a simple execution of language instructions but a physical decision after complex reasoning.

This is what we mean by the world model commanding the physical entity of an agent to perform actions. After execution, it also needs to continue to collect feedback, evaluate whether the task is completed, and enter the next round of state prediction and decision - making.
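The predict, choose, act, and check loop described in this answer can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not BAAI's system: the "world" is a single integer position, `predict_next_state` is a hand-written stand-in for a learned world model, and the candidate actions and cost function are invented for the example.

```python
from typing import Callable, List

State = int   # toy state: the robot's position on a line
Action = int  # toy action: move -1, 0, or +1

CANDIDATE_ACTIONS: List[Action] = [-1, 0, 1]


def predict_next_state(state: State, action: Action) -> State:
    """Stand-in for a learned world model: imagine the result of an action."""
    return state + action


def cost(state: State, goal: State) -> int:
    """How far a state is from the goal; zero means the task is done."""
    return abs(goal - state)


def choose_action(
    state: State,
    goal: State,
    model: Callable[[State, Action], State],
) -> Action:
    """Imagine each candidate's outcome and pick the one closest to the goal."""
    return min(CANDIDATE_ACTIONS, key=lambda a: cost(model(state, a), goal))


def real_step(state: State, action: Action) -> State:
    """The real environment; here it happens to match the model exactly."""
    return state + action


def run_closed_loop(start: State, goal: State, max_steps: int = 10) -> List[State]:
    """Predict, act, observe feedback, and repeat until the goal is reached."""
    state = start
    trajectory = [state]
    for _ in range(max_steps):
        if cost(state, goal) == 0:  # feedback check: is the task complete?
            break
        action = choose_action(state, goal, predict_next_state)
        state = real_step(state, action)  # execute, then observe the new state
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    print(run_closed_loop(start=0, goal=3))  # prints [0, 1, 2, 3]
```

Because each round re-reads the actual state before planning again, the loop would still correct itself if `real_step` drifted away from what the model predicted, which is the role feedback plays in the description above.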

Hard Krypton: Will the world model eventually become the real brain of robots?

Wang Zhongyuan: This is our expectation. We hope that the future world model can be deployed on real machines in embodied scenarios and solve problems that VLA, VLM, and traditional action execution cannot solve. Robots need generalization ability, long-horizon task ability, complex reasoning ability, and active exploration ability. The model needs not only to understand the world but also to understand the consequences of actions, plan future states, command robots to perform actions, and correct decisions based on feedback.

This will not be completed in the short term. Building the world model is a long-term process. It aims not at short-term applications but at the core capability of the next AI era. In the short term, embodied robots will first be deployed in specific scenarios, collect data, and form a closed loop. In the long term, the world model has the opportunity to become the real brain of robots.
