World Models: "Building a World" Is Feasible, but It Is Not the Future That Embodied AI Is Pursuing
From VLA to WAM: An Overestimated Revolution and an Underestimated Evolution.
Over the past six months, two spectacles have dominated public discussion of embodied intelligence. One belongs to the screen: from Sora onward, a succession of video generation models has shown off its capabilities. The detail of a glass of water spilling and the movement of people through continuous space pushed the narrative of "AI recreating reality" to its peak, and cries of "the world model is here" echoed everywhere. The other belongs to the tombstone: Jim Fan, the chief research scientist at NVIDIA, posted a meme of WAM (World Action Model) standing in front of the tombstone of VLA (Vision-Language-Action model) to declare "VLA is dead, long live the world model", bringing the dispute over technical routes to the forefront. (This article discusses only world models for embodied intelligence.)
These two spectacles share the same core term: the world model.
Paradoxically, the more the term is discussed in embodied intelligence, the blurrier it becomes. Some apply it to video generation models that produce realistic footage, some to the rehearsal of robot actions before execution, and some to autonomous driving simulation environments. A single concept covers completely different technical goals and business demands.
The greatest danger facing the world model today is not that it is poorly defined, but that its entire value is being defined by whatever is easiest to demonstrate and most likely to go viral. When the showmanship of "creating a world" overshadows the substance of "using the world", the best storytellers lead the world model away from its real destination: the real physical scenarios of Physical AI.
The world model certainly needs the ability to "create a world". Without those striking generation demos, it would not have caught the attention of the public and of capital so quickly. But for the Physical AI industry, generating a world is only the beginning of the problem. That world ultimately has to be controlled, verified, and corrected, and finally become a rehearsal space and a basis for decisions before a machine acts. Video generation can open the door for the world model, but it cannot carry it the rest of the way into the real physical world.
We are never short of new concepts and new narratives. Embodied intelligence will surely find its own general path. By then, whether this path is called VLA, WAM, or something else may no longer matter at all.
After all, by then it will already be embedded in our lives.
The World Model Is Not Quite the Same as "Generating Images"
Remember Sora?
When OpenAI released Sora, its technical report was titled "Video generation models as world simulators", and it presented video generation models as a promising path toward "general purpose simulators of the physical world". The long videos Sora demonstrated at the time, with their camera movement, local 3D consistency, and object persistence, gave the public its first intuitive sense that AI really seemed to be learning to "build a world". Compared with text and pictures, video matches human intuition about the "world" more naturally: it has time, space, movement, and continuous change, which easily creates the illusion that "the model has mastered physical laws".
This kind of ability lends itself to press conference demos and is the most likely to attract capital and media attention. Over time, "video generation = world model" became the default starting point for many people's understanding.
This is not wrong in itself. In digitally native scenarios, the video generation route is an efficient solution, and a number of unicorn companies have emerged from it. In the game industry, their products can generate dynamic scenes in real time, which both cuts art costs and gives players more freedom. In fields where trial and error is expensive, such as aerospace and high-end manufacturing, using it to push test boundaries and enrich simulation scenarios also has clear commercial value. In these cases the generated "world" is not footage for an audience to watch but an interactive simulation environment in which mistakes are affordable.
The real misreading happens when the concept crosses that boundary. When the world model meets embodied intelligence, many people assume by default that a model able to generate a continuous, realistic digital world has thereby mastered the ability to understand, predict, and act in the physical world.
Wang Zhongyuan, the dean of the Beijing Academy of Artificial Intelligence, has offered a sharp judgment on this: the video generation technology now widely treated as representative of the world model is essentially just pixel-level world simulation. "A video generation model can generate a group of pigs flying in the sky with an airplane because its training data contains a large amount of science-fiction movie content, and its goal has never been to restore the laws of the real physical world."
A classic embodied scenario is enough to show the gap: grasping a cup. A model can generate a cup that looks consistent from different viewpoints; that is visual consistency, something it can learn from video data. But once a hand reaches out and touches the cup, how much friction is there? Can the material withstand the grip force applied? When the cup comes to rest on the table, is that because the model remembers that "cups are usually on tables", or because it actually understands gravity, support force, and contact constraints? Complex mechanical responses, state changes after contact, and the causal constraints of real physical laws are not something a generated video can cover. If a generated clip of a car sliding sideways is fed into an autonomous driving training pipeline without verification, the real physical world is sure to hit back painfully.
In other words, video generation is one manifestation of the world model and is already deployed in many scenarios, but it is by no means the world model that embodied intelligence requires, let alone the core form in the context of Physical AI. Defining the world model of embodied intelligence by the visual effect of "creating a world" amounts to measuring the problems of the physical world with the ruler of the digital world.
Is VLA Dead? The World Model Is Not a Revolution but a Complement
"VLA is dead, WAM takes over" is the most popular narrative within the industry.
For the past two years, VLA has been the mainstream path for embodied intelligence. It follows the pre-training approach of large language models and learns a "perception-instruction-action" mapping from large amounts of teleoperation data, allowing robots to move from rigid, repetitive motions to understanding natural language and decomposing complex tasks. All mainstream players in the industry have at some point used VLA as their core technical foundation.
However, VLA's shortcomings are just as clear: it is essentially memorization and mapping produced by imitation learning, and it lacks a deep understanding of physical laws. When it meets new scenarios and objects absent from its data, its generalization breaks down quickly. The WAM route advocated by Jim Fan targets precisely this pain point. Its core logic is a shift from "semantic understanding" to "physical prediction": instead of outputting actions directly, the model first predicts the future state of the world and then infers the action sequence that leads there. This amounts to letting the robot "rehearse" the consequences in its head before acting, which improves its adaptability to unfamiliar scenarios.
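The "predict first, then recover the actions" idea can be made concrete with a toy sketch. Everything here is an assumption for illustration only: a one-dimensional robot whose action is a velocity command, a hand-written predict_future_states function standing in for a learned world model, and made-up constants DT and MAX_SPEED. It is not how any actual WAM is built; it only shows the order of operations.

```python
DT = 0.1         # control step in seconds (assumed value)
MAX_SPEED = 1.0  # hypothetical actuator limit, units per second


def predict_future_states(x: float, goal: float, horizon: int) -> list[float]:
    """Stand-in for a learned world model: imagine the next few positions."""
    states = []
    for _ in range(horizon):
        # Move toward the goal, but never farther than the actuator allows.
        step = max(-MAX_SPEED * DT, min(MAX_SPEED * DT, goal - x))
        x = x + step
        states.append(x)
    return states


def inverse_dynamics(x_now: float, x_next: float) -> float:
    """Recover the velocity command that moves x_now to x_next in one step."""
    return (x_next - x_now) / DT


def plan(x: float, goal: float, horizon: int) -> list[float]:
    """Predict future states first, then back out the action sequence."""
    actions = []
    for x_next in predict_future_states(x, goal, horizon):
        actions.append(inverse_dynamics(x, x_next))
        x = x_next
    return actions


def execute(x: float, actions: list[float]) -> float:
    """Toy 'real' dynamics: integrate the velocity commands."""
    for a in actions:
        x += a * DT
    return x
```

Calling plan(0.0, 0.35, 5) and passing the result to execute(0.0, ...) brings the toy robot to the goal. A direct-mapping policy in the VLA style would skip predict_future_states and emit the actions in a single step.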
So the "disruption theory" quickly took hold: VLA is an outdated paradigm, and the world model is the next-generation answer for embodied intelligence. In real industrial practice, however, things are nowhere near as simple as "life or death".
The industry is splitting into two clear routes, behind which are different technical philosophies and business demands:
One is the "replacement school" led by Silicon Valley. Represented by NVIDIA and Google DeepMind and backed by ample compute and data, it pursues a complete reconstruction of the paradigm. In Cosmos 3, NVIDIA brings language, images, video, and action sequences into a single Physical AI world model framework, trying to make generation, simulation, and action prediction no longer separate modules. The Waymo World Model, launched jointly by Waymo and Google DeepMind and built on the Genie 3 model, is used not only to generate long-tail scenarios such as rare weather and animal intrusions, but also to make those scenarios controllable through driving actions, road layouts, and language conditions, so that the autonomous driving system's response can be tested in counterfactual situations.
This route is the most ambitious and most in line with the "revolutionary narrative", but the threshold is extremely high, making it a game for top giants.
The other is the "integration school", more common in China. Most players do not start from scratch; they treat the world model as a complement to VLA's capabilities and embed it in their existing architecture. In May 2026, Zhifangfang released the VLA embodied large model AlphaBrain. It borrows the "brain-cerebellum-torso" division of labor from the human body and, through the cooperation of a "fast-slow system", embeds the world model's "rehearsal" ability into the VLA architecture. The slow system handles situational awareness of the environment and high-level behavior planning, while the fast system handles fine-grained sensing and rapid feedback. Guo Yandong, the founder of Zhifangfang, put it plainly: "The world model and VLA are not in conflict at all. They are branches of the same technical route to begin with. If you want to do more long-horizon reasoning tasks, you need the world model + VLA, or the world model combined with VLA."
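A fast-slow division of labor of this kind can be pictured with a minimal sketch. This is not AlphaBrain's architecture, which the article does not detail; the one-dimensional task, the SLOW_EVERY ratio, the waypoint clamp, and the proportional gain are all hypothetical choices made only to show a slow planner and a fast controller running at different rates.

```python
SLOW_EVERY = 10  # hypothetical: the slow planner runs once per 10 fast ticks


def slow_plan(position: float, goal: float) -> float:
    """Slow system: choose the next waypoint, at most 1.0 unit toward the goal."""
    delta = goal - position
    return position + max(-1.0, min(1.0, delta))


def fast_control(position: float, waypoint: float, gain: float = 0.5) -> float:
    """Fast system: proportional correction toward the current waypoint."""
    return gain * (waypoint - position)


def run(goal: float, ticks: int) -> tuple[float, int]:
    """Run both systems together; return the final position and replan count."""
    position, waypoint, replans = 0.0, 0.0, 0
    for t in range(ticks):
        if t % SLOW_EVERY == 0:
            # High-level planning happens rarely.
            waypoint = slow_plan(position, goal)
            replans += 1
        # Low-level feedback happens on every tick.
        position += fast_control(position, waypoint)
    return position, replans
```

With run(3.0, 50), the slow system replans five times while the fast system corrects on every tick. In a combined design, the world model's "rehearsal" would most naturally sit inside the slow loop.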
Yinhe Tongyong has also gone far down this road. The LDA-1B model it released in April this year performs policy learning, physical prediction, and visual perception simultaneously within a unified framework, and for the first time unifies the world model and the action model at an industrial scale of 1 billion parameters. The work has been accepted at RSS, a top robotics conference, and the model weights and training code have been open-sourced. Rather than agonizing over "VLA or world model", the team takes the more pragmatic approach of letting prediction and execution share one model, so that each compensates for the other's weaknesses.
In our view, neither "replacement" nor "integration" is absolutely right or wrong; they are different choices at different stages. VLA will not really "die", and the world model is not a revolution that overturns everything. It supplies the physical prediction ability that VLA lacks most. The eventual relationship between the two is more likely to be hierarchical cooperation than a fight to the death. What really decides the outcome is never how fashionable a concept is, but who is first to run the full chain of data, simulation, and real-machine deployment and get robots into real scenarios.
The World Model Has Yet to Be Deployed, but the Hype Has Already Inflated
When a concept's popularity outruns its technical deployment, bubbles are almost inevitable. In the current world model race, at least three layers of bubble deserve attention.
The first layer is the definition bubble. Today, "world model" has become a catch-all term. Yann LeCun sees it as prediction of the world state at an abstract level, Fei-Fei Li defines it as an interactive 3D spatial representation, and NVIDIA positions it as a generative simulator for Physical AI. Among startups, some pass off video generation as a world model, and some simply rename a traditional simulation engine. Dozens of Chinese companies claim to be building world models, but they may not be talking about the same thing at all. When a technical concept can be interpreted without limit, it tends to lose its value as a technical benchmark. Behind the stretched definition are financing needs and marketing narratives pushing together. After all, a "world model" is always worth more than a "video generation tool" or a "simulation optimization solution".
The second layer is the compute bubble. The mainstream way to train a world model relies on massive video data and enormous compute, which is precisely NVIDIA's home turf. Jensen Huang said bluntly at the GTC conference that by 2027 the Blackwell and Rubin chips, together with supporting systems designed for embodied intelligence models, would bring NVIDIA at least $1 trillion in revenue. In a sense, the strong push by top Silicon Valley players for the "full-modal general world model" route lines up closely with NVIDIA's business logic of "selling compute infrastructure". For most companies, however, the investment this route demands is an almost bottomless pit. Small and medium-sized teams that previously bet on VLA can hardly absorb a sunk cost of that size, let alone teams entering the world model race from scratch. When everyone is discussing the same compute-hungry route but few can work out the return on investment, that is the sign of a bubble.
The third and most fatal layer is the deployment bubble. Every conceptual narrative ultimately has to answer the same question: does it actually improve the performance of real machines? The reality is that the sim-to-real gap will not disappear just because the model's name changes from VLA to WAM. A slight case of objects clipping through each other, defying gravity, or showing blurred boundaries in a video hardens into wrong physical knowledge once that video is used for robot training. A prediction that looks reasonable but violates physical laws can mislead a real machine more seriously than training without the model at all.
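One way to picture the verification this implies is a simple plausibility filter applied to generated trajectories before they reach training. The sketch below is a deliberately narrow assumption: it handles only a single object in free fall, sampled as heights above a floor at z = 0, and the tolerance tol is an arbitrary illustrative value. Real pipelines would need far richer checks.

```python
G = 9.81  # gravitational acceleration in m/s^2


def violations(heights: list[float], dt: float, tol: float = 0.5) -> list[str]:
    """Return the physics checks that a free-fall height trace fails."""
    found = []
    if any(z < 0.0 for z in heights):
        # The object passed through the floor at z = 0.
        found.append("penetration")
    # The second finite difference estimates vertical acceleration.
    for z0, z1, z2 in zip(heights, heights[1:], heights[2:]):
        accel = (z2 - 2.0 * z1 + z0) / dt**2
        if abs(accel + G) > tol:
            # An unsupported object is not accelerating downward at about g.
            found.append("non-gravitational acceleration")
            break
    return found
```

A trace that hovers in mid-air, or one that sinks below the floor, is flagged and can be dropped from the training set, while a trace that follows z = z0 - 0.5 * G * t^2 above the floor passes.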
Shen Yujun, the chief scientist of Ant Lingbo, once pointed out the core difference: a generative model for the digital world can pursue high definition and realism, and it does not matter if it is a little slow. A model for the physical world, by contrast, must first of all be fast, stable, and accurate; it has to output feedback in real time and support action. Many teams are obsessed with making digital scenes ever more realistic while overlooking that data from real physical interaction is the scarcest resource. A world model can post beautiful metrics in simulation, but until it has proven its value on factory production lines, in logistics warehouses, and on open roads, it remains a laboratory exploration rather than industrial-grade infrastructure.
So what should a world model for Physical AI, or embodied intelligence, look like? The answer lies not in press conference demo videos but in the needs of real scenarios. Its core evaluation criterion is not "how realistic the generated world is" but "whether it helps machines act better in the physical world": whether it lowers the cost of trial and error, whether it improves generalization, and whether it can be embedded in a real closed business loop.
Judging from current industrial practice, the players who are truly on the right track are all doing the same thing: shifting the world model from "demo-oriented" to "task-oriented". In other words, the ultimate form of the world model is not a standalone "product" but a basic capability embedded in all kinds of physical systems. It sits in the simulation backend of autonomous driving, in the action planning module of robots, and in the prediction systems of factory production lines, quietly doing the work of prediction, trial and error, and correction. Most of the time, users may not even be aware that it exists.
That will be the era of the world model, and of course, it doesn't have to be called the world model.
This article is from the WeChat official account "A Priori Laboratory", author: Vincent, published by 36Kr with authorization.