Two papers from NVIDIA point toward a new paradigm for embodied intelligence beyond VLA.
In 2025, the hottest term in embodied intelligence was VLA (Vision-Language-Action model).
It became an industry-wide consensus, the default answer for embodied foundation models. Over the past year, capital and compute have poured into this track, and essentially every major model maker has adopted the paradigm.
But the real physical world soon poured cold water on practitioners, because VLA is weak at actually executing physical actions.
It can understand extremely complex text instructions, but when a robotic arm actually tries to grasp an object, it may fail even at adjusting the wrist pose to avoid the obstruction of a cup handle, let alone perform actions involving complex physical deformation such as untying shoelaces.
Another fatal pain point of VLA is generalization. The whole point of moving to large models was to avoid programming for every specific environment and to rely on the models' ability to generalize. Yet VLA can barely generalize any action beyond the environments it was trained on, and it struggles even in environments similar to the training environment.
The industry largely attributes this failure to generalize to a lack of data. Big companies have begun investing billions of dollars to collect data in every possible way, trying to fill VLA's common-sense gap with massive numbers of simulated demonstrations.
However, in early 2026, NVIDIA published two papers, "DreamZero: World Action Models are Zero-shot Policies" and "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos," constructing a brand-new paradigm for embodied foundation models and breaking the deadlock of the data arms race.
Together, they present the possibility of an embodied model that learns entirely from video and generalizes to new tasks in a zero-shot manner.
01 What VLA Lacks is Not Data, but a World Model
To understand what makes DreamZero and DreamDojo subversive, we first need to analyze VLA's systemic defects from the ground up.
The biggest problem with VLA is the lack of a world model. Its underlying architecture limits how it can perceive the world. In terms of lineage, VLA is far closer to the LLM than to pure vision or pure physics. It maps pixel patches of an image into the semantic space of text through cross-attention, and in that space it understands the concepts of a cup and a table and their relative positions in a two-dimensional image.
However, the physical world is not a two-dimensional semantic slice. The physical world is continuous, full of mass, friction, gravity, and geometric collisions.
VLA has a relatively weak understanding of physical actions and the world because it is essentially a "translator".
We can explain this with the state-transition formulation from physics. A complete world model essentially learns a conditional probability distribution: given the current state of the world (the visual observation) and the action the robot is about to take, it predicts what the world will look like in the next instant.
VLA never learns this equation. It learns a function that maps static visual observations plus language instructions directly to executable actions; it is never systematically trained to predict the consequences of its actions or to test them counterfactually. So once the environment, materials, or constraints change even slightly, its performance collapses.
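Written schematically (an illustrative contrast, not notation taken from either paper):

```latex
% A world model learns the state-transition (dynamics) distribution:
% given the current state s_t and an action a_t, predict the next state.
p_\theta\!\left(s_{t+1} \mid s_t,\, a_t\right)

% A VLA policy instead learns a direct observation-to-action mapping,
% conditioned on a language instruction \ell, with no forward prediction:
\pi_\phi\!\left(a_t \mid o_t,\, \ell\right)
```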
This is like asking a person to memorize the answers to ten thousand geometry problems without understanding geometric principles. When encountering the original questions, he can quickly write perfect answers; but when facing new questions with slightly changed conditions, he will completely break down.
The generalization of VLA is essentially interpolation in a high-dimensional semantic space. Once the physical situation falls outside the envelope of the training set, the interpolation fails.
Video generation models are a different story. The physical interaction scenes generated by Veo 3, Sora 2, and the recently popular Seedance 2 are strikingly realistic. The motion of fluids, rigid bodies, and deformable materials is so coherent that it is almost indistinguishable from the real world. This suggests that large-scale video generation models may have implicitly compressed and internalized the basic operating laws of the physical world from vast amounts of Internet video, forming a kind of world model.
Yet for all this power, video generation models had mainly been used to provide simulation data for VLA, rather than being integrated into the robot's own control loop.
In fact, the idea of using video generation models to control robots did not start here. Before DreamZero, academia and industry had proposed several approaches, but all of them ran into engineering and logical dead ends.
One example is LVP (Large-Scale Video Planner). Its idea is to generate, from an image and a sentence, a future video plan of how to complete the task, then reconstruct the human hand motion in that video into a 3D trajectory. It makes video pre-training, rather than language pre-training, the backbone of the robot's basic capability.
The second approach resembles NVIDIA's own DreamGen: generate a video first, then infer the actions from it. This was the route with the highest expectations. It splits the foundation model into two halves: the upper half is a video model responsible for predicting the future, and the lower half is an independently trained inverse dynamics model (IDM) that infers and outputs actions from the predicted video.
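A minimal sketch of that phased "video-then-action" idea (the function and object names here are hypothetical placeholders, not DreamGen's actual code):

```python
# Hypothetical sketch of a phased pipeline:
# Stage 1: a video model imagines the future;
# Stage 2: an IDM (inverse dynamics model) back-solves the action between frames.

def phased_policy(video_model, idm, observation, instruction):
    # Stage 1: predict a short clip of the imagined future.
    future_frames = video_model.generate(observation, instruction)  # [T, H, W, 3]

    # Stage 2: infer the action that moves the robot from frame t to t+1.
    actions = []
    prev = observation
    for frame in future_frames:
        actions.append(idm.infer(prev, frame))
        prev = frame
    # Any pixel artifact in Stage 1 corrupts every action computed here.
    return actions
```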
The biggest problem with both of these phased approaches is the misalignment between action and video generation. The action stage demands high precision, but generated videos are never perfect. Once the predicted future frames contain tiny pixel artifacts or physical hallucinations, both the IDM and point tracking get confused, and errors are amplified exponentially. If the robot's fingers in the video are off by a micrometer, the real-world robot will fail to grasp anything at all. The robustness is extremely poor.
The third method is Unified Video-Action (UVA), joint video-action generation. This is the most sophisticated of the three. It learns video and action in the latent space of a single diffusion model, handling both video prediction and action prediction, and at inference it skips video generation through "decoding decoupling" to stay fast. However, it uses a bidirectional diffusion architecture: to match the length of the language instructions, the generated video sequence has to be heavily compressed, which badly distorts the original flow of video time. With time distorted, it is nearly impossible to align action commands with visual frames, so the method's generalization is naturally poor.
Beyond that, all of these methods share a fatal flaw: they are too slow. Video diffusion models need many iterative denoising steps, so generating a few seconds of actions often takes tens of seconds of computation. If a robot needs five minutes to put a bowl into a cabinet, just watching it would exhaust your patience.
This is why, among the new embodied-intelligence companies before 2026, essentially only 1X Technologies, which recently launched a household robot, kept pushing this video-prediction route. They rely on a large amount of "Shadow Mode" data: while humans teleoperate the robot, the model runs predictions synchronously in the background, and this extremely high-quality paired data is used to train that fragile IDM.
However, temporary failure does not mean that the direction is wrong.
At last year's robot conference, I interviewed many domestic scholars working on embodied intelligence. Google's Veo 3 and Genie 3 had been released not long before, and most of them were deeply impressed by the world-understanding ability of video generation models.
So in those conversations they almost unanimously suggested that generation might be the most reliable path forward for embodied intelligence, more promising than generating data in simulated environments. Simulators (such as Isaac Gym or MuJoCo) are limited by hand-coded physics engines and can never exhaust the complexity of real-world materials, the variability of light and shadow, or the non-linearity of contact forces. A generative model that has absorbed all of humanity's video data is the real super-simulator, one that contains the physical laws of everything.
At the time, though, this thinking still stayed at the level of "data"; the idea of video generation replacing VLA had hardly come into view.
However, NVIDIA's research may be the turning point that makes this idea an effective engineering path for the first time.
02 DreamZero, Embodied Intelligence Based on a World Model
As noted above, past attempts to drive robot actions with video generation models ran into three main problems.
The first is the misalignment caused by phased, two-stage pipelines. The second is that the unified approach distorts the video timeline so badly that it is hard to use. The third is that generation is far too slow. DreamZero is NVIDIA's first answer to all three.
First, DreamZero adopts end-to-end training that predicts video and action synchronously, which solves the misalignment problem of the earlier phased models.
Second, in response to UVA's spatio-temporal disorder, DreamZero abandons the earlier bidirectional architecture entirely and instead builds a 14B-parameter autoregressive Diffusion Transformer (DiT), the current standard architecture for video generation models. It predicts video and actions strictly in chronological order, left to right, the way a language model generates text, and it predicts both in the same diffusion forward pass.
This brings two benefits. First, the original frame rate is preserved, so actions and frames stay exactly aligned on the time axis. Second, it can use a KV Cache (Key-Value Cache): the model never has to re-encode its history from scratch, which saves a great deal of compute.
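A rough sketch of what this chunk-by-chunk autoregressive rollout with a KV cache might look like (the interfaces are illustrative placeholders, not the actual DreamZero API):

```python
# Illustrative sketch: autoregressive, chunked prediction with a KV cache.
# The model emits short chunks of (video latents, actions) strictly in time
# order; cached keys/values mean history is never re-encoded from scratch.

from dataclasses import dataclass, field

@dataclass
class KVCache:
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def rollout(model, first_observation, instruction, num_chunks):
    cache = KVCache()
    # Encode the real starting frame once; it becomes the first cache entry.
    k, v = model.encode(first_observation, instruction)
    cache.append(k, v)

    trajectory = []
    for _ in range(num_chunks):
        # One forward pass predicts the next ~1.6 s of video latents AND the
        # matching action chunk, conditioned only on the cached history.
        video_chunk, action_chunk, k, v = model.predict_chunk(cache)
        cache.append(k, v)          # history grows; nothing is recomputed
        trajectory.append((video_chunk, action_chunk))
    return trajectory
```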
Then, to tackle the error accumulation and hallucination that autoregression brings, DreamZero introduces real-world observation injection.
The model predicts the frames and actions for the next 1.6 seconds, and the robot executes them. The moment the actions finish, the camera captures the absolutely real image of the current physical world, which is encoded directly and pushed into the KV Cache, overwriting the fake frames the model had just imagined.
This step instantly cuts off the causal chain of error accumulation: the model is always forced to reason about its next step from an absolutely real physical footing.
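Continuing the sketch above (still hypothetical interfaces; `robot` and `camera` are placeholders), the execute-then-inject loop could look like this:

```python
# Illustrative closed-loop control with real-observation injection, built on
# the chunked rollout sketched earlier. After each executed chunk, the real
# camera frame replaces the model's imagined frame in the cache.

def control_loop(model, robot, camera, instruction, num_steps):
    cache = KVCache()
    k, v = model.encode(camera.read(), instruction)    # real starting frame
    cache.append(k, v)

    for _ in range(num_steps):
        # Predict ~1.6 s of future frames plus the matching action chunk.
        predicted_frames, action_chunk, k, v = model.predict_chunk(cache)
        cache.append(k, v)

        robot.execute(action_chunk)                     # act in the real world

        real_frame = camera.read()                      # ground truth after acting
        k_real, v_real = model.encode(real_frame, instruction)
        # Overwrite the imagined entry with reality so the next prediction
        # starts from the true physical state, not a hallucinated one.
        cache.keys[-1], cache.values[-1] = k_real, v_real
```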
Finally, and most importantly, there is the problem of slow generation.
To meet the control-frequency requirements of a robot, DreamZero introduces the DreamZero-Flash technique. Diffusion models are slow because inference requires a long denoising chain; if the number of steps is forcibly reduced (say, to a single denoising step), the quality of the generated actions collapses, because the image is still noisy and blurred and the model cannot extract accurate actions from it.
DreamZero-Flash's solution is "decoupled noise scheduling." During training, video and action are no longer kept at the same noise level: the model is forced to predict completely clean, accurate action signals while looking at extremely blurred, heavily noised visual frames. In effect, it is trained to respond correctly from physical intuition even when it cannot see the future clearly.
For a human this would be impossible: if you can't see clearly, you can't act. For the model, it works remarkably well. After this training, a single denoising step is enough to produce accurate actions at inference time, and the inference time drops from 350 milliseconds to 150 milliseconds.
That lets the system output action chunks at 7 Hz, which, combined with the low-level controller, yields reasonably smooth real-time execution.
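A toy sketch of what decoupled noise scheduling could look like in a joint video/action diffusion training step (the model signature and noise ranges are assumptions for illustration, not NVIDIA's code):

```python
# Toy sketch of decoupled noise scheduling. Video latents are trained at
# heavy noise levels, while actions start from pure noise and must be
# recovered cleanly, so a single denoising step suffices at inference time.

import torch
import torch.nn.functional as F

def add_noise(x, t):
    """Simple variance-preserving noising: x_t = sqrt(1-t)*x + sqrt(t)*eps."""
    t = t.view(-1, *([1] * (x.dim() - 1)))
    return torch.sqrt(1 - t) * x + torch.sqrt(t) * torch.randn_like(x)

def training_step(model, video_latents, actions, instruction):
    # Decoupled schedules: the video context stays heavily noised (blurry),
    # while the action branch is asked to recover clean actions in one shot.
    t_video = 0.5 + 0.5 * torch.rand(video_latents.shape[0])   # heavily noised video
    t_action = torch.ones(actions.shape[0])                    # actions from pure noise

    noisy_video = add_noise(video_latents, t_video)
    noisy_action = add_noise(actions, t_action)

    pred_video, pred_action = model(noisy_video, noisy_action,
                                    t_video, t_action, instruction)

    # x0-prediction targets: the clean video latents and the clean actions.
    return F.mse_loss(pred_video, video_latents) + F.mse_loss(pred_action, actions)
```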
With this series of improvements, DreamZero demonstrates the staggering potential of a video-generation world model.
The most striking result is generalization. In tests on the AgiBot dual-arm robot, researchers set tasks that never appeared in the training set, such as untying knotted shoelaces, taking a hat off a mannequin's head, and painting with a brush.
A VLA trained from scratch makes almost zero progress on these tasks; it can barely even get started. DreamZero's average task progress reaches 39.5%, and on some tasks (such as taking off the hat) it even reaches 85.7%.
That is because DreamZero's learning process is subversive. By jointly predicting video and actions during training, it is forced to build causal chains of how things evolve in its latent space. It knows that if the gripper does not release, the object it is holding will not fall;