
Two papers from NVIDIA have introduced a new paradigm for embodied intelligence beyond VLA.

Friends of 36Kr · 2026-02-11 17:13
We started teaching robots to dream.

In 2025, the hottest term in embodied intelligence was VLA (Vision-Language-Action model).

It became an industry-wide consensus and the standard answer for embodied foundation models. Over the past year, capital and computing power have poured into the field, and essentially every major model company has adopted this paradigm.

Soon, however, the real physical world threw cold water on practitioners: VLA is weak at actually executing physical actions.

It can understand extremely complex text instructions. But when a robotic arm actually tries to grasp an object, it may fail even to adjust its wrist posture to avoid being blocked by a cup's handle, let alone perform actions involving complex physical deformation, such as untying shoelaces.

VLA's other fatal flaw is generalization. The whole point of adopting large models was to avoid programming for each specific environment; generalization was the promise. Yet today's VLA models can hardly generalize any action beyond the environments specified in training, and they often struggle even in environments similar to the ones they were trained in.

The entire industry attributes this inability to generalize to a lack of data. Big companies have begun investing billions of dollars to collect data in every way possible, trying to fill VLA's common-sense gap with vast numbers of simulated demonstrations.

However, in early 2026, NVIDIA published two papers, "DreamZero: World Action Models are Zero-shot Policies" and "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos", constructing a brand-new paradigm for embodied foundation models and breaking the deadlock of the data arms race.

Together, they point to the possibility of an embodied model that learns entirely from video and generalizes zero-shot to different tasks.

01 What VLA Lacks Is Not Data, but a World Model

To understand the subversiveness of DreamZero and DreamDojo, we must first analyze VLA's systemic defects from the ground up.

The biggest problem with VLA is the lack of a world model; its underlying architecture constrains how it can think. In lineage, VLA is closer to the LLM than to pure vision or pure physics. It uses cross-attention to map image patches into the semantic space of text. In that space, it understands the concepts of a cup and a table and their relative positions in a two-dimensional image.
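For readers less familiar with the mechanism, here is a minimal cross-attention sketch in Python. It uses the standard formulation rather than any specific VLA's code, and every shape and dimension is invented for illustration. It shows how image patches get fused into the text token space, the "two-dimensional semantic slice" described above.

```python
import torch
import torch.nn.functional as F

D = 256                           # shared embedding width (illustrative)
text = torch.randn(1, 12, D)      # 12 instruction tokens
patches = torch.randn(1, 196, D)  # 14 x 14 image patches, pre-projected

# Text tokens query the image: each instruction token gathers a weighted
# mix of patch features, placing "cup" and "table" in the text's space.
attn = F.softmax(text @ patches.transpose(-2, -1) / D ** 0.5, dim=-1)
fused = attn @ patches            # (1, 12, D): vision fused into semantics
```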

However, the physical world is not a two-dimensional semantic slice. It is continuous, filled with mass, friction, gravity, and geometric collisions.

VLA understands physical action and the world only weakly because it is, at heart, a "translator".

We can explain this with the state-transition equation from physics. A complete world model essentially learns a conditional probability distribution: given the current state of the world (a visual observation) and the action the robot is about to take, it predicts what the world will look like a moment later.

VLA has never learned this equation. It learns a function that maps static visual observations and language instructions directly to executable actions; it has never been systematically trained to predict the consequences of actions or to check them counterfactually. So once the environment, the material, or a constraint changes even slightly, its performance falls off a cliff.
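In symbols (our notation, not necessarily the papers'), the difference looks like this:

```latex
% A world model learns a state-transition distribution: given past
% observations and the action about to be taken, predict the next frame.
\[
  \underbrace{p_\theta\!\left(o_{t+1} \,\middle|\, o_{\le t},\, a_t\right)}_{\text{world model: predicts consequences}}
  \qquad \text{vs.} \qquad
  \underbrace{\pi_\phi\!\left(a_t \,\middle|\, o_t,\, \ell\right)}_{\text{VLA policy: observation and instruction to action}}
\]
% VLA only ever fits the right-hand mapping; nothing in its training
% objective asks it to predict what the world does after taking a_t.
```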

This is like asking a person to memorize the answers to ten thousand geometry problems without understanding geometric principles. When faced with the original questions, he can quickly write down perfect answers; but when encountering new questions with slightly changed conditions, he will be completely at a loss.

The generalization of VLA is essentially just interpolation in a high-dimensional semantic space. When the physical situation falls outside the envelope of the training set, interpolation fails.

Video-generation models are a different story. The physical interactions generated by Veo 3, Sora 2, and the recently popular Seedance 2 are strikingly realistic: fluids, rigid bodies, and deformable materials move so coherently that the results are almost indistinguishable from reality. This suggests that large-scale video-generation models may have implicitly compressed and internalized the basic operating laws of the physical world from vast quantities of Internet video, forming a rudimentary world model.

Yet despite this power, video-generation models were previously used mainly to supply simulated data for VLA rather than being integrated into the robot's own workflow.

The idea of using video-generation models to control robots did not start here. Before DreamZero, both academia and industry proposed several approaches, but all of them ran into engineering and logical dead ends.

Take LVP (Large-scale Video Planner). Its idea is to generate, from a single image and a sentence, a future video plan of how to complete the task, and then reconstruct the hand motion in that video into a 3D trajectory. It makes video pre-training, rather than language pre-training, the backbone of the robot's basic capabilities.

The second approach resembles NVIDIA's own DreamGen: generate a video, then infer the actions from it. This was once a highly anticipated direction. It splits the foundation model into two halves: the upper half is a video model responsible for predicting the future; the lower half is an independently trained inverse dynamics model (IDM) that infers and outputs actions from the predicted video.

The biggest problem with both staged approaches is the misalignment between actions and generated video. Actions demand high precision, but perfect video is hard to generate. Once the generated future frames contain tiny pixel artifacts or physical hallucinations, both the IDM and point tracking become completely confused, magnifying errors exponentially. If the position of the robot's finger in the video is off by a micrometer, the real-world robot cannot grasp anything at all. Robustness is extremely poor.
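To make the failure mode concrete, here is a minimal Python sketch of the staged pattern shared by LVP and DreamGen-style systems, using the IDM variant for concreteness. The interfaces (video_model, idm) are hypothetical placeholders, not real APIs.

```python
def two_stage_control(video_model, idm, observation, instruction):
    # Stage 1: hallucinate the future as pixels. Any artifact or physical
    # hallucination in these frames is invisible to stage 2.
    future_frames = video_model.generate(observation, instruction)

    # Stage 2: infer actions purely from pairs of generated frames.
    # The IDM trusts the pixels, so a tiny spatial error in the predicted
    # gripper position becomes a real-world grasp failure.
    actions = []
    for prev_frame, next_frame in zip(future_frames, future_frames[1:]):
        actions.append(idm.infer_action(prev_frame, next_frame))
    return actions
```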

The third approach is Unified Video-Action (UVA), joint video-action generation, the most advanced of the three. It tries to learn video and actions in the latent space of a single diffusion model, covering both video prediction and action prediction. At inference time, it skips video generation through "decoupled decoding" to keep things fast. But its architecture is a bidirectional diffusion architecture: to match the length of the language instruction, the generated video sequence must be heavily compressed, which completely distorts the video's original flow of time. With time distorted, aligning action commands with visual frames becomes nearly impossible, so the method's generalization is naturally poor.
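The temporal-compression complaint can be made concrete with a small sketch. Below, a long sequence of per-frame latents is pooled down to a fixed slot count so it can be jointly diffused alongside a short instruction; all shapes are invented for illustration and are not taken from the UVA paper.

```python
import torch
import torch.nn.functional as F

T_VIDEO, T_COMPRESSED = 240, 16          # e.g. 8 s at 30 fps -> 16 slots
video = torch.randn(1, T_VIDEO, 512)     # per-frame video latents

# Temporal pooling: every compressed slot now spans 15 real frames, so
# "the action at slot k" no longer names a single moment in time.
compressed = F.adaptive_avg_pool1d(video.transpose(1, 2), T_COMPRESSED)
compressed = compressed.transpose(1, 2)  # (1, 16, 512): time axis distorted
```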

On top of this, all of these methods share a fatal flaw: they are slow. Video diffusion models require many iterative denoising steps, and generating a few seconds of actions often takes tens of seconds of computation. If a robot needs five minutes to put a bowl into a cabinet, you will likely lose patience watching it.

That is why, among the new embodied-intelligence companies before 2026, almost only 1X Technologies, which recently launched a household robot, kept pursuing this video-prediction route. They rely on large amounts of "shadow mode" data: while a human teleoperates the robot, the model runs its predictions in the background, and this extremely high-quality paired data is used to brute-force the training of the fragile IDM.

However, temporary failure does not mean that the direction is wrong.

At last year's robotics conference, I interviewed many embodied-intelligence researchers in China. Google's Veo 3 and Genie 3 had just been released, and most of them were deeply impressed, having come to appreciate the world-understanding ability of video-generation models.

In those conversations, they proposed, almost unanimously, that generation might be the most reliable path for embodied intelligence going forward, and more promising than generating data in simulated environments. Simulators such as Isaac Gym or MuJoCo are limited by their hand-coded physics engines and can never exhaust the complexity of real-world materials, the variability of light and shadow, or the nonlinearity of contact forces. A generative model that absorbs all of humanity's video data is the true super-simulator, one that contains the physical laws of everything.

At that time, however, this thinking still remained at the level of "data"; the idea of video generation replacing VLA had not yet truly come into view.

NVIDIA's research may be the turning point that, for the first time, makes this idea an effective engineering path.

02 DreamZero: Embodied Intelligence Built on a World Model

As mentioned above, there have historically been three main problems with using video-generation models to drive robot actions.

The first is the misalignment caused by staged generation. The second is that the unified model's distorted handling of time makes it nearly unusable. The third is sheer slowness. NVIDIA's first answer to all three is DreamZero.

First, DreamZero is trained end to end to predict video and actions simultaneously, which resolves the misalignment problem of the earlier staged models.

Second, to address UVA's spatiotemporal confusion, DreamZero abandons the bidirectional architecture entirely and instead builds a 14B-parameter autoregressive Diffusion Transformer (DiT), the current standard architecture for video generation. Like a language model generating text, it predicts video and actions strictly in chronological order, left to right, and it predicts both in the same diffusion forward pass.

This brings two benefits. First, the original frame rate is preserved, so actions and frames are exactly aligned on the time axis. Second, it uses a KV cache (key-value cache): the model never has to re-encode historical frames from scratch, saving enormous amounts of compute.
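A minimal sketch of what such a rollout could look like, under our own simplifying assumptions: dit and kv_cache are placeholder objects, and the chunk sizes are invented, not taken from the paper.

```python
import torch

CHUNK_LEN, VIDEO_DIM, ACTION_DIM = 16, 1024, 32  # illustrative sizes

def rollout(dit, kv_cache, num_chunks, num_denoise_steps):
    """Autoregressively generate (video, action) chunks, left to right."""
    outputs = []
    for _ in range(num_chunks):
        # Each new chunk starts as pure noise in both modalities.
        video_tokens = torch.randn(1, CHUNK_LEN, VIDEO_DIM)
        action_tokens = torch.randn(1, CHUNK_LEN, ACTION_DIM)

        for _ in range(num_denoise_steps):
            # One forward pass denoises video AND action tokens together;
            # all past chunks are seen only through the KV cache, so
            # history is never re-encoded.
            video_tokens, action_tokens = dit(
                video_tokens, action_tokens, kv_cache=kv_cache
            )

        kv_cache.append(video_tokens, action_tokens)  # cheap history extend
        outputs.append((video_tokens, action_tokens))
    return outputs
```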

Next, to solve the error-accumulation and hallucination problems inherent to autoregression, DreamZero introduces real-observation injection.

The model predicts the frames and actions for the next 1.6 seconds, and the robot executes them. The moment the actions finish, the camera's absolutely real image of the current physical world is captured, encoded, and written into the KV cache, overwriting the fake frame the model had just generated.

This step instantly severs the causal chain of error accumulation. The model is forced to reason about its next step while always standing on absolutely real physical ground.
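Put together, the closed loop might look like the following sketch. Every interface here (model, robot, camera, encoder, kv_cache) is a hypothetical placeholder, used only to show where the injection happens.

```python
def control_loop(model, robot, camera, encoder, kv_cache):
    while not robot.task_done():
        # Predict ~1.6 s of future frames plus the matching action chunk.
        predicted_frames, action_chunk = model.predict_chunk(kv_cache)
        robot.execute(action_chunk)

        # Ground truth arrives: encode the real camera frame and overwrite
        # the frame the model imagined. The causal chain of accumulated
        # error is cut here; the next prediction starts from reality.
        real_frame = camera.capture()
        kv_cache.replace_latest_frame(encoder.encode(real_frame))
```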

Finally, and most importantly, there is the problem of slow generation.

To meet the frequency requirements of robot control, NVIDIA developed DreamZero-Flash. Diffusion models are slow because inference requires a long denoising chain. If the number of steps is forcibly reduced (say, to one-step denoising), the quality of the generated actions collapses, because the image is still noisy and blurry and the model cannot extract precise actions from it.

DreamZero-Flash's solution is "decoupled noise scheduling". During training, it no longer keeps the video and the actions at the same noise level: it forces the model to predict completely clean, precise action signals while looking at extremely blurred, heavily noised visual frames. In effect, the model is trained to react correctly from physical intuition even when it cannot see the future clearly.

For a human this would be impossible: if you cannot see clearly, you cannot act. For the model, it seems to work perfectly. After this training, the model needs only a single denoising step at inference time to produce accurate actions, compressing inference from 350 milliseconds to 150 milliseconds.
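A rough training-step sketch of decoupled noise scheduling as we read it; the noise ranges, corruption schedule, and model signature are our assumptions, not values from the paper.

```python
import torch

def add_noise(x, t):
    # Linear interpolation toward Gaussian noise at level t in [0, 1]
    # (a flow-matching-style corruption; the paper's schedule may differ).
    t = t.view(-1, *([1] * (x.dim() - 1)))
    return (1 - t) * x + t * torch.randn_like(x)

def training_step(model, video_latents, actions, loss_fn):
    batch = video_latents.shape[0]

    # Heavy, independently sampled noise on the visual stream: the model
    # must act from "physical intuition", not a crisp render of the future.
    t_video = torch.rand(batch) * 0.9 + 0.1
    noisy_video = add_noise(video_latents, t_video)

    # The action stream stays (nearly) clean, so its target remains precise.
    t_action = torch.zeros(batch)
    noisy_actions = add_noise(actions, t_action)

    # Hypothetical signature: the model sees both streams and noise levels.
    pred_actions = model(noisy_video, noisy_actions, t_video, t_action)
    return loss_fn(pred_actions, actions)
```

Because the model learns to emit clean actions regardless of how noisy the video stream is, a single denoising step suffices at inference, which is where the 350 ms to 150 ms compression comes from.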

This lets the system output action chunks at about 7 Hz, which, combined with the low-level controller, yields reasonably smooth real-time execution.
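As a sanity check, the quoted figures are self-consistent:

```latex
% One-step inference takes ~150 ms per action chunk, so
\[
  f \;\approx\; \frac{1}{0.15\,\mathrm{s}} \;\approx\; 6.7\,\mathrm{Hz} \;\approx\; 7\,\mathrm{Hz},
\]
% while each chunk covers ~1.6 s of motion, so prediction runs roughly
% ten times faster than real time and the controller never starves.
```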

After this series of improvements, DreamZero demonstrates the formidable potential of a video-generated world model.

Most prominent is its generalization ability. In tests on the AgiBot dual-arm robot, researchers gave it tasks entirely absent from the training set: untying knotted shoelaces, removing a hat from a mannequin's head, painting with a brush.

A VLA trained from scratch would make essentially zero progress on these tasks; it could not even begin properly. DreamZero's average task progress reached 39.5%, and on some tasks (such as removing the hat) it was as high as 85.7%.

This is because DreamZero's learning process is itself subversive: during training, it jointly predicts video and actions, which forces it to build, in latent space, causal chains for how things evolve.