World Model Driven: Embodied Intelligence Bids Farewell to the Era of "Blind Action"
Embodied intelligence is undergoing a silent paradigm shift.
At the beginning of 2026, following the consecutive launches of its spatial perception model, large embodied model, and world model, Ant Lingbo open-sourced an embodied world model called LingBot-VA. LingBot-VA pioneered the "autoregressive video-action" world-modeling framework, enabling robots to "deduce while acting" as humans do. To date, its task success rate is 20% higher than that of the internationally top-tier Pi-0.5. Motus, jointly open-sourced by Shenshu Technology and Tsinghua University, achieved the "see-think-act" closed loop for the first time; in tests on 50 general tasks, its absolute success rate was 35% higher than Pi-0.5's.
Almost at the same time, teams from Stanford, NVIDIA, and elsewhere jointly released Cosmos Policy, showing that robot actions can be learned with only a video generation model; NVIDIA then released DreamZero, which learns physics and skills by jointly predicting future videos and the corresponding actions, enforcing strong alignment between visual planning and motion commands.
The academic community echoes these research directions. Yann LeCun, a Turing Award winner and deep learning pioneer, holds that only when AI can "predict the future" as humans do can it carry out complex planning. A divergence of technical routes has also emerged: the Sim-to-Real simulation school and the "Internet data + real-world data" school are advancing in parallel, and the decoupling of software and hardware toward "one brain, multiple machines" is accelerating. A series of open-source models at home and abroad now provide a reusable, verifiable technical foundation for the new paradigm of embodied intelligence research.
01. "Deduce while acting": Solving the "long - term drift" problem of embodied intelligence
The world model originates from research on "hypothetical thinking" in cognitive psychology. Its core goal is to enable intelligent agents to construct an internal representation of the environment and to predict how their own actions will change its state.
Early world models focused on compressing and predicting perceptual signals such as video frames. Modern embodied world models are aligned directly with the rules and constraints of the physical world. A well-trained world model can therefore not only predict the next frame but also understand how a thrown ball will fall and whether liquid will spill when a robot arm picks up a cup.
This "ability to predict the future" is precisely the prerequisite for complex planning emphasized by Yann LeCun. At the beginning of 2026, this theoretical concept began to be transformed into a verifiable technological reality.
LingBot-VA was the first to propose the "autoregressive video-action" world-modeling framework, deeply integrating large-scale video generation models with robot control. While generating the "next world state," the model directly deduces and outputs the corresponding action sequence, enabling robots to "deduce while acting" as humans do.
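As a rough illustration of this interleaving, the sketch below alternates between observing, jointly predicting the next world state and an action, and executing that action. Every class and method name here is a hypothetical stand-in, not LingBot-VA's actual interface.

```python
# Hedged sketch of an "autoregressive video-action" control loop:
# the model predicts the imagined next world state and the action
# that produces it in one autoregressive step. All names here
# (encode_text, encode_frame, predict_next, ...) are hypothetical
# stand-ins, not LingBot-VA's actual interfaces.
class VideoActionLoop:
    def __init__(self, model, horizon=16):
        self.model = model      # joint video-action world model
        self.horizon = horizon  # control steps per episode chunk

    def run(self, camera, robot, instruction):
        history = [self.model.encode_text(instruction)]
        for _ in range(self.horizon):
            # Observe: fold the latest real frame into the context.
            history.append(self.model.encode_frame(camera.read()))
            # Deduce: jointly predict the next world-state tokens
            # and the action sequence that should bring them about.
            next_state, action = self.model.predict_next(history)
            # Act: execute while keeping the imagined state in context,
            # so the robot "deduces while acting."
            robot.execute(action)
            history.append(next_state)
```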
Motus was the first to unify five mainstream embodied base-model paradigms in one framework: the VLA (Vision-Language-Action) model, the world model, the video generation model, the inverse dynamics model, and the video-action joint generation model. It thereby constructs a unified modeling path connecting perception, reasoning, and action.
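For readers keeping the five paradigms apart, the table-style sketch below expresses each as a different conditioning pattern over the same (video, language, action) space. This framing is an editorial reading aid, not Motus's published formalism.

```python
# Editorial illustration: the five embodied paradigms seen as different
# input -> output mappings over one (video, language, action) space.
# This framing is a reading aid, not Motus's published formalism.
PARADIGMS = {
    #  paradigm                 conditions on                 predicts
    "VLA":                    (("video", "language"),       ("action",)),
    "world model":            (("video", "action"),         ("future video",)),
    "video generation":       (("video", "language"),       ("future video",)),
    "inverse dynamics":       (("video", "future video"),   ("action",)),
    "video-action joint":     (("video", "language"),       ("future video", "action")),
}
```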
Cosmos Policy from NVIDIA and Stanford offers a different technical path. Its core strength is planning: it can predict the consequences of actions more accurately. Faced with a difficult task, the model does not commit to a single action. Instead, it plans in four steps, as the sketch below illustrates: the planning model first proposes N candidate action sequences; the world model then imagines the future scene each candidate would produce; a value function scores those imagined futures; and finally the highest-scoring action is selected and executed.
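A minimal sketch of that four-step loop, with propose_actions, world_model, and value_fn as assumed stand-ins rather than Cosmos Policy's real API:

```python
# Minimal sketch of the four-step planning loop described above.
# propose_actions, world_model, and value_fn are assumed stand-ins,
# not Cosmos Policy's actual API.
def plan_and_act(robot, observation, propose_actions, world_model, value_fn, n=8):
    best_score, best_plan = float("-inf"), None
    for actions in propose_actions(observation, n):          # 1. propose N sequences
        future = world_model.rollout(observation, actions)   # 2. imagine each outcome
        score = value_fn(future)                             # 3. score the futures
        if score > best_score:
            best_score, best_plan = score, actions
    robot.execute(best_plan)                                 # 4. execute the best plan
```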
Experiments show that in highly challenging tasks, this model-based planning succeeds far more often than directly executing actions, raising the success rate on complex tasks by 12.5%.
While Cosmos Policy is setting new records overseas, LingBot-VA in China is also leaving old records behind.
In real-machine evaluations, LingBot-VA needs only 30-50 demonstrations to adapt to long-horizon tasks such as making breakfast and picking up screws, high-precision operations such as inserting test tubes and unpacking express parcels, and deformable-object manipulation such as folding clothes and trousers. Its success rate is 20% higher on average than Pi-0.5's. In simulation, it pushed the success rate above 90% for the first time on the dual-arm collaboration benchmark RoboTwin 2.0 and reached 98.5% on the long-horizon lifelong learning benchmark LIBERO, breaking industry records on both.
Recently, when COOWA Technology launched COOWA WAM 2.0, it explained the importance of reasoning from another perspective.
Over the past decade, the success of deep learning has rested mainly on self-supervised learning driven by large-scale real-world data. The vocabulary and grammar of language systems, though vast, are ultimately finite, and most new samples fall within the existing semantic manifold. In the physical world, however, the combinations of states and interactions are nearly infinite, and the consequences of actions cannot be inferred from historical co-occurrence patterns alone.
Because the physical world is irreversible, embodied intelligence cannot run unlimited trial and error in reality the way AlphaGo was trained. General-purpose robots must acquire counterfactual reasoning: before performing an action, the robot rehearses in its "mind" what the world would look like if it did so. This is the significance of WAM.
The WAM proposed by DreamZero steps outside the VLA (Vision-Language-Action) framework. By jointly predicting future video frames and robot actions, it learns a prior over the world's physical dynamics from video, addressing at the root the problems of poor generalization of physical motion, dependence on repeated demonstrations, difficulty of cross-embodiment transfer, and lack of spatial perception and dynamics. COOWA Technology likewise advocates the shift from "action reproduction" to "planning and reasoning," completing the leap from imitator to thinker.
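One hedged way to read "jointly predicting future video frames and robot actions" is a shared training objective over both outputs, as sketched below. The loss form and weighting are assumptions, not DreamZero's released training code.

```python
# Hedged sketch of a joint video-action prediction objective: one model
# predicts future frames and the actions between them, keeping visual
# planning and motion commands aligned. The loss form and weighting are
# assumptions, not DreamZero's released training code.
import torch.nn.functional as F

def joint_video_action_loss(model, frames, actions, action_weight=1.0):
    # frames:  (batch, time, C, H, W) demonstration video clip
    # actions: (batch, time - 1, action_dim) commands between frames
    pred_frames, pred_actions = model(frames[:, :-1])
    video_loss = F.mse_loss(pred_frames, frames[:, 1:])  # next-frame prediction
    action_loss = F.mse_loss(pred_actions, actions)      # action prediction
    return video_loss + action_weight * action_loss
```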
02. Internet data + real-world data: A more difficult but more correct path
All signs indicate that moving from imitation and execution to thinking before acting is becoming the consensus direction for embodied intelligence, and a second point of agreement is emerging in the choice of data routes.
Consider the once-mainstream choice of Sim-to-Real (from simulation to reality): train massively in a virtual environment, then transfer the learned policies to real machines. The advantages are obvious: simulation data is cheap and allows unlimited trial and error. The bottleneck lies in simulation's "blind spots." Real-world physical details such as fluid dynamics, deformation of flexible objects, and sensor error are hard to model accurately in simulation, and closing those gaps may take longer than driving down the cost of real-world data collection.
The more fundamental problem is that the combinatorial complexity of the physical world is nearly infinite: the consequences of actions cannot be inferred from historical co-occurrence patterns alone, and cumulative errors keep amplifying over long-horizon decisions.
Shen Yujun, chief scientist of Ant Lingbo, put it bluntly: "Sim-to-Real is not our main technical route." Ant Lingbo's solution is Internet data plus real-world data.
"We found that using data from the physical world for an additional layer of pre - training is very helpful for improving the capabilities of embodied models," Shen Yujun said. This strategy has been verified on LingBot - VLA. Based on pre - training with over 20,000 hours of high - quality real - machine data from nine mainstream configurations, the model outperformed a series of internationally top - notch baselines in authoritative evaluations.
Cosmos Policy also provides key evidence. It is a SOTA robot control policy with a video generation model as its backbone, converting video generation ability into action control rather than pre-training on image-text pairs.
The team tested extensively in both simulated environments and the real world. On the LIBERO simulation benchmark, Cosmos Policy achieved a record-breaking average success rate of 98.5%. On real robots, it took on four high-difficulty tasks and performed well; in the hardest one, "putting candies into a sealed bag," it could accurately grasp the edge of the bag. This suggests that the best robot brain may be a video model that has "watched" thousands of video tapes.
At the same time, the combination of Internet data and real-world data makes the Scaling Law manifest in the physical world.
Research on LingBot-VLA shows that when training data expands from thousands of hours to the 20,000-hour level, the model's generalization ability takes a marked leap, for instance in success rates across tasks, objects, and environments. Embodied intelligence is no longer relying on manual parameter tuning and single-point demos; it is moving toward an engineering path of scalable training on a transferable base.
Motus's real-machine tests bear out the same law.
In the data-scaling experiment, compared with the internationally leading VLA model Pi-0.5, Motus can learn from a wider range of data types and effectively integrate the priors provided by more pre-trained base models. Averaged over 50 tasks, Motus achieved an absolute success-rate gain of 35.1% over Pi-0.5, while showing 13.55x data efficiency at the same performance level. By introducing richer and more heterogeneous multimodal priors, Motus forms general capabilities more efficiently under the Scaling Law.
The test results of LingBot-VLA, Motus, and COOWA WAM all point to one thing: the feedback loop with the real world is indispensable. This also explains why several large models have recently adopted the "deduce while acting" mechanism: rather than rote-memorizing simulation data, they are trying to understand physical laws.
03. Open source paving the way: Has the "Android moment" of embodied intelligence arrived?
Technological breakthroughs are only the first half; the reconstruction of the industry's division of labor is the endgame.
The traditional robot industry is trapped in the dilemma of "reinventing the wheel": every new task or new body type requires collecting new data and retuning parameters, at enormous engineering cost. Razzaq, editor and CEO of MarkTechPost, one of Silicon Valley's AI industry news hubs, points out that this is precisely the core obstacle keeping robots from moving from pilot projects to large-scale deployment: "The hidden costs of repeated training and retraining have been seriously underestimated."
"One brain, multiple machines" and the decoupling of software and hardware are breaking this deadlock.
"We focus more on the development of basic models. From the beginning, we firmly chose the cross - configuration path and cooperated deeply with relevant data providers in the industry to meet the need for data diversity in model training," said Zhu Xing, the CEO of Ant Lingbo.
Currently, Ant Lingbo's LingBot-VLA base has been adapted to nine mainstream robot configurations, proving the feasibility of cross-body transfer. A few general embodied brains can now drive many types of robotic arms, chassis, and dexterous hands; small and medium-sized hardware manufacturers can focus on actuator precision and durability and obtain software capability by calling the base. The R&D paradigm has shifted from every company training from scratch to adapting on top of a shared base, as sketched below.
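A schematic of "adapting on the base": freeze the shared brain and train only a small embodiment-specific head. Class names and layer sizes here are illustrative assumptions, not Ant Lingbo's actual post-training toolchain.

```python
# Illustrative sketch of "adapting on the base": a shared pre-trained
# embodied brain stays frozen, and only a small per-robot action head
# is trained. Names and layer sizes are assumptions, not Ant Lingbo's
# actual post-training toolchain.
import torch.nn as nn

class EmbodimentAdapter(nn.Module):
    def __init__(self, base, feature_dim, action_dim):
        super().__init__()
        self.base = base                       # shared general brain (frozen)
        for p in self.base.parameters():
            p.requires_grad = False
        self.head = nn.Sequential(             # small robot-specific head
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, observation):
        features = self.base(observation)      # general perception/reasoning
        return self.head(features)             # embodiment-specific commands
```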
This trend strongly resembles the "Android moment" of smartphones: an operating-system layer unifies software and hardware interfaces, and the application layer releases innovative vitality. The difference is that the operating system of embodied intelligence has not yet been settled, and open source has become the key variable accelerating convergence.
Ant Lingbo's strategy is "saturated open-sourcing": within a single week it released four core models, namely LingBot-Depth (spatial perception), LingBot-VLA (intelligent base), LingBot-World (world model), and LingBot-VA (embodied world model). With an efficient post-training toolchain, hardware manufacturers can adapt the "brain" to their own "bodies" at lower data volume and GPU (graphics processing unit) cost. Zhu Xing's logic is clear: "In the early stage, before the route converges, open source is the optimal way to push the industry forward."
The international open-source ecosystem has responded in kind. Cosmos Policy, jointly released by NVIDIA and Stanford, has open-sourced its model and code; NVIDIA's DreamZero has released model weights, inference code, and the runners for real-world and simulation benchmarks; and Google has opened the Genie 3 experience platform.
Across the global open-source community, these efforts resonate with Ant Lingbo's models and Tsinghua-Shenshu's Motus, jointly building the technical infrastructure for "world-model-driven embodied intelligence."
"The system has been verified on the Franka Research 3. LingBot - VLA demonstrates how to combine video - based prediction with robots like Franka, enabling machines to learn, adapt, and reliably perform complex tasks," announced the robot company Franka through its official account.
In China, LingBot-VLA has been adapted by robot manufacturers such as Xinghaitu, Songling, and Leju, verifying the model's cross-body transfer capability on robots of different configurations.
"The real world is complex, full of physicality and causality, which is hardly touched by today's large language models (LLMs). True intelligence must be able to conduct deductions in the 'mind' like humans. Only when AI has the ability to predict the future can it conduct complex planning," Yann LeCun has pointed the way.
To turn this vision into reality, the industry needs path-breakers. From NVIDIA and Stanford's Cosmos Policy and DreamZero to Tsinghua's Motus, the open-source community is jointly tackling the hardest problems and paving the broadest road. Through global collaboration, the infrastructure of embodied intelligence is taking shape at an accelerating pace.