Wang Xiaogang and His "World Model": One Person Controls Ten Dogs, and Quadruped Robots Take to the Streets to Work First | Exclusive Interview by Intelligent Emergence
Text | Fu Chong
Editor | Su Jianxun
Four days ago, "Daxiao Robot" posted a video on Xiaohongshu titled "Teacher Wang Xiaogang Has Ten Dogs".
In the video, Wang Xiaogang, chairman of Daxiao Robot and a co-founder of SenseTime, stood behind ten robotic dogs of different forms. With no remote control in hand, he waved and said, "The task has been issued. Set off."
The robotic dogs responded immediately: some headed onto the road to look for illegally parked vehicles, photographing them and sending the images back; others went to urban no-fly zones to detect illegal drone signals and issued voice warnings once they found the operators.
"In the past, one dog might need two or three staff members to 'attend' to it. In the future, one person in a remote control room will be able to manage a whole team," Wang Xiaogang said.
At Daxiao Robot's press conference on December 18, Wang Xiaogang also laid out application scenarios for the Daxiao robotic dogs: they can serve as robotic "urban management officers" on street patrols. The company is currently discussing this new urban governance plan with the Xuhui Public Security Bureau.
△ Quadruped robotic dogs setting off on a mission like the rescue squad in a popular children's cartoon: the bodies come from different brands, and each carries Daxiao's Embodied Super Brain Module A1 on its back. Image source: Provided by the company
Wang Xiaogang credits two new releases with making the dogs "suddenly able to work":
One is the Embodied Super Brain Module A1, an AI "brain" that can be mounted on bodies from different brands such as Unitree, Zhipu, and Deep Robotics. With the A1 module installed, robotic dogs that previously had only locomotion capability also gain "spatial intelligence" and "autonomous decision-making".
What drives this brain is the other release: the "Kaiwu" World Model 3.0. Simply put, a world model builds the operating laws of the physical world into the AI model; with it, the ability to interact with the world is effectively written into the robot's brain.
This lets the robot learn new tasks in the physical world faster and adapt to environments it has never visited. Once it has learned to "open a door", it can open both the front door at home and the door of a restaurant it is entering for the first time.
The world model also carries over to different robots: machines of various configurations, from quadruped dogs to bipedal humanoids, can all use it to understand the world and predict what happens next.
The world model did not come out of thin air, though. Its rise points directly at the fundamental bottleneck that the mainstream approach to embodied intelligence over the past year, the VLA model, has run into:
A VLA is more of a "super imitator": it relies on large amounts of paired image-instruction-action data to teach the robot specific skills, but it struggles to truly understand physical laws, so its success rate drops whenever the environment or the object changes.
A VLA therefore has to pile up data so the model "sees" enough cases to handle more and more tasks, but the data supply cannot keep up. Autonomous driving can easily accumulate millions of hours of driving data, whereas embodied intelligence still depends on staff teleoperating robots to collect data and remains stuck at roughly 100,000 hours.
A world model lets the robot's brain shift from "rote-memorizing worked examples" to "mastering the general formula", sharply reducing its dependence on specific scenarios and on large volumes of real-robot data.
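To make the contrast concrete, here is a minimal, purely illustrative Python sketch (not Daxiao's code; every name and the toy "physics" are invented): a VLA-style policy looks up the action recorded for the most similar paired example, while a world-model agent scores candidate actions by predicting their outcomes and picks the one closest to the goal.

```python
# Toy contrast between a VLA-style imitator and a world-model planner.
# Purely illustrative; states, actions, and models are placeholder floats.

def vla_policy(observation, instruction, paired_data):
    """Return the action recorded for the most similar (observation, instruction) pair.
    Works only when the new case resembles something already in the data."""
    best = min(paired_data, key=lambda ex: abs(ex["obs"] - observation)
               + (0 if ex["instruction"] == instruction else 1e9))
    return best["action"]

def world_model_plan(state, goal, predict_next, candidate_actions):
    """Score each candidate action by *predicting* its outcome with the world model,
    then pick the action whose predicted next state lands closest to the goal."""
    return min(candidate_actions, key=lambda a: abs(predict_next(state, a) - goal))

# Example usage with a trivial "physics" rule: next state = state + action.
paired = [{"obs": 0.0, "instruction": "reach 1", "action": 1.0}]
print(vla_policy(0.2, "reach 1", paired))                                     # imitates the stored demo
print(world_model_plan(0.2, 1.0, lambda s, a: s + a, [-1.0, 0.5, 0.8, 1.0]))  # plans: picks 0.8
```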
At the press conference, "Intelligent Emergence" tried the "Kaiwu" World Model 3.0 on site: type in a text description, select the camera position, the robot body, and other settings, and the world model generates action footage from that robot's first-person perspective.
This generated footage and the accompanying action decisions can teach the robot's brain how to interact with the physical world, directing the robot through each action behind the scenes.
△ In the on-site trial, the "Kaiwu" World Model 3.0 generates footage from the text description of the space and actions entered on the right. Image source: Taken by the author
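The on-site workflow described above (text prompt plus camera position and robot body in, first-person footage out) could be wrapped in an interface shaped roughly like the sketch below. This is a guess at what such a call might look like, not Kaiwu's actual API; every name here is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical request shape for a world-model generation call.
# NOT the real "Kaiwu" interface; it only mirrors the inputs described in the trial.

@dataclass
class GenerationRequest:
    prompt: str          # text description of the space and the action
    camera: str          # e.g. "first_person" or "third_person"
    embodiment: str      # which robot body the footage should be rendered for
    num_frames: int = 16

def generate_rollout(request: GenerationRequest) -> list[bytes]:
    """Placeholder: a real system would return first-person frames plus
    the action decisions used to produce them."""
    raise NotImplementedError("illustrative stub only")

req = GenerationRequest(
    prompt="walk down a corridor and stop in front of the red door",
    camera="first_person",
    embodiment="quadruped_dog",
)
```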
This is why the world model has recently become a hot technology trend. In recent technical talks, more and more intelligent driving and embodied intelligence companies, Tesla included, have shown progress on their world model efforts.
But Wang Xiaogang also stressed that for a world model to be truly useful, there has to be a closed loop of downstream verification.
He recalled that in November 2024 he led the release of a world model for intelligent driving, but at the time the industry was "not very confident" about the technology.
The reason was that many players, NVIDIA's Cosmos world model included, then treated the world model as a "data generator". They could generate piles of seemingly valid scenario footage in the lab, but there was no downstream real-world verification; no one could answer whether the data was actually useful, so it was hard to build trust.
Wang Xiaogang's answer was to fold the released world model into his own intelligent driving algorithm business. In the cooperation with SAIC IM, for example, this capability was used to tackle high-risk interactive scenarios such as navigating roundabouts and being cut off by large vehicles.
In the past, collecting such data was dangerous and expensive, and sometimes even required coordinating "actor cars" on the road to recreate the scenarios. SenseTime can first generate large volumes of scenario footage and candidate strategies in the world model, then use SAIC IM's real cars to test and calibrate the world model's decisions, so the model grows more accurate through real feedback.
Applying the same methodology to embodied intelligence, Daxiao chose "robotic dogs on the street" as its first stop for commercialization: quadruped hardware is more mature and the path into real scenarios is shorter, so the world model's abilities can be verified during task execution and iterated continuously in the field.
Wang Xiaogang also laid out Daxiao's commercialization roadmap: first, put quadruped robots to work on the streets and explore the incremental market that quadrupeds have yet to open up; within 2-3 years, extend the business to unmanned logistics warehouses with wheeled dual-arm robots; and later, move to bipedal humanoids and more complex home scenarios.
Daxiao is not starting from scratch in this process. SenseTime's eleven years of accumulation give Daxiao Robot reusable resources for commercial deployment.
For example, SenseTime's "Ark" vision platform already powers a large number of urban event-detection applications, which lets Daxiao move quickly into scenarios such as security and inspection. SenseTime's presence in overseas markets also offers Daxiao Robot a ready-made channel for selling to other countries in the future.
Recently, "Intelligent Emergence" conducted an exclusive interview with Wang Xiaogang, discussing his judgment on the world model and the technical details of Daxiao. The following dialogue has been sorted out by the author.
△ Wang Xiaogang, the chairman of Daxiao Robot. Image: Provided by the enterprise
Track Upgrade: From VLA to World Model
Intelligent Emergence: Regarding the "upgrade" from VLA to the world model, do you think it is a gradual evolution in the same technological direction or a major turning point?
Wang Xiaogang: The line is continuous. I see the world model, end-to-end learning, and reinforcement learning as different stages along the same technical thread.
From autonomous driving to embodied intelligence, the core is to let the model understand and predict how the real world evolves, and then use that ability for decision-making and control.
What has changed in the industry is that people have started to treat "whether the model can run in a closed loop in the physical world" as the primary question, rather than just performing a few demonstration moves.
You can also see, from details Tesla has recently disclosed, that its world model is used as a simulator. That is where the technology has arrived today.
Intelligent Emergence: You said you led the release of a world model last November, but at the time people "didn't believe" in world models. Later, SenseTime used SAIC IM's intelligent driving business for verification. What exactly was verified?
Wang Xiaogang: SAIC IM picks high-risk, high-complexity scenarios to verify our world model's abilities, such as interactive problems like navigating roundabouts and being cut off by large vehicles.
In the past, collecting real data for those scenarios was dangerous and costly, and sometimes even required hiring "actors" to stage the scenes. With the world model, we can generate more data and strategies for such scenarios, which helps intelligent driving get better at the corresponding tasks.
Intelligent Emergence: What shortcomings of the VLA does the world model solve?
Wang Xiaogang: A VLA leans toward learning short-horizon actions and skills; it usually carries no injection of complex physical laws or long-chain reasoning. Lacking a structured understanding of the physical world, it is also prone to actions that look right but don't actually work.
The world model's goal is more ambitious: it learns the laws of the environment and of interaction, supports prediction, reasoning, and planning, and can generalize across tasks and scenarios.
For example, after a VLA learns to open the door of a white refrigerator, it may fail to recognize a black one. A world model understands how a refrigerator door opens, so even in a different room or with a refrigerator that looks very different, it still knows the physics involved.
We also want to push the world model onto the edge as much as possible, which shortens the robot's loop from thinking to execution.
Intelligent Emergence: Why do you emphasize that "the world model should be combined with reinforcement learning"?
Wang Xiaogang: Reinforcement learning is good at finding strategies in environments where repeated trial and error is possible, but the cost of trial and error in the real world is too high. So part of the trial and error and the deduction can be moved into the world model, and the resulting strategies transferred back to the real machine.
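A schematic of what "moving trial and error into the world model" can look like, assuming a generic learned dynamics model and a crude policy search (all components are toy stand-ins, not Daxiao's stack):

```python
import random

# Sketch of reinforcement learning inside a learned world model ("in imagination"),
# then transferring the chosen behavior to the real robot. Everything is a toy stand-in.

def world_model_step(state, action):
    """Learned dynamics stand-in: predicts the next state and a reward."""
    next_state = state + action
    reward = -abs(next_state - 1.0)   # goal: drive the state toward 1.0
    return next_state, reward

def train_in_imagination(episodes=200):
    """Crude policy search done entirely in the model: try random actions,
    keep whichever accumulates the most imagined reward over a short rollout."""
    best_action, best_return = 0.0, float("-inf")
    for _ in range(episodes):
        action = random.uniform(-1.0, 1.0)
        state, total = 0.0, 0.0
        for _ in range(5):                      # imagined rollout, no real robot needed
            state, reward = world_model_step(state, action)
            total += reward
        if total > best_return:
            best_action, best_return = action, total
    return best_action

policy_action = train_in_imagination()
# Only now would the learned behavior be executed and calibrated on the real machine.
print(f"action selected purely in imagination: {policy_action:.2f}")
```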
Intelligent Emergence: What is the difference between Sora, a generative world model, and the embodied world model launched by Daxiao?
Wang Xiaogang: Sora is an excellent video generator, but in essence it is a "black box". The videos it produces may look very realistic and cool, but the model inside does not understand the physical relationships or causal laws between the objects in the video.
Sora cannot break the scene down into objects that can be interacted with or swapped out for editing. In its frames, the bottle, the table, and the surroundings are fused into a single "background": you cannot pull the bottle out on its own, change its position, and have it genuinely interact with other dynamic objects.
The embodied world model is built to solve a different kind of problem: not generating a nice-looking video, but enabling the robot to reason, plan, and make decisions in the real world.
Say there is a pile of building blocks on the table and you ask the world model to have the robot arrange them into the three letters "ACE" as quickly as possible. The robot first has to understand the position, shape, and movability of each block, then work out an efficient moving sequence: which block to move first, which later, and what grasp to use, so the task is finished in the fewest steps.
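As a toy illustration of the kind of plan the block example calls for, here is a greedy nearest-assignment sketch: each block is matched to the closest unfilled target slot and the moves are ordered cheapest first. It is hypothetical and deliberately simplistic (no shape, grasp, or collision reasoning), not the actual planner.

```python
# Toy planner for the "arrange blocks into target positions" example.
# Greedy nearest-assignment; a real system would also reason about shape,
# graspability, and collisions, all of which are omitted here.

def plan_moves(blocks, targets):
    """blocks, targets: dicts mapping names to (x, y) positions.
    Returns an ordered list of (block, target_slot, distance) moves."""
    moves = []
    remaining = dict(targets)
    for name, (bx, by) in blocks.items():
        slot, (tx, ty) = min(
            remaining.items(),
            key=lambda kv: (kv[1][0] - bx) ** 2 + (kv[1][1] - by) ** 2,
        )
        moves.append((name, slot, ((tx - bx) ** 2 + (ty - by) ** 2) ** 0.5))
        del remaining[slot]
    return sorted(moves, key=lambda m: m[2])   # do the cheapest moves first

blocks = {"b1": (0, 0), "b2": (3, 1), "b3": (1, 4)}
targets = {"A_top": (0, 1), "C_left": (3, 0), "E_mid": (1, 3)}
for block, slot, dist in plan_moves(blocks, targets):
    print(f"move {block} -> {slot} (distance {dist:.1f})")
```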
Intelligent Emergence: So what abilities does Daxiao's world model provide to help embodied intelligence perform tasks better?
Wang Xiaogang: The embodied world model we built needs three multimodal abilities:
First, multimodal understanding: understanding the world itself, not just the content of a video but also deeper quantities such as camera pose, 3D trajectories, and mechanical properties;
Second, multimodal generation: being able to generate trainable data and scenarios, such as swapping the background, the body, or the robotic arm in a generated scene;
Third, multimodal prediction: for example, given the instruction "pick up the phone", it should predict that using the left hand versus the right hand will produce different action trajectories.
Our platform also lets users choose different robot bodies. Ultimately you want the robot to "go to work": when you generate simulation data and build training scenarios, they have to correspond to a specific body for the world model to plug into the downstream training closed loop.
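One way to read the three abilities above is as three methods on a single model interface. The sketch below is an editorial paraphrase under that assumption, not Daxiao's actual API; every name and type in it is hypothetical.

```python
from typing import Protocol

# Hypothetical interface expressing the three multimodal abilities described above.
# Not a real SDK; argument and return types are deliberately loose placeholders.

class EmbodiedWorldModel(Protocol):
    def understand(self, video: bytes) -> dict:
        """Multimodal understanding: recover camera pose, 3D trajectories,
        mechanical properties, etc. from raw observation."""
        ...

    def generate(self, scene: dict, edits: dict, embodiment: str) -> bytes:
        """Multimodal generation: produce trainable data, e.g. swap the
        background, the body, or the arm in a generated scene."""
        ...

    def predict(self, state: dict, instruction: str, embodiment: str) -> list[dict]:
        """Multimodal prediction: roll out the action trajectories that an
        instruction implies for the chosen body (e.g. left hand vs right hand)."""
        ...
```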
Intelligent Emergence: How do you judge whether a world model is good or not?
Wang Xiaogang: There are benchmarks in the industry, but I care more about influence and the ability to solve problems in real applications.
Looking at leaderboard rankings alone is not enough. What matters is whether it can be combined with the robot system, be used widely on real problems, and keep iterating. We will also open-source the world model for everyone to use; broad adoption and problem-solving ability are themselves a more rigorous evaluation.
△ Robotic dogs equipped with Daxiao's module can recognize red lights at intersections and navigate and avoid obstacles autonomously. Image source: Provided by the company
The Data Methodology of the World Model
Intelligent Emergence: What kind of architecture does the "Kaiwu" World Model 3.0 have? Where does the training data come from?
Wang Xiaogang: We divide the architecture into three levels, and different levels collect different data.
1) The lowest level is the description of the world. For example, why does a ripe apple fall? What physical laws sit behind that? These descriptions of the world's physical laws are all in text form.
2) The second level is human behavior, that is, how humans interact with the physical world. The model needs to understand how poses change during interaction with the physical world, what forces are applied, what the tactile feedback is, and so on.
This data is collected with humans as the subject: people wear head-mounted cameras to shoot first-person video, wear data-collection gloves to capture hand movements, and are filmed by surrounding cameras from the third-person perspective, all to record human actions.
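The human-centric collection described here (head-mounted first-person video, data gloves, surrounding third-person cameras) maps naturally onto a record like the one sketched below. The field names are invented for illustration and do not come from Daxiao.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one human-interaction recording, mirroring the
# collection setup described above. Field names are illustrative only.

@dataclass
class HumanInteractionClip:
    egocentric_video: str                                          # head-mounted, first-person footage
    third_person_videos: list[str] = field(default_factory=list)   # surrounding third-person cameras
    glove_trajectory: str = ""                                      # hand poses captured by the data glove
    force_readings: str = ""                                        # applied forces, if recorded
    tactile_readings: str = ""                                      # tactile signals, if available
    task_description: str = ""                                      # text label of what the person is doing

clip = HumanInteractionClip(
    egocentric_video="clips/0001/head_cam.mp4",
    third_person_videos=["clips/0001/cam_left.mp4", "clips/0001/cam_right.mp4"],
    task_description="open a refrigerator door and take out a bottle",
)
```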