What can intelligent driving take from the "extreme remarks" of Unitree Technology's Wang Xingxing?
"The VLA model is a relatively simplistic architecture."
On August 9, 2025, at the 2025 World Robot Conference held in Beijing, Wang Xingxing, the founder, CEO, and CTO of Unitree Technology, said this during his speech.
Although he was speaking about large models for embodied intelligence, the remark is startling for intelligent driving, where VLA is currently the hottest model direction.
Huang Guan, the CEO of Jijia Vision, even criticized the view as "too amateurish."
Wang Xingxing believes that the world model might be a better technical direction. However, over the next 2 to 5 years, "the most prominent one will definitely be an end-to-end embodied intelligence AI model."
At the conference, he analyzed the development trend of embodied intelligent robots from three angles: core bottlenecks, emerging technology engines, and future technology focuses. Let's see what this well-known figure's views can offer.
Core Bottleneck: The Model Isn't Good Enough
When it comes to why robots have not yet been widely deployed, many people assume the cause is insufficient hardware performance or high cost. Wang Xingxing disagreed: current robot hardware, including dexterous hands and the whole body of humanoid robots, is basically sufficient.
Many challenges remain in engineering implementation, but from a technical perspective the hardware already meets basic requirements and can support basic functions.
He believes the core bottleneck holding back large-scale application is the immaturity of large AI models for embodied intelligence.
Wang Xingxing believes that large robot models (embodied intelligence) are now at a stage similar to the 1 to 3 years before the release of ChatGPT: the industry has identified the direction and technical route but has not yet crossed the critical threshold.
In his view, the threshold has not been reached mainly because the industry pays too much attention to "data" while neglecting problems in the model itself.
The key problem, he argues, is that the model architecture is not yet good enough and lacks unity and generality, so capabilities are limited and the data cannot be fully exploited.
Taking the much-discussed VLA model as an example, Wang Xingxing considers it a "relatively simplistic architecture." In real-world interaction it depends too heavily on data quality and adapts poorly, so he is skeptical about its application prospects.
"VLA model + RL training" is another common optimization approach in the industry, but he believes practice has shown that this is still not enough. "The model architecture must be further upgraded and cannot stay at the level of simple combination," Wang Xingxing said.
Another constraint, in Wang Xingxing's view, is the lack of an "RL scaling law" in robot control, which has left robots unable to break the curse of "starting from scratch." When robots learn a new task today, such as a new dance or a new job, they often have to be trained from scratch, which significantly reduces training efficiency.
The ideal state of embodied intelligence, he said, is that "training for new tasks is based on the old foundation, with increasing speed and better results." This law has been fully verified in language models. In robot motion control it is still in its infancy, but it shows great potential and is a key area worthy of in-depth exploration by the industry.
New Technology Direction: Video Generation Model
If the VLA model is not good enough, which model is the way forward?
Wang Xingxing believes that at this stage, the route of the video generation model might be faster and have a higher probability of convergence than the VLA model.
The core logic is: use the video generation model to "simulate and generate a video of the robot's action sequence" in advance, and then directly guide the physical robot to perform the corresponding actions. For example, if the instruction is "tidy up the room," the model can first generate a virtual video of the robot tidying up the room, and then convert the actions in the video into control signals for the physical robot.
However, Wang Xingxing pointed out a practical problem with this route: current video generation models pay too much attention to "video quality," which leads to high GPU consumption. Robots do not need high-precision video; the output only has to be good enough to drive the actions. This contradiction still needs to be resolved.
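The generate-then-execute idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Frame` type, `generate_action_video`, and `frames_to_commands` are hypothetical names, the "video model" is a stub that fakes a smooth two-joint motion, and each frame is reduced to joint angles rather than pixels, in the spirit of the point that robots only need enough detail to drive the actions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One predicted frame, reduced to the robot's joint angles (radians)."""
    joint_angles: List[float]


def generate_action_video(instruction: str, steps: int = 5) -> List[Frame]:
    """Hypothetical stand-in for a video generation model.

    A real model would condition on the instruction and output images.
    This stub ignores the instruction and fakes a smooth motion of two
    joints so the rest of the pipeline can run end to end.
    """
    return [Frame([0.1 * t, -0.05 * t]) for t in range(steps)]


def frames_to_commands(frames: List[Frame]) -> List[List[float]]:
    """Turn consecutive frames into per-step joint deltas (control signals)."""
    commands = []
    for prev, curr in zip(frames, frames[1:]):
        delta = [c - p for p, c in zip(prev.joint_angles, curr.joint_angles)]
        commands.append(delta)
    return commands


frames = generate_action_video("tidy up the room")
commands = frames_to_commands(frames)
print(len(frames), "frames ->", len(commands), "commands")
```

In a real system the hard part is the step this sketch skips: recovering the robot's pose from generated images. The sketch only shows where that conversion would sit in the pipeline.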
Future Technology Focus: Model, Hardware, and Distributed Computing Power Network
Wang Xingxing predicts that in the next 2 to 5 years, the development of embodied intelligent robots will focus on three major directions:
First, a unified end-to-end large model for intelligent robots. The end-to-end model is the key to improving robot capabilities. Future research and development should emphasize it in order to achieve "rapidly learning new skills based on the existing training foundation" and improve the model's generality and efficiency.
Second, lower-cost, longer-life hardware and mass manufacturing. Hardware optimization is also indispensable. Even the automotive industry, after a century of development, still has to keep overcoming numerous engineering challenges. For humanoid robots, which may reach a scale of "millions or tens of millions" in the future, the engineering challenges of "low cost, long life" and "ultra-large-scale mass manufacturing" must be solved to support large-scale applications.
Third, a low-cost, large-scale distributed computing power network. The robot body is limited by its size and battery capacity and cannot carry large-scale computing power on board, because "its peak power consumption is usually only about 100 watts, equivalent to the computing power of several mobile phones."
Therefore, a distributed computing power network needs to be built. In an industrial scenario, for example, a local server cluster can be deployed in the factory so that about 100 robots connect to it nearby, reducing communication latency; in a civilian scenario (such as a residential area), a regional computing power cluster can be set up to reduce users' computing power construction costs while ensuring latency and security.
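As a rough illustration of the "connect nearby" idea, the sketch below picks the compute cluster with the lowest measured latency that still fits a latency budget. It is only a sketch under stated assumptions: `pick_cluster`, the cluster names, and all latency figures are hypothetical and do not come from the speech.

```python
from typing import Dict, Optional


def pick_cluster(latencies_ms: Dict[str, float], budget_ms: float) -> Optional[str]:
    """Return the lowest-latency cluster within the budget, or None if none fits."""
    within = {name: ms for name, ms in latencies_ms.items() if ms <= budget_ms}
    if not within:
        return None
    return min(within, key=within.get)


# Made-up round-trip latencies for one robot, in milliseconds.
measured = {"factory-local": 4.0, "regional": 18.0, "public-cloud": 70.0}

choice = pick_cluster(measured, budget_ms=20.0)
print(choice)
```

With these made-up numbers the local factory cluster is chosen; if no cluster fits the budget, the function returns `None`, and a real robot would have to fall back to its limited on-board compute.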
In an interview after the conference, some media asked about expected robot prices. Wang Xingxing replied that once robots can operate at scale, they may even be free because "each robot can pay taxes after leaving the factory."
He gave an example: whatever work a robot does, taxes can be deducted directly from the value it creates. If an enterprise sends robots to reclaim and cultivate a piece of barren land, for instance, part of the value the robots create will be converted directly into taxes.
"This process may take 2 to 3 years if it's fast, or 3 to 5 years if it's slow, but I think this wave (of development) will probably not exceed 10 years," Wang Xingxing said.
Wang Xingxing's speech has sparked considerable controversy. In the intelligent vehicle industry, VLA + RL is currently the hottest direction, and many enterprises such as Li Auto, XPeng, Huawei, and WeRide have adopted this route or a similar one. Huawei, NIO, Li Auto, and XPeng have also adopted the world model, though in different forms and roles: some use it only for simulation training, while others use it directly as the base model for autonomous driving.
Of course, the development logic of embodied intelligence may not be the same as that of intelligent driving. Wang Xingxing's opinion is just one person's view, and the competition among technical routes will still have to be settled in real-world practice.
This article is from the WeChat official account "Cyber-car" (ID: Cyber-car), author: Wang Lingfang, editor: Qiu Kaijun. It is published by 36Kr with authorization.