Wang Xingxing makes bold remarks: the whole world is learning from the G1, and it will still be a classic 20 years from now. He also reveals for the first time the secrets behind the Spring Festival Gala robots.
On March 17th, Zhidongxi reported that Wang Xingxing, founder, CEO, and CTO of Unitree Technology, presented the company's latest key advances in embodied intelligence at GTC 2026. He also shared his views on the core bottlenecks embodied intelligence faces in task generalization, data efficiency, and the scaling effect of reinforcement learning.
Wang Xingxing believes that although embodied intelligence has become one of the most closely watched technology tracks worldwide over the past two years, the industry is still far from its "ChatGPT Moment". The biggest challenge right now is that no embodied-intelligence model yet has strong enough generalization to execute tasks reliably in unfamiliar scenarios.
He predicts that this critical point could be reached in as little as one to two years, or at most two to three years.
Wang Xingxing emphasized that movement ability and task-performing ability must be advanced simultaneously, but the former remains a prerequisite for the latter. Without a rich enough repertoire of motions and sufficient stability, it is difficult for robots to truly enter factories, households, and other real-world scenarios.
However, in his view, what truly keeps the industry from crossing that critical point is no longer the performance of a single product or a single action, but systematic capability at the model level. Wang Xingxing argued that to reach the "ChatGPT Moment" of embodied intelligence, the industry still needs to solve at least three problems:
First, enhance the model's ability to express tasks and actions in order to break through the generalization bottleneck; second, improve the utilization efficiency of diverse data such as video, simulation, and real-robot data, reducing reliance on large-scale real-robot data collection; third, enable reinforcement learning to form a reusable, cumulative scaling effect.
In terms of technological routes, he is optimistic about world models and video-generation models, believing this route has a higher ceiling and more opportunity to exploit the vast amount of video and text data on the Internet.
Wang Xingxing thinks that if robots can one day complete 80% of tasks in 80% of unfamiliar scenarios given only spoken or written instructions, embodied intelligence will have truly entered its "ChatGPT Moment".
The following is a compilation of Wang Xingxing's speech content (Zhidongxi made some additions, deletions, and modifications without changing the original meaning):
01.
Twenty years from now, the G1 will still be a classic product
Unitree was founded in 2016, but I began working on quadruped robots around 2013, and on humanoid robots even earlier: in 2009, while still in college, I built a small humanoid robot.
In recent years, our company has developed several humanoid robot models in succession. One of the best known is the small humanoid robot G1, released in May 2024. In a sense it has become a global classic: many users at home and abroad are using it, and many other companies are studying and learning from its design.
The robot's most prominent features are its small size, compactness, and cost-effectiveness. It is about 1.3 meters tall, weighs a few tens of kilograms, has many degrees of freedom in its legs and hands, and carries a fairly complete sensor suite in a very compact package. Its usability and appearance are excellent by global standards. Even looking back ten or twenty years from now, it will still be a classic product.
Last year, we also released a medium-sized, industrial-grade robotic dog with strong performance specifications, capable of tasks such as indoor and outdoor inspection.
At the same time, we released the large humanoid robot H1, which stands 1.8 meters tall. It has better overall proportions, looks more human-like, and is quite agile. Of course, given its larger size, it is currently better suited to physical work, for example in factory and agricultural settings.
Some time ago, we also released the small robotic dog As2. It is basically waterproof, has strong load-bearing capacity (it can carry more than ten kilograms), and offers relatively long battery life. On these hard specifications, the product is currently at a globally leading level. We hope it can genuinely help people with practical tasks in the future; for example, hikers and travelers could let it carry their bags, making the trip much easier.
The reason we keep developing larger humanoid robots is that small robots are inherently limited in supporting strength, load capacity, and arm strength.
If robots are to enter factories and households, especially for more physically demanding work, their size, strength, and structure need to improve further. At the same time, while larger robots are more capable, they also demand higher safety standards.
Currently, these large humanoid robots can learn to perform some fairly complex engineering operations. However, because they are heavier and stronger, a sufficient safety distance must be maintained around them; at least two to three meters is safer. Otherwise, being struck by an arm or a leg could cause injury.
02.
For large-scale application of robots, stability must be good enough
We have done a lot of work on the motion performance of robots over the years.
Our humanoid robot H1 has achieved strong results in motion ability. For example, it can run 1,500 meters in about six minutes; an average person may not be able to keep up with it. Of course, its sprinting ability still needs work.
In addition to hardware, we made many software upgrades last year, such as automated control, resistance to impacts during any motion, and the ability to stand up autonomously after a fall. These technologies are very useful.
We believe that for robots to be deployed at scale in the future, the most important requirement is sufficient stability: even in extreme situations, the robot should be able to recover and stand up on its own.
Currently, our robots' algorithms adapt well to different hardware, so they generalize relatively well across models. In theory, robots can now attempt many of the actions humans can perform.
Of course, some particularly complex actions remain challenging, such as those involving large lateral forces or slippery surfaces. But overall, we aim to keep enhancing robots' motion ability.
Last year, we made many upgrades to the RL control model, covering basic running, dance moves, martial-arts moves, and rapid recovery and stable control after the robot is disturbed during any action. In the second half of last year, we also implemented fairly complete full-body teleoperation.
I think many of the core problems in full-body deep reinforcement learning have essentially been solved; the next step is continued refinement.
03.
Behind the Spring Festival Gala performance is not just single-action training, but the whole system's capability
In February this year, the Spring Festival Gala program we took part in received very enthusiastic feedback at home and abroad. To prepare for it, I went through almost all the traditional Chinese martial-arts moves I could find: I initially collected about a hundred, selected the more expressive ones that suited robots, and finally kept a few dozen, including classics such as drunken boxing, nunchaku, stick-dancing, and sword-dancing.
We also attempted some high-difficulty moves. For example, consecutive somersaults in place put a heavy load on the motors and legs. For the wall-climbing move, we likewise aimed higher: not just a single-step climb, but moves with more visual impact.
In the stick-dancing segment of the program, we used dexterous hands so the robot could grip the stick. In addition, larger humanoid robots performed special formations and displays at the branch venue, which was very interesting and meaningful.
We made some modifications to the robot for the program.
For example, we replaced the head lidar with a 128-line 3D lidar and adjusted its orientation so it could capture more of the surrounding environment; a lidar that only looks downward or to the side is easily blocked in scenes crowded with people and robots.
In addition, we used a pre-trained full-body RL model instead of training a separate RL model for each action. This approach composes actions better, is easier to train and debug, and is better suited to rapid movement, complex action combinations, and compatibility across different hardware.
Put simply, when performing complex actions, we can in theory make the robot stop instantly and then switch to the next action as soon as it has stabilized. With earlier technical routes, many single-action policies could not be paused and switched mid-action; the robot might fall if it stopped. Now it can stop stably and then switch, which helps greatly in debugging and combining complex actions.
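The "stop, stabilize, then switch" pattern described above can be sketched in a few lines. This is a hypothetical illustration, not Unitree's actual controller; the class name, thresholds, and the idea of gating the switch on base velocity are all assumptions made for this sketch:

```python
# Hypothetical sketch of "stop, stabilize, then switch" action switching.
# All names and thresholds are illustrative, not a real robot controller.

class StabilizeThenSwitch:
    """Runs one action policy at a time; only switches to a requested
    policy after the robot looks stable (low base velocity) for a few
    consecutive control steps."""

    def __init__(self, stable_steps_required=3, velocity_threshold=0.05):
        self.stable_steps_required = stable_steps_required
        self.velocity_threshold = velocity_threshold
        self.active_policy = None
        self.pending_policy = None
        self._stable_steps = 0

    def request_switch(self, new_policy):
        # The switch is deferred until the robot has settled.
        self.pending_policy = new_policy

    def step(self, base_velocity):
        # Count consecutive "stable" control steps.
        if abs(base_velocity) < self.velocity_threshold:
            self._stable_steps += 1
        else:
            self._stable_steps = 0
        # Commit the pending switch only once the robot has settled.
        if self.pending_policy and self._stable_steps >= self.stable_steps_required:
            self.active_policy = self.pending_policy
            self.pending_policy = None
            self._stable_steps = 0
        return self.active_policy
```

The point of the gate is exactly what the paragraph describes: a mid-action cut never happens while the robot is still moving fast, so each single-action policy always starts from a stable stance.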
We also developed a full-body state-perception model so the robot can perceive and make decisions better during actions, and we built a cluster control system that can coordinate dozens or even hundreds of robots through complex movements and formations.
04.
Movement and task-performing abilities must be advanced simultaneously
We have always believed that movement ability and task-performing ability are both very important and must be advanced simultaneously. In a sense, movement ability remains a prerequisite for task-performing ability.
For a robot to perform tasks, several conditions must hold. First, its repertoire of motions must be rich enough to cover a variety of actions; second, it must be stable enough while performing them. Without these two, there is no real task-performing ability to speak of.
Animals are similar. Ants, mice, and dogs may not have highly developed brains, but their movement abilities are very strong. So I think movement intelligence is, in a way, a comparatively easier ability to achieve and a necessary first step: build the physical capability first, then improve the "brain" and the task-performing model. That is the necessary path.
In the past few years, we have been promoting the idea of robots performing tasks, but objectively speaking, this is still very difficult globally.
We have always hoped that robots can produce other robots. So some time ago, we were also developing relevant models and trying to apply them to humanoid robots, allowing them to enter factories to produce humanoid robots. I think this is very interesting and meaningful.
Of course, for particularly complex workstations, such as assembling joint modules, the success rate is not yet high because of the many components and complex processes involved. But for tasks such as grasping a single component, or relatively simple actions involving one or two components, the success rate after training can approach 100%.
Globally, complex operations involving multiple processes, long task chains, and small components are still very challenging.
In addition, one technology we developed well in the second half of last year was full-body teleoperation. This capability is very practical, especially for large-scale data collection.
Of course, current teleoperation solutions still share some common problems worldwide. For example, when the robot moves, there is still a gap between its action completion and that of a real person, and in some complex actions the feet and body may shake, which affects the overall operating experience. These aspects need further improvement.
In terms of stability, however, the solution is already fairly mature. The publicly shown videos are not sped up; they are real-time recordings.
05.
To reach the "ChatGPT Moment", at least three key problems must be solved
If we want to discuss how embodied intelligence can reach the "ChatGPT Moment", I think there are at least several key problems.
First, improve the model's ability to express tasks, to break through the generalization bottleneck.
Many current models are not strong enough at "expression": they may only be able to perform basic actions. When asked to perform arbitrary actions, generate actions in real time, or execute more advanced, complex actions, the models still struggle to express them fully.
If a model cannot even express an action, it certainly cannot execute it with high quality. The model's motion-expression ability must therefore be strengthened; multimodal modeling, perception, and the model's own encoder and decoder all need further improvement.
Second, improve the model's utilization of diverse data.
Robots differ from language models: robot data is still very scarce. With real-robot data so limited, a model that can only be trained on large amounts of real-machine data is, I think, using data very inefficiently.
Therefore, during pre-training we should use as much video data, Internet data, and simulation data as possible to train the base model first, and then improve the utilization efficiency of real-robot data. That way, far less real-machine data is needed, yet the system still works.
Even with ten thousand robots and ten thousand people collecting data, the final result may not be good, because of problems such as data quality, hardware differences, and sensor differences. More robots do not translate linearly into better data. So I have always believed we need to raise data-utilization efficiency further, use as much video and simulation data as possible, and minimize reliance on large-scale real-machine data collection.
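The pre-train-then-fine-tune idea can be made concrete with a toy numerical example. This is entirely synthetic (pure NumPy, a one-parameter linear model) and is not Unitree's pipeline: the model is first fit on abundant but slightly biased "simulation" data, then corrected using only 20 "real" samples.

```python
# Toy illustration of pre-training on abundant cheap data, then
# fine-tuning on scarce real data. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Abundant "simulation" data: dynamics close to, but not exactly, reality.
x_sim = rng.uniform(-1, 1, 10_000)
y_sim = 1.8 * x_sim + rng.normal(0, 0.05, x_sim.size)   # sim gain is 1.8

# Scarce "real-robot" data: the true gain is 2.0.
x_real = rng.uniform(-1, 1, 20)
y_real = 2.0 * x_real + rng.normal(0, 0.05, x_real.size)

# Stage 1: pre-train (closed-form least squares) on simulation data.
w = float(np.dot(x_sim, y_sim) / np.dot(x_sim, x_sim))   # lands near 1.8

# Stage 2: fine-tune with a few gradient steps on the small real dataset.
for _ in range(200):
    grad = 2 * np.mean((w * x_real - y_real) * x_real)
    w -= 0.1 * grad

print(round(w, 2))  # close to the true gain of 2.0
```

Fitting the same model from scratch on 20 noisy points alone would be far less reliable; here the cheap data does most of the work and the scarce real data only supplies the final correction, which is the data-efficiency argument made above.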
Third, strengthen the scaling effect of reinforcement learning.
Today, once a robot's motion policy is trained, the data is often discarded, and training a new action means starting over. Ideally, this data should be collected into a unified model and continuously reused and accumulated, so that reinforcement learning also gains a "the more you train, the stronger it gets" scaling effect. If that can be achieved, the benefits of reinforcement learning will be very obvious.
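The accumulate-rather-than-discard idea can be sketched as a shared experience store. This is a hypothetical minimal illustration; the class and its interface are invented for this example, and a real RL pipeline would store full transition tuples and much richer metadata:

```python
# Minimal sketch (hypothetical) of keeping trajectories from every trained
# skill in one shared store, so later training can draw mixed batches from
# all earlier skills instead of discarding their data.
import random

class SharedExperienceStore:
    def __init__(self):
        self._data = {}          # skill name -> list of transitions

    def add(self, skill, transitions):
        self._data.setdefault(skill, []).extend(transitions)

    def skills(self):
        return sorted(self._data)

    def sample(self, batch_size, rng=random):
        # Draw a mixed batch across all skills collected so far.
        pool = [t for ts in self._data.values() for t in ts]
        return rng.sample(pool, min(batch_size, len(pool)))

store = SharedExperienceStore()
store.add("walk", [("s0", "a0", 0.1)] * 50)
store.add("dance", [("s1", "a1", 0.2)] * 50)
batch = store.sample(8)          # mixes walk and dance experience
```

Because every new skill's data lands in the same pool, each later training run sees strictly more experience than the last, which is the kind of cumulative scaling effect described above.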
06.
The world model or video-generation model is the future development direction
In recent years, many technical routes have emerged in embodied intelligence, such as the classic VLA model, as well as models based on video generation and video world models.
Personally, I think the more promising direction is the world model, or models based on video generation, because this route has a higher ceiling; in a sense, we still cannot see where its ceiling is.
The reason is simple: on this route, the robot model can make fuller use of the large-scale video and text data already on the Internet, rather than relying only on real-machine data it collects itself. Its data foundation is naturally much larger.