Embodied intelligence is booming in China, and it is no longer simply following in Musk's footsteps.

Digital Intelligence Frontline, 2025-07-02 16:48
Large models meet robots.

The first wave of embodied intelligence implementation in China has quietly begun, with different scenarios and technical routes from those overseas.

On the production line of a large household appliance factory in Shandong, several white robotic arms move precisely from one welding point to the next, assembling the metal frames of high-end washing machines. A few months ago, this process took engineers several days of manual debugging. Now, eight embodied-intelligence robotic arms are connected to a "digital brain" and can fully adapt to a new washing machine model within four hours.

"Household appliance manufacturers are highly receptive. These arms cost hundreds of thousands of yuan in total, and they have indeed improved production efficiency," a product staff member at Hualong Xunda told Digital Intelligence Frontline. The "brain" of this system is Huawei Cloud's Pangu multimodal large model, which handles task decomposition and planning, while the "cerebellum," developed in-house by Hualong Xunda on top of an open-source model, handles the specific operations. "Production-line data is scarce. Next, we need to let it learn while running in actual production to make it smarter."

Spot-welding scenario demonstration

Behind this is an attempt to rebuild flexible industrial manufacturing with embodied intelligence. At two conferences held around June, the Beijing Zhiyuan Conference and the Huawei Developer Conference, embodied intelligence was the focus. What participants saw were no longer robots performing single, repetitive movements, but "new species" that can gradually adapt to change, make decisions, and actively execute tasks. The industry is witnessing a leap in intelligence.

However, this leap is far from complete. Wang Zhongyuan, the dean of the Beijing Zhiyuan Research Institute, said that embodied large models are still in a technical exploration stage comparable to the period "before GPT-3". "Directions such as simulation data, reinforcement learning, and the integration of the 'brain' and 'cerebellum' are still being explored, and no unified methodology has been formed yet. There are still many hurdles to overcome for industrial implementation."

"Ours is not an industry that can float on hype," said Wang He, the founder and CTO of Galaxy Universal. "If we only tell stories and don't implement them, it will cause great harm to the industry in the long run. We need the academic and industrial circles to work together to really do several things well."

01 The Chinese manufacturing industry will witness the "embodied intelligence" transformation

The first wave of domestic industrial implementation has quietly begun in multiple manufacturing and service scenarios. These scenarios are more diverse, and in some cases more complex, than those pursued by overseas giants such as Tesla.

Video: a robotic arm installing a precision optical fiber.

In a demonstration jointly developed by Huawei Cloud and Huawei's Manufacturing Department, a dual-arm robot completes the last step of "color box packaging" for mobile phones. This step is still done entirely by hand today, and the team is trying to hand it over to embodied intelligence.

"There is not only a mobile phone in the color box, but also a user manual, earphones, a charger, etc. Since the incoming materials on the production line arrive in no fixed order and the accessories are placed differently every time, the assembly steps are different each time," a Huawei Cloud staff member explained. What the team is exploring is a system that can understand the environment, plan actions, and execute decisions.

Why is "flexible" manufacturing so crucial? Gao Yang, the co-founder of Qianxun Intelligence, gave an explanation: "Currently, the annual shipment volume of industrial robots is only 540,000. Why is it so small? Because they are not easy to use. After each robot enters the factory, it needs to be programmed for 2-3 months." In other words, the "intelligence" of these robots is hand-programmed by people.

A similar problem occurs in the automotive industry. Although stamping and painting workshops are highly automated, once the vehicle model changes, it takes at least six months to retool the production line. "If embodied intelligence can automatically adjust production parameters according to the vehicle model and work flexibly like a human being, it will greatly shorten the cycle," a Huawei Cloud staff member said.

For this reason, Kuka Robotics, a subsidiary of Midea, has begun to reserve computing power interfaces in the cabinets of robotic arms, preparing in advance for "embodied intelligence".

Embodied intelligence is not only being implemented in industry but also entering daily life scenarios.

"When you place an order for medicine on a certain platform, it is very likely that our humanoid robots are preparing the goods." Wang He, the founder and CTO of Galaxy Universal Robots, showed a video of a robot operating in a 24-hour pharmacy: The robot shuttles between the open-shelf area and the dense shelves, picks up goods independently, puts them into the cabinet, and then the courier takes them away.

"Seven pharmacies in Beijing are already in normal operation, and 100 will be deployed in Beijing, Shanghai, and Shenzhen by the end of this year." Wang He said. "A 24-hour pharmacy operates in three shifts, and the annual labor cost is more than 700,000 yuan. Our robots can reduce the cost to a lower level."

In a gift shop of a seven-star hotel in the Middle East, robots act as receptionists, attracting customers to shop.

The goal of embodied intelligence is not necessarily to replace existing robotic arms. After more than a year of industrial research, Wang Zhongyuan, the dean of the Zhiyuan Research Institute, found that the most suitable first-wave entry points are repetitive, tedious processes such as logistics sorting and laser coding, which demand more than ten hours of work per day, cause severe fatigue, and can even pose safety hazards.

Embodied intelligence may also be the key for Chinese manufacturing to go global. "In fact, most Chinese companies lose money when setting up factories in the United States and Europe because of high labor costs and expensive raw materials," said Professor Sun Fuchun from Tsinghua University. "The only way is to take robots there and operate them remotely through a cloud-edge-end architecture. This is an important issue that embodied intelligence will face in the next step."

However, real deployment takes far more than a "show":

"The cost of dexterous hands is very high. Those with sensors may cost more than 100,000 yuan, but their lifespan is only a few thousand cycles," one practitioner said bluntly.

Making humanoid robots "walk steadily" is also a challenge. Zhao Tongyang, the founder of Zhongqing Robotics, showed a scenario where a humanoid robot is required to walk from point A to point B in one building, take the elevator, change floors, and reach another building. "Theoretically, it is possible, but in reality, no one can really do it."

Another key point is the lifespan. The lifespan of a car is between 10 and 15 years, while currently the average lifespan of robots is about 2 years. "We expect to achieve a mechanical lifespan of 10-15 years within 5 years." Zhao Tongyang said.

Safety standards have also become the threshold for entering factories. For example, the battery must meet industrial-grade fire-proof and explosion-proof standards. Ternary lithium batteries and lead-acid batteries won't work.

Meanwhile, a more fundamental reflection is underway: In training embodied intelligence models, which path leads to stronger generalization ability? How do domestic methods differ from those used overseas? The answers bear on the roadmap for the future evolution of the underlying technologies.

02 After GPT, robots still lack a real brain

Before large models became popular, a robot could perform only one task, such as delivering meals, driving screws, or transporting materials. They were like well-trained operators with a single "instinct". Now the industry is trying to break this limitation.

"Before 2022, embodied intelligence faced single tasks, single scenarios, and single entities," said Zhang Shanghang, the director of the Embodied Multimodal Large Model Center of the Beijing Zhiyuan Research Institute. The turning point came in the year ChatGPT emerged, when robots began to have a "smarter brain".

The upsurge of embodied intelligence is essentially the integration of large models and robot technology. Multimodal large models bring stronger generalization ability, promoting the evolution of robots from "specialists" to "generalists". However, being a "generalist" is not easy. The industry believes that the challenges of embodied intelligence far exceed those of intelligent driving.

Zhang Shanghang explained that embodied intelligence currently follows three main technical routes: the end-to-end VLA (Vision-Language-Action) model, the "brain + cerebellum" architecture, and the world model.

Among them, the VLA model is the most intuitive. It takes human language and visual input and outputs action instructions, forming a fast closed loop. Wang He, the founder of Galaxy Universal Robots, believes that "VLA is very promising."

However, in the view of Professor Sun Fuchun from Tsinghua University, VLA is not enough.

"Fei-Fei Li especially emphasizes the role of vision and proposes spatial intelligence, which is the ability to perceive, reason, and act in three-dimensional space," Sun Fuchun said. In his view, however, VLA lacks the elements needed to discriminate physical properties and exploit physical laws, and it also lacks sufficient control trajectories. "This is exactly the reason why we are building a world model."

The so-called world model is a full-element model, and spatial intelligence is just a projection of the world model into visual space. Sun Fuchun's team plans to train a large model covering 2 million trajectories and 52 TB of data, aiming for highly generalized embodied intelligence across a variety of factories. Their benchmark is the world model built by NVIDIA, which covers 1.2 million trajectories and 32 TB of data.

The third path is the "brain + cerebellum" mode, a metaphor coined in China. The "brain" is responsible for task planning, and the "cerebellum" is responsible for specific execution. Its advantages are modularity and interpretability, which make it easier to implement. However, there are also thresholds. "Not all multimodal large models can serve as the 'brain'," Zhang Shanghang said. "For example, GPT-4o is not ideal as the brain of a robot because it lacks long-range planning and spatial understanding abilities."

Regarding the "brain + cerebellum" technical route, Dr. Tang Jian from the Beijing Humanoid Robot Innovation Center sees two main "sticking points". One is getting the "brain" to plan accurately: decomposing a complex task into a dozen or even dozens of precise steps is quite difficult. The other is the skill library of the embodied "cerebellum". Both need strong generalization ability, because the range of tasks is effectively unlimited.
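
The division of labor described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's implementation: the `plan` function is a fixed lookup table standing in for a multimodal large model, and `Step`, `SkillLibrary`, and `run` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Step:
    skill: str   # name of a "cerebellum" skill
    target: str  # object the skill acts on


def plan(task: str) -> List[Step]:
    """Hypothetical "brain": a lookup table standing in for a
    multimodal large model that decomposes a task into steps."""
    plans = {
        "pack color box": [
            Step("pick", "phone"), Step("place", "phone"),
            Step("pick", "charger"), Step("place", "charger"),
        ],
    }
    if task not in plans:
        raise ValueError(f"brain cannot plan task: {task!r}")
    return plans[task]


class SkillLibrary:
    """Hypothetical "cerebellum": maps skill names to executable routines."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, routine: Callable[[str], str]) -> None:
        self._skills[name] = routine

    def execute(self, step: Step) -> str:
        if step.skill not in self._skills:
            raise KeyError(f"no skill registered for: {step.skill!r}")
        return self._skills[step.skill](step.target)


def run(task: str, skills: SkillLibrary) -> List[str]:
    """Plan with the brain, then execute each step with the cerebellum."""
    return [skills.execute(step) for step in plan(task)]


# Toy skills that only report what they would do.
skills = SkillLibrary()
skills.register("pick", lambda target: f"picked {target}")
skills.register("place", lambda target: f"placed {target} in box")

log = run("pack color box", skills)
```

Both sticking points show up even in this toy: the planner fails on any task it has not seen, and the skill library fails on any skill that was never registered, which is why both halves need to generalize.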

Gao Yang also presented his team's classification of generalization ability in embodied intelligence. He regards L3 as a key milestone, and a relatively difficult one, because it means full autonomy in a specific environment.

The industry is making gradual progress. For example, at the Beijing Zhiyuan Conference, the Zhiyuan Research Institute released the embodied brain RoboBrain 2.0 and the cross-entity collaboration framework RoboOS 2.0. Through the framework, developers worldwide can connect the brain model with different robot "cerebellum" skills developed on the same entity with just one click, without an adaptation process. Both RoboOS 2.0 and RoboBrain 2.0 are fully open source.

Dr. Tang Jian from the Beijing Humanoid Robot Innovation Center also revealed that they plan to launch a unified development platform called 'Huisi Kaiwu' to help developers build all robot tasks in one consistent way. The skill library of the embodied "cerebellum" currently supports more than 30 skills, and the goal is to support more than 100.

Some industry insiders believe that the final competition over the "brain" and "cerebellum" will converge on companies with large-model R&D capabilities. "Because it is very costly, and it is based on multimodal models."

"In the next 5-10 years, the model of 'brain-cerebellum' integration may mature, but not today. The reason is simple: data is limited," Wang Zhongyuan said. A cerebellum model that truly works across entities will also have to wait until hardware designs are winnowed down and converge through successive rounds of industry iteration.

03 Without good data, robots can't learn to act

Although the brain architecture and technical routes are evolving rapidly, all routes ultimately revolve around a consensus: data, which is the toughest nut to crack in embodied intelligence.

"The biggest pain point we face is data," Gao Yang, the co-founder of Qianxun Intelligence, said bluntly, referring to both quality and quantity. His team proposed a Scaling Law for embodied intelligence, which has attracted industry attention.

"Large language models have the Scaling Law. We also studied embodied intelligence, collected about 40,000 real-world trajectories, and conducted about 15,000 real-world robot tests," Gao Yang said. "In short, the conclusion is that embodied intelligence also satisfies the Scaling Law. For every 10-fold increase in data collection, the error rate of robots falls to about one-tenth. If you want to increase the success rate from 99% to 99.9%, it means you need to collect 10 times more data, and the cost also increases exponentially."

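The arithmetic behind Gao Yang's example can be checked with a short sketch. It assumes, as a simplification of the quote, that the error rate is exactly inversely proportional to the amount of data; the function name `data_multiplier` is hypothetical, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction


def data_multiplier(current_success: str, target_success: str) -> Fraction:
    """How many times more data is needed to move between two success rates,
    ASSUMING error rate is inversely proportional to data volume
    (the "10x data, one-tenth the error" relation in the quote)."""
    current_error = 1 - Fraction(current_success)
    target_error = 1 - Fraction(target_success)
    if not 0 < target_error <= current_error < 1:
        raise ValueError("need 0 < target error <= current error < 1")
    return current_error / target_error


print(data_multiplier("0.99", "0.999"))   # 99% -> 99.9%: prints 10
print(data_multiplier("0.99", "0.9999"))  # 99% -> 99.99%: prints 100
```

Under this assumption each additional "nine" of reliability costs another tenfold increase in data, so a stricter target such as 99.99% would need about 100 times the data that reached 99%.
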
Applying this Scaling Law, Wang He of Galaxy Universal believes that deploying VLA in a car factory requires a success rate above 99.99%, because the car factory will be fined 10,000 yuan for every minute of downtime. If relying on real data, it may be necessary