Behind the Embodied Intelligence Boom Stands Half of the Autonomous Driving Industry
Autonomous driving has proven over the past decade that in specific scenarios, "being usable" is more important than "seeming human-like." Embodied intelligence is now repeating this truth.
In 2017, the autonomous driving industry once believed that the ultimate goal was just around the corner.
That year, L4 and L5 autonomous driving were discussed repeatedly, and driverless taxis were seen as an imminent future. A decade on, however, it wasn't the companies that first promised "driverless" technology that changed the industry landscape, but those that were first to integrate assisted driving into mass-produced vehicles and establish a closed-loop data system.
Today, embodied intelligence stands at a similar crossroads.
At the just-concluded Beijing Academy of Artificial Intelligence Conference, humanoid robots serving coffee, playing table tennis, and performing dynamic sorting attracted numerous onlookers. The "ChatGPT moment for robots" also became a frequently discussed topic.
On the other hand, more and more entrepreneurs from the autonomous driving industry are now talking about how to find the first stable operating scenario, how to establish a closed-loop data system in the real world, and how to make robots "start running before getting smarter."
Liu Dong, the founder of Xingyuanzhi, refers to autonomous driving as the "simplest form of embodiment." In his view, moving from obstacle avoidance and navigation in two-dimensional space to physical interaction in three-dimensional space means robots face more complex problems than autonomous vehicles do.
Is embodied intelligence replicating the narrative rhythm of intelligent driving? Why are those from the intelligent driving industry becoming the key variable in the industry's transformation? Will their "gradual implementation" approach lead embodied intelligence down a different path?
When an Unarrived Moment Becomes the Industry's Totem
At the Beijing Academy of Artificial Intelligence Conference, Wang He, the founder and CTO of Galaxy Universal, said that the ChatGPT moment for robots will arrive when a model has zero-shot learning ability, can complete 70% to 80% of human skills in specific scenarios without specialized training, and is accessible enough that even junior high school graduates can operate it.
Liu Dong, the founder of Xingyuanzhi, believes that "the same is true for embodied intelligence now. Everyone is aiming at practical application scenarios, but there are still not many L2-level implementations. It's similar to the state of intelligent driving in 2015 and 2016, just at the beginning."
Around 2017, the autonomous driving industry was also filled with similar optimistic expectations: L4-level driverless technology was generally considered to be "ready for mass production in three to five years." However, in real mass-produced vehicles at that time, even functions like lane-keeping on highways and adaptive cruise control were still being refined.
Whether it was autonomous driving back then or embodied intelligence today, "the end goal is discussed before the path." The industry first forms a collective imagination of the future and then looks back to find the engineering route to get there.
At the Beijing Academy of Artificial Intelligence Conference, this misalignment was presented in another form. Humanoid robots serving coffee, playing table tennis with humans, and performing dynamic sorting on the assembly line attracted large crowds in front of the exhibition booths.
Meanwhile, Xingyuanzhi's newly released ω-EVA model achieved a success rate of 98.6% on LIBERO, and the task success rate on RoboTwin increased from 88.9% to 90.3%.
The numbers are impressive, but in an interview Liu Dong still gave a sober assessment of how far implementation has come: purely mobile inspection and guidance services are already relatively mature; about 90% of grasping and placing scenarios have been solved, but some items remain difficult to handle; as for complex operations such as hotel cleaning and home services, "it's still quite difficult to implement in the short term."
This doesn't mean that demos have no value. On the contrary, in a new technology field, demos are a necessary proof of the feasibility of the technical route.
However, it's important to note that demos prove "this can be done under specific conditions," while delivery requires "this can be done repeatedly under changing conditions."
It took autonomous driving 10 years to bridge this gap.
However, the enthusiasm of capital and the industrial side has already arrived ahead of time. Wang Zhongyuan, the director of the Beijing Academy of Artificial Intelligence, mentioned that at least 15 CEOs of embodied intelligence companies with valuations exceeding 10 billion yuan gathered at this conference. "Embodied Intelligence and Humanoid Robots" was one of the most popular forums.
It is hard not to associate this with the "All in AI" trend in autonomous driving circles in 2017. At that time, any project that carried the words "autonomous driving" saw its valuation and exposure rise automatically.
However, the actual business progress may not keep up with the narrative rhythm. Xingyuanzhi is one of the few companies that can present specific implementation cases: the embodied brain on forklifts, robot dogs picking up trash in open areas, and automated sorting in logistics scenarios.
Liu Dong mentioned that these partnerships were negotiated with customers on a case-by-case basis. Data needs to be shared, and scenarios need to be customized. This is not the kind of "release and it's universal" narrative. Instead, it's about finding a specific scenario first, getting the system running in it, and then discussing generalization.
So, if we have to draw a line between autonomous driving and embodied intelligence, it may not be an overlapping narrative, but rather that the two industries face the same temptation at similar stages.
The "Second Entrepreneurship" of a Generation of Intelligent Driving Professionals
Liu Dong is one of many embodied intelligence founders with a background in autonomous driving.
Autonomous driving solved the problem of "keeping a vehicle from hitting objects on a flat surface," while embodied intelligence has to deal with "enabling devices to interact with objects in three-dimensional space." Liu Dong compares intelligent driving to the simplest form of embodiment. "When intelligent driving was being developed, it was about avoiding all objects within a two-dimensional plane without interacting with them. In the field of embodied intelligence now, in addition to precise navigation and movement, devices also need to interact with objects in three-dimensional space."
The difference between "avoiding" and "picking up" may seem like just an increase in action complexity, but in terms of engineering implementation, it represents a completely different set of system constraints.
In autonomous driving, cameras and lidars are mainly used for environmental perception and obstacle recognition, and the decision-making process is relatively clear: see, judge, and detour. In embodied intelligence, a device not only has to "see" a cup but also decide "how to pick it up," "whether it will spill when picked up," and "whether the placement position is accurate."
Force control, tactile sensing, and multi-modal synchronization, which barely feature in autonomous driving, are everyday concerns in embodied intelligence.
So, when these people from the intelligent driving industry enter the field of embodied intelligence, they bring not only the transfer of technical stacks but also a set of industrial memories from their previous experiences.
In 2017, the autonomous driving industry was collectively tempted by the idea of full-stack in-house development. Companies tried to do everything themselves, including algorithms, hardware, data, and vehicles. The logic at that time was that only a closed-loop system could provide the best user experience. However, the subsequent industrial reality proved that until sales volume reached a certain scale, full-stack in-house development was an extremely expensive gamble.
When asked if full-stack development by leading companies would affect them, Liu Dong's answer showed the influence of this experience: "Before the actual sales volume increases, it's impossible for a company to support full-stack R&D investment, unless it's as large as Tesla and has no financial concerns."
He further predicted that among the nearly 200 embodied intelligence companies in the market, only "two or three at most" have the ability to achieve a full-stack closed-loop system. More companies will face a choice: develop the core technology from scratch or purchase it from a third party?
The autonomous driving industry ultimately proved that the threshold for full-stack in-house development is extremely high, and only a few automakers can afford it. As a result, the industry gradually split: some new players with strong financial and technical capabilities chose deep in-house development, while more automakers, including some traditional giants and new brands lacking in-house capabilities, began working with suppliers such as Huawei, Momenta, DJI, and Baidu, or adopted a compromise of "developing some modules in-house + outsourcing core algorithms."
Liu Dong believes that the field of embodied intelligence will also show a similar pattern: "Some companies are good at making the physical body, while others are good at developing models, similar to what we saw in the development of the autonomous driving industry in the automotive sector."
Based on this judgment, Xingyuanzhi chose not to make the physical body itself. Outside observers have compared Xingyuanzhi to "the Huawei of the embodied intelligence track": it provides brain models and edge computing platforms that cover more than 70% of the leading physical body customers in the market.
It's hard to say whether this choice is due to the "lessons learned from failure" in 2017 or simply because engineers are used to the efficiency logic of industrial chain division. However, one thing is clear: when a team has experienced the stage of "wanting to do everything by themselves," they will think earlier about "what should be done by others" when entering a new battlefield.
In addition to the differentiation in business models, people from the intelligent driving industry also bring a practical understanding of "implementation."
In the field of autonomous driving, they have experienced the debate between "cloud computing power" and "vehicle-side computing power" and understand the significance of controlling latency for safety systems.
This experience has been reactivated in embodied intelligence. When explaining why edge deployment is necessary, Liu Dong cites physical constraints rather than technical preference. With more than a dozen cameras and three lidars, a robot generates several gigabytes of data per second. If that data is transmitted to the cloud via Wi-Fi or 5G, "the robot will have already crashed by the time the cloud finishes inference."
So, they accept the inevitability of edge closed-loop systems earlier, rather than treating it as an option for discussion.
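The bandwidth argument can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the article says "several gigabytes" per second without giving a figure, so the 3 GB/s sensor rate is an assumption, as are the uplink speeds chosen for 5G and Wi-Fi. The function name is hypothetical.

```python
# Back-of-the-envelope sketch. Every number here is an assumption for
# illustration; none comes from Xingyuanzhi or from a network specification.

def seconds_to_upload(sensor_gb_per_s: float, uplink_mbit_per_s: float) -> float:
    """Seconds needed to upload one second's worth of raw sensor data."""
    megabits_produced = sensor_gb_per_s * 8 * 1000  # 1 GB = 8,000 Mbit (decimal units)
    return megabits_produced / uplink_mbit_per_s


SENSOR_GB_PER_S = 3.0  # assumed stand-in for "several gigabytes" per second
UPLINKS_MBIT_PER_S = {  # assumed sustained uplink rates
    "5G uplink": 100.0,
    "Wi-Fi": 500.0,
}

for name, rate in UPLINKS_MBIT_PER_S.items():
    lag = seconds_to_upload(SENSOR_GB_PER_S, rate)
    print(f"{name}: {lag:.0f} s to send 1 s of sensor data")
```

Under these assumed figures, each second of raw sensor data takes tens to hundreds of seconds to upload, before any cloud inference or return trip is counted. That shortfall is the physical constraint Liu Dong describes.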
Embodied Intelligence Can't Rely on "Brute Force" to Achieve Miracles
Looking back at the development of autonomous driving, L5 and Robotaxi were the most attention-grabbing, but it was ADAS and L2+ assisted driving that first entered the transportation system.
They may not be as exciting as L5, but they have been accumulating data and improving the system during continuous operation, providing a foundation for the further evolution of autonomous driving.
Embodied intelligence is going through a similar process. The dream of a home nanny robot is still far away, and the general-purpose robot brain is not yet mature. However, scenarios such as forklifts, robot dogs, and logistics sorting have started to be implemented. They may not be the most human-like, but they have the best chance of establishing a closed-loop data system first.
If ADAS is the bridge for autonomous driving to reach L4, then today's forklifts and robot dogs are the bridge for embodied intelligence to reach AGI.
Liu Dong divides the implementation difficulty into three levels. The first level is "pure movement": inspection, guidance, and product promotion. Robots only need to move in space, identify targets, and take photos for records without complex physical interaction with objects. This level is already relatively mature. For example, Xingyuanzhi's robot dogs picking up trash and doing cleaning in open areas are applications at this level.
The second level is "grasping and placing operations": sorting in warehouses, loading and unloading in pharmacies, and simple material handling in factories. Liu Dong admits that about 90% of the scenarios at this level have been solved, but "there are still some items that are difficult to handle, and the success rate is not high." This 10% gap may seem small, but in a real business environment it can decide whether a customer is willing to sign a contract. In autonomous driving terms, this resembles the state of highway NOA when it first launched: it works, but users are still hesitant to let go completely.
The third level is "complex operations": hotel cleaning, home services, and precise assembly. These scenarios involve multi-step task chains, unstructured environments, and operations on flexible objects. Liu Dong believes that "it's still quite difficult to implement in the short term."
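A rough calculation shows why the 10% gap at the second level matters commercially, and why the multi-step task chains at the third level are harder still. This is a sketch under stated assumptions: it treats the article's "about 90%" as a per-item success rate and assumes each step in a chain succeeds independently, neither of which the article claims. The function names and the example volumes are hypothetical.

```python
# Illustrative arithmetic only. The 0.9 rate is a reading of "about 90%",
# and independence between steps is a simplifying assumption.

def expected_manual_interventions(items: int, success_rate: float) -> float:
    """Expected number of items a human must handle after the robot fails."""
    return items * (1.0 - success_rate)


def chain_success(step_success_rate: float, steps: int) -> float:
    """Probability that every step of a task chain succeeds (independent steps)."""
    return step_success_rate ** steps


# A hypothetical warehouse shift of 1,000 picks at a 90% success rate.
print(expected_manual_interventions(1000, 0.9))

# A hypothetical five-step task where each step succeeds 90% of the time.
print(chain_success(0.9, 5))
```

With these assumed inputs, roughly one pick in ten still needs a person, and a five-step chain completes successfully less than 60% of the time. This is one way to see why single-step reliability does not carry over to complex operations.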
This "leveling" approach not only reflects the engineering pragmatism transplanted from intelligent driving but is also shaped by the data constraints of embodied intelligence. Sun Zhenguo, the co-founder of Xingyuanzhi, mentioned in an interview that large language models can freely obtain almost infinite text data from the Internet, but embodied intelligence doesn't have "Internet-level physical data."
Data collection factories led by local governments have deployed large numbers of robotic devices to collect motion data, but the amount collected is still far from enough for large-scale training. Large language models can have parameters in the hundreds of billions or even trillions, while embodied intelligence models still sit in the range of a few billion to tens of billions.
This bottleneck means that embodied intelligence can't break through overnight by relying on "brute force," as large language models did. Like autonomous driving, it has to run in specific scenarios, feeding the model with real physical interactions.
When describing the forklift case, Liu Dong provided a telling detail: logistics warehouses had automated solutions before, but they were "rule-based." They required trucks to park in precise positions, goods to sit on the right pallets, and pallets to have a precise shape.
The value of the embodied brain lies in its ability to "flexibly handle different tasks." Whether trucks differ in size, goods differ in shape, or pallets are present at all, the system can still plan the unloading logic independently, deciding what to unload first and what to unload later to avoid collisions and incomplete unloading.
This "flexibility" is not achieved through a larger model all at once but is gradually refined through "closed-loop data in specific scenarios."
Xingyuanzhi's forklift project took about "two months" to develop the first version of the system, which is quite fast in the field of embodied intelligence. However, Liu Dong emphasized that they reused the previous algorithm base and "completely deployed it on the edge side."
This also matches Liu Dong's prediction of the future pattern: embodied brain companies will eventually "diverge into specialized companies in different vertical fields." Some will be good at home scenarios, some at logistics scenarios, and some at industrial operations. This is similar to the pattern in the autonomous driving industry, where specialized companies have emerged in each niche: highway NOA, urban NOA, memory parking, valet parking, and so on.
So, going back to the original question, is embodied intelligence repeating the story of autonomous driving? The answer is that the narrative rhythm is indeed similar. The ultimate goal is over-consumed in advance, there is a gap between demos and actual delivery, and the industry will initially pursue the most "