
Embodied intelligence still requires "five years of patience."

GeekPark · 2025-09-17 16:11
Embodied intelligence needs five years of patience; the data bottleneck awaits a breakthrough through simulation, and Tesla has an edge.

I flew to Silicon Valley again last month and spoke with scientists and entrepreneurs working on embodied intelligence. The core takeaway: this grand story of embodied intelligence calls for "five years of patience." That judgment rests on an analysis of the field's current stage, its core bottlenecks, and its likely evolution path.

The Hot "Production Line Story" and the Cold Reality

Undoubtedly, the hottest area in the embodied intelligence track is humanoid robots.

The story many domestic embodied-intelligence companies are telling is that humanoid robots will be put onto production lines. But after in-depth conversations with several founders in the field, both at home and abroad, the shared concern is this: forcing an immature general-purpose robot onto an industrial production line built around precision and efficiency is, at present, a very hard task.

To pursue generality, robots must break away from the special-purpose programming and control routes of traditional robotic arms. A robot needs both a "brain" and a "cerebellum," with autonomous reasoning and control abilities. The biggest advantage of making robots human-shaped is precisely generality: compatibility with existing human tools, facilities, social environments, and production scenarios. But the "brain" of today's robots is still underdeveloped. Current technology can achieve "motion like a human," yet it is far from achieving "decision-making like a human." Robots can imitate smooth, human-like movements in a controlled environment, but their decision-making remains very weak in the face of dynamic change and unexpected situations in the real world.

At this stage, general-purpose robots are essentially trading precision and efficiency for generality. Since robotic arms that prioritize precision and efficiency have long been deployed at scale on production lines, introducing far-from-mature humanoid robots into scenarios that demand high precision and high efficiency is something of a mismatch.

It can fairly be asserted that today's general-purpose robots will struggle to enter any scenario where precision, efficiency, and cost-effectiveness are baseline requirements. In many cases, the deployments that startups claim to have achieved are mostly demonstrative, experimental, or even financing-driven; they are not truly rational, market-based, cost-effective transactions.

To put it bluntly, the core value that general-purpose robots, and humanoid robots in particular, currently provide is closer to "emotional value": using continuous progress in capability to shape social consensus and expectations, and thereby attract more resources to accelerate the technology.

That is not meaningless. The Apollo moon-landing program of the 1960s was, at the time, an "unreasonable" plan both technically and commercially, and it did not quickly produce commercial value; its essential goal was emotional value under specific historical conditions. Yet the resources it aggregated and the talent and technology ecosystem it built were enormously significant for the development of aerospace technology, and decades later they produced huge commercial value in that field.

Embodied intelligence, and humanoid robotics in particular, is for now more like a growing child: every bit of progress ignites our imagination and confidence in the future. The problem is that the "parents" need a clear-eyed view. Even if the child shows amazing potential and unexpected progress, growing up and experiencing the world are still the priorities at this stage; prematurely judging whether the child can shoulder the burden of supporting the family is a mistake. If the "parents" confuse confidence in a demo with readiness for commercial deployment and overdraw on the child's future, the praise can quickly turn to criticism. For example, when many "production line stories" fail to materialize next year, the industry may suffer a real setback.

So what would a correct expectation look like? For general-purpose robots, we can draw an analogy with large language models (LLMs). A reasonable expectation I've gathered is that within one to two years, embodied intelligence will have its "GPT-3.0 moment": in a laboratory setting, insiders will see clear technological breakthroughs in the robot's general model (brain plus cerebellum) and converge on a mainstream technical route, much like the shock GPT-3 gave the industry when it appeared.

But the journey from 3.0 to 3.5 (ChatGPT), which the public could actually use, and then to 4.0, which began to build a new industrial ecosystem, is still a long one. We may need "five years of patience."

A Key to Reaching the "GPT-3.0 Moment": Can the Data Problem Become a Computing Power Problem?

What is the core problem to solve in moving from the current stage, dominated by emotional value, to the next stage of technological breakthrough that excites industry insiders, the so-called "GPT-3.0 moment"?

Some core practitioners I've spoken with believe the key is breaking through the data bottleneck. Although the model route has not fully converged, switching model architectures may involve only a few hundred lines of code; once someone finds the right idea, others can follow quickly, so architecture alone is hard to turn into a lasting barrier. The real gap lies in obtaining large-scale, high-quality, diverse data.

One way to obtain data for embodied intelligence is to collect it from the real world: human operators teleoperate the robot, much like playing a VR game, or record motions through direct teaching.

Producing this real-world data has three limitations. First, it does not scale. Second, the cost does not come down. Third, and most importantly, it lacks diversity: you can only collect data from scenarios you can physically set up. Could you have a robot practice picking up an apple at a specific angle on a table corner ten thousand times in the real world? Almost impossible, to say nothing of dangerous, extreme corner cases.

This data dilemma contrasts sharply with embodied intelligence's other large-scale field: autonomous driving, currently the only area without a pre-training data bottleneck. Every car on the road, whether or not its autonomous driving features are enabled, continuously collects real-world driving data through its cameras and sensors. Car companies therefore obtain massive, diverse, real pre-training data at very low marginal cost. General-purpose robots enjoy no such advantage, which makes their data-acquisition problem especially acute.

It is these limitations that make data the narrowest bottleneck in the entire embodied intelligence track.

Recently, many teams worldwide have been confronting this problem and pushing a paradigm shift: use a high-precision physics engine to effectively transform the data problem into a computing power problem.

In a sufficiently realistic simulator, you can use code, rather than human labor, to create effectively unlimited data. Want to change the table material? One line of code. Want to change the lighting direction? One line of code. Want an object to fall ten thousand times from minutely different angles? A simple loop is enough. Diverse data that used to take a team months to collect may now require only a rack of GPUs running overnight.
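The "one line of code per variation" idea can be sketched concretely. The snippet below is an illustrative, self-contained Python sketch, not tied to any real physics engine: the SceneConfig fields, value ranges, and material list are all made-up stand-ins for parameters a simulator would randomize.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneConfig:
    """One randomized trial setup (hypothetical parameters for illustration)."""
    table_material: str
    light_azimuth_deg: float
    drop_angle_deg: float
    drop_height_m: float

def randomize_scene(rng: random.Random) -> SceneConfig:
    # Each varied property is literally one line of code.
    return SceneConfig(
        table_material=rng.choice(["wood", "metal", "glass", "cloth"]),
        light_azimuth_deg=rng.uniform(0.0, 360.0),
        drop_angle_deg=rng.uniform(-90.0, 90.0),
        drop_height_m=rng.uniform(0.05, 0.5),
    )

def generate_dataset(n: int, seed: int = 0) -> list[SceneConfig]:
    # "A simple loop is enough": n diverse trials from a single seed.
    rng = random.Random(seed)
    return [randomize_scene(rng) for _ in range(n)]

configs = generate_dataset(10_000)
print(len(configs))  # prints 10000
```

In a real pipeline each SceneConfig would be fed to the physics engine to render and simulate a trial; the point is that diversity becomes a sampling loop whose cost is pure compute.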

This completes the evolution from manual production to an automated "data factory." Data is no longer a scarce resource that must be laboriously collected, but an industrial product that computing power can generate on demand. That is the core meaning of turning the data problem into a computing power problem.

The mainstream expectation I've heard is that within the next one to two years the industry has a chance to see a model with real generalization ability and to converge on a mainstream technical route, much like the shock GPT-3 delivered at its debut. This is the crucial step from 0 to 1.

A Long Road from "GPT-3.0" to "4.0"

The journey from the 3.0 moment that excites insiders to a 4.0 stage the public can use safely and reliably is the longest stretch of the five years of patience. Behind it lie the uniquely harsh physical constraints of embodied intelligence:

First, the limits of simulation mean it cannot, on its own, complete the leap from 3.0 to 4.0. Simulation data is no panacea. A common industry consensus is that simulation can efficiently solve 90% of the model problem, but the final gap, from 90% to 99.999%, must still be filled with real-world data.

No matter how realistic a simulated world is, it is only an approximation of reality. It can simulate Newton's laws perfectly and teach robots the "Physics 101" of the world: objects fall, and they bounce after collisions. On massive simulation data, robots can build a general understanding of how the world works. But the real world is full of simulation's nightmares: long-tail details that are hard to capture in mathematical formulas. How does a soft cloth wrinkle over a table corner? How complex are the friction and deformation of a crushed aluminum can's surface? How do the reflections and flow of a puddle affect visual judgment?

Simulation can solve the breadth problem, roughly 90% of the capability, like a perfect driving school teaching robots general skills. But what determines 100% reliability is the last 10% of long-tail detail: the real physical world that simulation cannot perfectly reproduce. This sim-to-real gap (Sim2Real gap) must be bridged with real data.

So an increasingly clear path in the industry is this: use large-scale simulation data to build the robot's basic understanding of the physical world and its general capabilities (solving the first 90%); then use high-value real-world data from specific scenarios for final fine-tuning, bridging the sim-to-real gap, cracking the hardest corner cases, and conquering the last 10%.
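As a toy illustration of this two-stage recipe (not any lab's actual pipeline), the sketch below pretrains a one-parameter model on plentiful "simulated" data whose physics is slightly wrong, then fine-tunes it on a handful of "real" samples. The TinyPolicy class, the learning rates, and the 2.0-vs-2.2 physics constants are all invented for illustration.

```python
import random

class TinyPolicy:
    """Toy one-parameter model y = w * x, standing in for a robot policy."""
    def __init__(self):
        self.w = 0.0

    def update(self, x, y, lr):
        # One gradient step on the squared error (w*x - y)^2
        self.w -= lr * 2 * (self.w * x - y) * x

def two_stage_training(sim_data, real_data):
    policy = TinyPolicy()
    # Stage 1: plentiful, cheap simulated data builds general capability
    for x, y in sim_data:
        policy.update(x, y, lr=0.05)
    # Stage 2: scarce, high-value real data closes the Sim2Real gap
    for x, y in real_data:
        policy.update(x, y, lr=0.01)
    return policy

rng = random.Random(0)
# The simulator's physics: y = 2.0 * x; the real world: y = 2.2 * x
sim_data = [(x, 2.0 * x) for x in (rng.uniform(-1, 1) for _ in range(5000))]
real_data = [(x, 2.2 * x) for x in (rng.uniform(-1, 1) for _ in range(50))]

policy = two_stage_training(sim_data, real_data)
# After pretraining w is near 2.0; fine-tuning nudges it toward the real 2.2
```

With only 50 real samples the model does not reach 2.2 exactly, which mirrors the article's point: simulation gets you close cheaply, while the last stretch depends on how much real data you can gather.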

This brings the second constraint: the trial-and-error cost and loop speed of the physical world are nowhere near those of large models. A large model can run thousands of virtual trial-and-error iterations in a second. A robot's "hallucination," whether a force-control error or a bad path plan, can mean task failure, property damage, or even a safety accident. Each trial in the physical world is not just costly but slow: a single action takes seconds, a task sequence takes minutes. This physical loop speed, measured in seconds or minutes, is orders of magnitude slower than the millisecond-scale iteration of the digital world.

Moreover, a key step in LLMs' evolution from 3.0 to 4.0 was the introduction of large-scale human feedback (RLHF), which relied on the ability to instantly distribute software to millions of users. For robots to gather large-scale, diverse real-world feedback, large fleets of robot hardware must first be deployed in real environments; but for the market to accept large-scale deployment, robots must first achieve very high reliability and cost-effectiveness. This chicken-and-egg problem between hardware deployment and mature intelligence is a huge commercial and engineering obstacle that simply does not exist in the software world.

The real world also has fatter, longer tails. The long-tail problem in language is already hard, but in the physical world its complexity grows exponentially. Take the same "open the door" task: the door's weight, the handle's shape, the hinge damping, even slight changes in lighting can all break the model. The physical world is full of continuous, high-dimensional, noisy variables, which means its corner cases are distributed far more densely, and fail far more fatally, than in the text world.

Consider autonomous driving, the "wheeled embodied intelligence" of a relatively constrained scenario. Even with massive real-world data, after solving 99% of the problem it has spent nearly a decade struggling with the last 1% of long-tail scenarios. The task space of a general-purpose robot, which must physically interact with countless objects of every shape, is far more complex than driving on a two-dimensional plane.

So the "five-year horizon" is not an arbitrary number. It is a rational expectation grounded in the physical constraints, hardware bottlenecks, and commercial realities above: at least one to two years to reach the exciting "GPT-3.0 moment," then at least three to four more for gradual hardware deployment, long-term accumulation of real-world data, and the grinding work of conquering the physical world's endless long tail, before we truly arrive at a reliable, usable "GPT-4.0 era."

Patience of five years, or even longer, is therefore the rational expectation.

Who Can Finish This Marathon?

Embodied intelligence is a long, hard road. What kind of participant can finish it? Who is most likely to win?

Based on the analysis above, we can roughly sketch the essential qualities of the final players:

1. A World-Class AI Team:

Able to transform the data problem into a computing power problem via a high-precision physics engine, accelerating the path to the 90% stage.

2. Massive Real-World Data:

To bridge the Sim2Real gap, solve the long-tail corner cases, and cover the last stretch from 90% to 99%.

3. Top-Tier Industrial Manufacturing Capability:

To resolve the hardware-deployment paradox: mass-producing and deploying the robots' "real bodies" into the physical world at controllable cost and with reliable quality.

4. Abundant Capital and Firm Belief:

To endure the slow loop speed of the physical world and sustain high-cost, high-uncertainty investment over years, even a decade.

When we hold current players up against this list, the answer may sound clichéd, but the most prominent figure who meets every requirement is Elon Musk. He has not only a top-tier AI team, abundant capital, and almost unmatched personal conviction; more importantly, he has demonstrated world-class, structurally advantaged dominance in the data closed loop and in industrial manufacturing, making him the clearest front-runner.

Of course, even if this is today's reality, understanding it is not the same as simply accepting it. I look forward to new forces emerging as variables that upend this logic. The future is yet to be written, and new history is never predetermined by reasoning.

This article is from the WeChat official account "Zhang Peng's Technology and Business Observation," by Zhang Peng, republished by 36Kr with permission.