
The lack of consensus on embodied intelligence is actually the best consensus.

Embodied Learning Society · 2025-11-26 07:27
It's more lively when there is no consensus.

In the early days of any technology, there are always those who try to find the one and only correct path, hoping to pierce the fog with a single bet. However, the complexity of embodied intelligence is reminding the industry that it doesn't grow along a single path; instead, it is "sculpted" through countless rounds of trial and error, conflict, and reconciliation. Imperfect models, incomplete data, and non-unified architectures may sound like flaws, but they are precisely where the most genuine vitality of embodied intelligence lies.

As expected, embodied intelligence was still forging ahead at full momentum as 2025 drew to a close.

Even more predictably, there is still no consensus on embodied intelligence.

At the 2025 Zhipu Embodied OpenDay Round-Table Forum, the top domestic practitioners in embodied intelligence held a candid exchange of differing views. Whether the topic was the choice of model architecture or the use of data, the discussion produced no unified development direction. For a while, many people expressed regret over the lack of consensus on embodied intelligence.

However, the Embodied Learning Society believes that the "lack of consensus" also means embodied intelligence is still worth looking forward to, and that the technology can still "surprise" us. After all, a clear direction might actually be a bit boring. Once we stop relying on "certainty", we can start to discern some trends. Perhaps the "lack of consensus" is itself a kind of consensus.

Image source: Zhipu Institute

From an industrial perspective, the lack of consensus has three positive implications:

Firstly, the lack of consensus breaks any single technical route's monopoly on the discourse, keeping the industry out of the innovation trap of "path dependence". In embodied intelligence, from the split in technical routes between "hierarchical architecture vs. end-to-end" to the implementation choice between "general humanoid robots vs. scenario-specific embodied intelligence", the lack of consensus gives teams with different technical philosophies and academic backgrounds equal room for trial and error.

Secondly, consensus in mature industries often comes with high entry barriers. The "lack of consensus" in embodied intelligence gives small and medium-sized enterprises, start-up teams, and even players from other industries a chance to leapfrog incumbents. Without having to follow existing technical standards or business rules, new entrants can join the field with differentiated advantages.

Thirdly, embodied intelligence is an interdisciplinary field whose technical foundation is still evolving rapidly. Forming a consensus too early may lock in the technical path and limit the industry's ability to break through to a higher level. The core value of the lack of consensus lies in reserving "elastic space" for technological iteration.

At the Zhipu Embodied OpenDay Round-Table Forum, the extensive discussion of the "lack of consensus" also pointed to more possibilities. Based on the responses of the guests present, the Embodied Learning Society identified five signals for embodied intelligence; the future development direction may be hidden in them.

The models aren't good enough; some want to start anew

Signal 1: The world model can't shoulder the heavy responsibility for now

In any discussion of embodied intelligence models, the much-hyped world model is an unavoidable topic.

Its core value lies in "prediction". The round-table guests widely agreed on the goal of enabling robots, like humans, to predict the next change from the current spatio-temporal state and then plan actions. Wang He, an assistant professor at Peking University and the founder of Galaxy General, took robot motion control as an example. He pointed out that whether it is the bipedal walking and dancing of humanoid robots or the fine manipulation of dexterous hands, the underlying control logic requires the ability to predict physical interactions, and the world model can provide such support. However, for the world model to truly serve robots, its training data must contain more data about the robots themselves.

The shortcomings of the world model are also prominent, and on its own it is unlikely to be the "all-purpose solution" for embodied intelligence. Wang He emphasized that many current world models rely on videos of human behavior for training. Because the physical structures of robots (such as wheeled chassis and multi-degree-of-freedom robotic arms) differ greatly from the human body, such data is of limited help for actual robot operation. Cheng Hao, the founder and CEO of Acceleration Evolution, also mentioned that in real-world scenarios such as cooking and complex assembly, the prediction accuracy of the world model is still insufficient. Progress can only be made by using hierarchical models to solve simple tasks first and then gradually iterating and upgrading.

Signal 2: It's necessary to create new models

Since the existing models can't meet the requirements, "creating models exclusive to embodied intelligence" has become the consensus of many enterprises.

Zhao Xing, an assistant professor at the Institute for Interdisciplinary Information Sciences of Tsinghua University and the CTO of Xinghai Map, said that embodied intelligence requires a "Large Action Model" that stands alongside large language models. This type of model should focus on "actions" rather than language. He explained that the evolution of human intelligence followed the sequence of "actions first, then vision, and finally language", and robots should follow a similar logic when adapting to the physical world. For example, when driving, humans rely on visual perception to observe road conditions and use actions to control the steering wheel, without any need for language "translation". Embodied models should likewise prioritize closing the "vision-action" loop.

Wang Qian, the founder and CEO of Independent Variable, had a more specific view. He believes that embodied intelligence requires a set of "foundation models for the physical world" that can both control robot actions and, acting as a world model, predict physical laws. Multimodal models in the virtual world are trained on text and images, but the fine-grained processes of friction, collision, and force feedback in the physical world cannot be accurately described in language. When a robot picks up an egg, it needs to sense the fragility of the eggshell and adjust its grip force. This understanding of physical properties must rely on models trained specifically for the physical world.

Signal 3: Innovate from the underlying architecture

In the past few years, the Transformer architecture, with its cross-modal processing ability, has supported the explosion of large language models such as ChatGPT. However, its suitability for embodied intelligence is being questioned. Zhang Jiaxing, the chief AI scientist of China Merchants Group, is a representative of this view. He said bluntly that "embodied intelligence cannot follow the old path from LLM to VLM".

In his opinion, the Transformer architecture is centered on language: it maps modalities such as vision and action to language, which runs contrary to how the physical world operates. When humans perform actions, visual perception directly guides muscle movement without any need for language "translation". He revealed that top Silicon Valley teams are exploring new architectures such as "Vision First" or "Vision Action First" to allow direct interaction between vision and action and reduce the loss caused by language mediation.

Wang He added that the Transformer, as a cross-modal attention mechanism, is very versatile. For example, it can handle text, video, and audio modalities. However, "the problem with embodied intelligence today is that we humans have eyes, ears, mouth, nose, and tongue, so many 'perceptions'. Although from the perspective of Attention, these 'perceptions' can be tokenized and put into Transformer, the output doesn't seem to be ideal. The fundamental challenges are data issues and the corresponding learning paradigms."

Wang He proposed that in the short term, simulation and synthetic data are the core means of speeding up exploration; in the long term, the number of humanoid robots deployed in the real world must keep expanding rapidly. Only when a large enough "robot population" and improvements in capability reinforce each other can truly powerful embodied large models be developed.

This mismatch in the underlying architecture makes the industry realize that to achieve a breakthrough in embodied intelligence, it may be necessary to innovate from the root of the architecture rather than making minor repairs within the existing framework.

Data remains a bottleneck, and the demand is growing

Signal 4: There is no perfect data, only suitable choices

"Data is the fuel for embodied intelligence" was the consensus of the round-table forum. However, there is no unified answer to the question of "what data to use". Because each data type has its own strengths and weaknesses, enterprises generally adopt a strategy of "multi-source integration and on-demand selection", matching the most suitable data sources to the task scenario.

Real-robot data is the most "authentic" choice: it directly reflects the interaction laws of the real physical world, so it has become the first choice for fine-manipulation scenarios. The Xinghai Map team, where Zhao Xing works, insists on collecting data in real-world scenarios and treats authenticity and quality as the starting point for real-robot data collection. Luo Jianlan, partner and chief scientist of Zhiyuan Robotics, also emphasized that Zhiyuan Robotics insists on real data, collecting it in real scenarios rather than relying solely on data-collection factories, and is exploring a way to build a data flywheel by having robots generate data autonomously.

Simulation data, with its advantages of "low cost and scalability", has become the main force for low-level control training. Wang He believes that in reinforcement learning, it is difficult to repeatedly test many extreme scenarios (such as robot falls and robotic-arm overloads) on real robots, while simulators can quickly generate large amounts of such data to help models learn coping strategies. In his view, the simulator is not a negation of the real world: starting with a simulator gives embodied enterprises a good base controller, which in turn lets them get the data flywheel turning in the real world.

The Acceleration Evolution team led by Cheng Hao adopts a similar strategy. They first use simulation data to give robots basic motion-control capabilities and then fine-tune on real-robot data to adapt to real-world scenarios. "One of our goals in training with simulation data is to enable robots to obtain more real-world data later. Only with real-world data can the overall capabilities be further improved." In Cheng Hao's view, this is likely to be a process of continuous improvement.

Video data has become an important supplement for training base models. Wang Zhongyuan, the dean of the Zhipu Institute, believes that the logic of "training base models with video data" resembles how children get to know the world by scrolling on a phone: they first learn about the world through videos and then improve their skills through real-world interaction. Video data contains multi-dimensional information such as space-time, causality, and intention, and can be obtained at scale. It is the "optimal compromise solution" when large amounts of real-robot data are lacking.

However, when the Embodied Learning Society asked "how to solve the problem of fine-grained tactile and force-control data from video learning?", Wang Zhongyuan admitted that videos do lack force feedback and tactile information, but said this does not diminish their value. The Embodied Intelligence Laboratory of the Zhipu Institute is also equipped with data-collection devices with force feedback. Video data is mainly used for "laying the foundation" and needs to be combined with other data for targeted optimization and fine-tuning.

Signal 5: Embodied enterprises need data in terms of "quantity", "quality", and "variety"

As embodied intelligence moves into complex scenarios, the industry's demand for data keeps escalating. It wants large "quantity", high "quality", and more "variety", resulting in an ever-growing "appetite for data".

Firstly, there is a strong demand for "quantity". "Internet-scale" data has become the industry's common expectation. For example, Zhao Xing believes that large-scale data can, in turn, drive the evolution of models and the emergence of intelligence. Wang Zhongyuan also said that "better embodied large models may only appear after a large number of robots solve specific problems in real-world scenarios and accumulate 'Internet-scale' data for embodied intelligence." In other words, without enough data, models are like underfed children who can't run fast or grow strong.

While the industry cheered Generalist's 270,000-hour real-robot dataset, which was thought to have touched the so-called scaling law, Wang Zhongyuan frankly told the Embodied Learning Society that "hundreds of thousands of hours of data still cannot be called a large amount of data, and it is far from the ChatGPT moment."

Image source: Zhipu Institute

Besides "quantity", there is a pursuit of "quality". The view that "high-quality data is more valuable than a large amount of low-quality data" has gradually become mainstream. Wang Qian believes that although data is very important, it is not simply a matter of "the more, the better".

In fact, language models have already shown that simply increasing the data scale may not bring the best results; high-quality, efficient data is the decisive factor. He believes that in embodied intelligence, data quality can create a larger performance gap than the total amount of data. Real-robot data, at the top of the data pyramid, may be small in quantity, but it is likely to be the foundational layer, or the key factor that carries the industry through challenges that simulation and video data cannot address.

Finally, there is a need for more "variety". The demand for multimodal data is becoming increasingly urgent. As the application scenarios of robots expand, single-type data can no longer meet requirements. For example, in a home-service scenario, robots need to process multi-dimensional information such as vision (object recognition), hearing (command understanding), touch (sensing object softness), and force feedback (controlling action strength) simultaneously. Currently, the so-called multimodal capabilities in the industry mostly inherit the vision and language capabilities of large base models, while modalities such as touch and force feedback from real-world physical interaction are rarely covered.

This growing demand for data variety also makes the industry realize that future data collection should not only record "what the robot has done", but also record "what has happened in the environment", "what feedback there is from the interaction", and "what humans need". Only in this way can models better understand the physical world and human needs.

In the early days of technology, there were always those who tried to find the one and only correct path, hoping to pierce through the fog with a single bet. However, the complexity of embodied intelligence is reminding the industry that true intelligence doesn't grow from a single path; instead, it is "sculpted" through countless trials and errors, conflicts, and reconciliations. Imperfect models, incomplete data, and non-unified architectures may sound like flaws, but they are precisely where the most genuine vitality of embodied intelligence lies.

This article is from the WeChat official account "Embodied Learning Society". The author is Peng Kunfang, and the editor is Lü Xinyi. It is published by 36Kr with permission.