
8 top figures in embodied intelligence discuss "non-consensus": data, world models, and how to spend money

Fu Chong | 2025-11-24 08:54
Even among the top practitioners in the country, non-consensus still exists. Different answers reflect the "first principles" and strategic priorities in the minds of each entrepreneur.

Text by | Fu Chong

Edited by | Su Jianxun

"If your company were given 10 billion yuan to promote the development of embodied intelligence, how would you spend this money?"

At the round-table forum of the 2025 Zhipu Embodied Open Day, held on November 20, the host posed this open-ended question.

The guests facing this question came from eight leading domestic companies and institutions in the embodied-intelligence industry:

Wang Zhongyuan, Dean of Zhipu Research Institute

Luo Jianlan, Partner and Chief Scientist of ZHIYUAN ROBOTICS

Wang He, Assistant Professor at Peking University and Founder of Galaxy Universal

Zhao Xing, Assistant Professor at the Institute for Interdisciplinary Information Sciences of Tsinghua University and Co-Founder of Xinghai Chart

Cheng Hao, Founder and CEO of Accelerated Evolution

Wang Qian, Founder and CEO of Independent Variable

Zhang Jiaxing, Chief AI Scientist of China Merchants Group

Zhao Dongbin, Professor at the University of Chinese Academy of Sciences

"I think 10 billion yuan is not enough," responded Cheng Hao, founder and CEO of Accelerated Evolution, with a smile, drawing knowing laughter from the audience. "If there's only 10 billion, I'd find more friends to promote the embodied industry together. For example, I'd invest the money in Zhipu Research Institute."

Luo Jianlan, a partner at ZHIYUAN ROBOTICS, would rather use the money to solve the current data bottleneck: "I'd build the world's largest self-evolving, closed-loop data flywheel. 10 billion yuan can be seen as a lot, or as not much. But so far, no individual or institution has spent 10 billion to do this."

Beyond "how to spend the money", the eight guests also discussed topics of broad industry concern, such as world models, and offered critiques of, and ideas for improving, the currently mainstream VLA paradigm.

To sharpen the collision of views, the forum included a "sign-holding" session: guests held up signs numbered 1, 2, or 3 to indicate agreement, neutrality, or disagreement.

Judging from the sign-holding results, non-consensus persists even among top domestic practitioners. The clearest divergence concerns how to solve the problem of "data scarcity".

Zhao Xing, co-founder of Xinghai Chart, and Zhang Jiaxing, chief AI scientist of China Merchants Group, advocate the importance of real-world physical data; Wang He, founder of Galaxy Universal, emphasizes that where real data is difficult to collect, synthetic data will play an important role.

Wang Qian, founder and CEO of Independent Variable, believes that fused data can be used, but appropriate data sources should be selected according to different tasks.

How should data be selected and combined to achieve a leap from quantitative to qualitative change? The different answers reflect the "first principles" and strategic priorities of each entrepreneur.

The following are highlights selected by Intelligent Emergence from the forum, with the dialogue edited by the author:

△ Round-table forum of the "Embodied Model Salon" at the 2025 Zhipu Embodied Open Day. Image source: Zhipu Research Institute

Host: Do you think the world model will be the key technology for embodied intelligence?

Wang He (Agree): I can only say that it depends on the definition of the world model. After a series of video-generation models such as Sora, the original definition of the world model in reinforcement learning has become increasingly blurred.

A current mainstream idea is to let robots learn from videos of human behavior. But there is a fundamental problem here: the physical structure of robots is very different from that of humans. Whether it's a wheeled chassis or a pair of arms, their dexterity and range of motion differ from a human's.

Therefore, even if the model can generate realistic human actions, this kind of data is of limited practical help to robots.

However, looking to the future, predictive ability is indispensable for embodied intelligence. Robots must be able to, like humans, infer the current actions to be performed based on future goals and plan their actions.

So the conclusion is: the predictive ability represented by the world model is the core, but its training data must come from the robots themselves. Only through a large amount of robot data can we train an effective world model truly suitable for robots.

Wang Zhongyuan (Agree): The world model definitely plays a role in embodied intelligence, but it is not necessarily the fundamental base for embodied intelligence.

The world model we understand is not just video generation. Although video generation also produces the next-frame image, what humans actually need is for the world model to predict the next spatio-temporal state based on the previous spatio-temporal state.

For example, when I was about to answer just now, I needed to organize my answer based on the host's question and possibly based on Professor Wang He's answer, and then make the decision to pick up the microphone.

Host: In the field of general large models, a unified architecture like Transformer gave rise to the explosion of ChatGPT. However, the models of embodied intelligence have not reached the stage where "one large model dominates all". Currently, there are hierarchical large embodied models, end - to - end VLA, world models, etc.

Do you think the models of embodied intelligence will eventually converge to be dominated by a certain unified architecture?

Zhang Jiaxing (Neutral): I think if embodied intelligence really wants to move forward, the model level cannot follow the path from LLM to VLM in the past three years. Embodied intelligence needs an architecture completely of its own.

Just like human intelligence: there were actions first, then vision, and finally language. The VLA structure inserts language between vision and action, which does not match how humans actually operate.

For example, when we drive a car, we can chat, listen to something, and watch the road at the same time. (Language is not involved in the act of driving itself.) This shows that vision and action are connected, and language does not necessarily need to be involved.

Currently, some leading teams, especially in Silicon Valley, are working on new multimodal large-model architectures. Under these architectures, the original "Language First" state may become "Vision First" or "Vision Action First", which is a breakthrough worth looking forward to.

Zhao Xing (Agree): I strongly agree that we need a foundational model parallel to the large language model.

This foundational model is more likely to be a Large Action Model, and this Large Action Model relies on vision because vision is the most general perceptual sensor information in the world. On this basis, we can then add language.

This is quite similar to the law of biological evolution. In the world, there were mobile animals first, then they had vision, and finally, highly intelligent creatures like humans appeared.

Let me add one more thing. I think the embodied model needs to differ from the large language model in one key respect: I hope it will be a closed-loop model.

The large language model is more of an open-loop system: a question-and-answer machine. You give it a question, it produces an answer, perhaps with some chain of thought in between, and if the answer is correct, the interaction ends.

But embodied intelligence is different. Embodied intelligence doesn't just think through a series of processes and then perform an action. Instead, after performing an action, it immediately gets feedback from the world and then immediately adjusts its actions to perform the next one.

Luo Jianlan (Agree): I think embodied intelligence will ultimately be solved by an integrated system that includes VLA, the world model, and reinforcement learning, rather than relying on a single model.

Let me explain. I strongly agree with what Zhang Jiaxing just said: the current VLA may not be the ultimate paradigm, but I think it will still involve vision, language, and action in the future. That is, the general direction of VLA is correct, even if it may not look like it does now, so I still use the term VLA.

At the same time, it also needs a world model that can reflect, predict, and "imagine" in the latent space. Of course, reinforcement learning is also needed in this system.

After these elements are combined and coordinated with the data flywheel in the real world, embodied intelligence can continuously self - evolve and learn.

Wang Zhongyuan (Agree): First of all, Zhipu Research Institute firmly believes that in the ultimate state, there must be a model with a relatively unified architecture to solve the various problems of embodied intelligence. This is also an important reason why we are investing in multimodal world models.

Of course, the amount of data required for this is obviously extremely large. I even think it may not appear within three or five years.

A better large embodied model may only appear after a large number of robots solve specific problems in real scenarios and accumulate data at the scale of an "embodied-intelligence Internet".

Wang He (Agree): From the perspective of architecture, the Transformer we are talking about today, as a cross-modal attention mechanism, is very general. For example, it can handle text, video, and audio modalities.

However, the problem with embodied intelligence today is that humans have multiple senses: sight, hearing, smell, taste, and touch. Although, from the attention perspective, these senses can all be tokenized and fed into the Transformer, the output doesn't seem quite right.

So if we gradually solve these problems, I think there can be a very unified paradigm in architecture in the future.

But I think the more long-term challenge at present is data. I highly agree with what Dr. Zhongyuan just said. Whether it's a video-generation model or a dialogue model today, they are all essentially built on vast amounts of Internet data.

The problem with researching an Action-First model at present is that there are too few humanoid robots on Earth; their numbers are not enough to support the exploration of an Action-First architecture and model.

So my view is that, in the short term, we should rely on synthetic data to explore this direction, which will be faster than using real data. Use this method to raise the capabilities of embodied intelligence first; then the number of robots can grow, which can give rise to a truly powerful large model.

Cheng Hao (Agree): Since we focus more on motion control, we think about the embodied-intelligence model more from the perspective of robot motion control.

We hope to have an embodied model that can, at any moment, output the actions for the next 100 frames based on the task requirements and the environment. Imagine: the output might look like an animation of the robot's movement.

Once this model works, the implementation of embodied intelligence will be much easier.

Why are we more concerned about the world model? Because its very core is predicting what will happen next, which includes both what the robot actively intends to do and what will unfold in its surroundings.

For example, cooking is a very difficult task, but we can first use some hierarchical methods to let the robot start with simpler tasks like picking up a courier or a box.

However, this full model is very challenging and far from being realized. So in the meantime, we might use hierarchical methods and build agents to begin deployment.

Just as Professor Wang He said, once implemented, the number of robot "citizens" will definitely increase. Because implementation will create value, and then everyone will have the motivation, money, and willingness to build a large number of robot "citizens" and collect a large amount of data.

Then this data will, I think, feed back into the development of the large embodied-intelligence model.

Wang Qian (Agree): I think the mention of Transformer in the question is a bit misleading. Even in language models today, we don't necessarily use the Transformer architecture.

Of course, I understand that this question is discussing whether we will have a complete foundational model similar to GPT back then. From this perspective, I think we will.

We can learn two very important things from language models.

First, data is important, but it's not simply "the more, the better". In the era of language models, we've seen that simply increasing data volume may not bring the best results. High-quality, efficient data is the decisive factor.

So although we also work on synthetic data, we still mainly use real data from the physical world because we believe that in the embodied scenario, data quality can create a greater gap in performance than the total amount of data.

Second, we need to build a Foundation Model. I believe there must be a foundational model for the physical world, parallel to and independent of the one for the virtual world.

The reason is that the characteristics of the virtual world and the physical world are very different. Fine physical processes such as friction, contact, and collision in the physical world are difficult to accurately describe with language or traditional synthetic data. So ultimately, what we need is a foundational model that learns directly from the physical world and can describe all these detailed and complex physical processes.

It should be able to control robots and at the same time be a world model. So, in our practice, concepts like the world model and VLA are not mutually exclusive: the same model can output actions and videos, etc. We regard this as a whole as the "foundational model of the physical world".

The reason for building a general model is that a general model learns the common structure across tasks, that is, some kind of "common sense" or "essential law". In embodied intelligence, this may be Newton's laws and object properties; in language, it's logic and common sense.

On the contrary, I think the question is not whether we should inherit today's multimodal models as the basis for embodied models. Rather, 5 to 10 years from now, the multimodal models born from embodied intelligence may become dominant. That is, models built on data collected from the physical world may, in turn, subsume today's models built mainly on virtual-world data.

This actually conforms to human cognition: the multimodal data we encounter in our lives is far less than the scale of the Internet, yet we can form a strong understanding of the world. One of the key reasons is that embodied intelligence can complete interactive perception and active perception through actions, thus better grasping the laws of the physical world in the dimensions of time and causality.

△ Guests holding up signs at the forum, showing the lack of consensus on embodied intelligence. Image source: Zhipu Research Institute

Host: Several guests just emphasized the importance of data. Can you briefly introduce in one or two sentences what strategies you've adopted to deal with the data bottleneck?

Zhang Jiaxing: Our data concept is, first, to trust the data collected from the real physical world, the importance of which Wang Qian has mentioned many times.

Second, within the entire data pyramid, we focus more on data collected with humans as the subject. This is the cheapest and most abundant data, mainly used for pre-training.

Zhao Xing: We also base our work on real data. And we have three entry points.

The first entry point is authenticity and quality