
How can embodied intelligence reach the "ChatGPT moment"? The director of BAAI, a Tsinghua professor, and three founders had a chat.

Fu Chong, 2026-02-13 18:49
Before pursuing generalization, achieve closed-loop verification in a single scenario and build a real-device data flywheel.

Text | Fu Chong

Editor | Su Jianxun

Embodied intelligence is waiting for its "ChatGPT moment". However, there is still a lack of consensus in the industry regarding the specific definition of this moment.

Recently, at the round-table forum of Yuanli Lingji's technology open day, five front-line practitioners from industry, academia, and research discussed this question and shared their views. They are:

Wang Yu, a tenured professor in the Department of Electronic Engineering at Tsinghua University

Wang Zhongyuan, the dean of the Beijing Academy of Artificial Intelligence

Jiang Daxin, the founder & CEO of Jieyue Xingchen

Gao Jiyang, the founder & CEO of Xinghai Tu

Tang Wenbin, the co-founder & CEO of Yuanli Lingji

Jiang Daxin, the founder & CEO of Jieyue Xingchen, first proposed that the defining standard of the "ChatGPT moment" is "zero-shot generalization": even when given instructions it has never seen before, the AI can answer questions and complete tasks. This is exactly the ability demonstrated by large language models.

However, Jiang Daxin immediately pointed out that since the generalization of embodied intelligence involves more dimensions such as scenarios, tasks, and objects to be manipulated, it is still very difficult for robots to meet this standard.

As the CEO of a robotics startup, Gao Jiyang further explained why commercializing embodied intelligence is difficult: For large language models, "the model is the product", with mobile phones and computers as terminals and the Internet as the distribution channel. In contrast, embodied intelligence has to pass through a longer industrial chain, including the whole machine, the supply chain, real-machine data, and offline delivery, all of which are indispensable.

Given these unsolved problems, Tang Wenbin, the co-founder & CEO of Yuanli Lingji, proposed a more achievable "ChatGPT moment for embodied intelligence" for the present: First, solve all the problems in a closed loop within a limited scenario and ensure that the ROI is favorable.

His reason is simple: The ChatGPT moment showed people that language models are usable as tools. For the same change to occur in embodied intelligence, it also needs to go from being a toy and a research project to being something useful.

Therefore, the round-table forum reached a preliminary consensus on "the current development direction of embodied intelligence": Before pursuing stronger generalization, first make one vertical scenario work, let the robot build a real-machine data flywheel in actual operation, and then feed that data back into model and system iterations.

This approach also explains the path chosen by Yuanli Lingji, the organizer of this round-table forum: Before the data flywheel starts turning, there needs to be a unified standard for evaluating performance on real machines. Therefore, before releasing its own model and robot hardware, Yuanli Lingji jointly launched the real-machine evaluation benchmark "RoboChallenge" with HuggingFace.

Yuanli Lingji was founded in March 2025. Its founder, Tang Wenbin, was a co-founder of Megvii Technology, and the company's core founding team also includes several former core members of Megvii Technology. In less than a year since its establishment, Yuanli Lingji has raised nearly 1 billion yuan in total, and its shareholders include institutions such as Alibaba, NIO Capital, and Lenovo Capital and Incubation Group.

On February 10, this investor-favored startup released DM0, its first model since its founding, which topped the RoboChallenge leaderboard with 2.4B parameters. Questions naturally followed: "Can the party that initiates the evaluation also be a participant?" At the round-table forum, Tang Wenbin responded one by one to the reasoning behind releasing the benchmark before the model, the importance of real-machine evaluation, and the questions from the industry.

The following is the content of this round-table dialogue, edited by the author:

△ Guests at the round-table forum. Photo: Yuanli Lingji

Host: From a global perspective, what are the mainstream technical routes for embodied intelligence models, and what stage are we at now?

Wang Zhongyuan: Behind the popularity of embodied intelligence, I see quite a few concerns. Although the hardware itself is progressing rapidly, a series of problems remain to be solved in continuous and stable operation, safety, battery life, and so on.

In terms of models, although a series of embodied models have been released in the past year, we believe that we are still far from the ChatGPT moment for embodied intelligence. Especially after embodied intelligence models and hardware are deployed on real machines, we find that there is still a relatively large gap between the current situation and the large-scale application we truly hope for.

Currently, the technical routes for embodied models are still evolving. Commonly discussed ones include the modular approach, such as a VLM plus control, the end-to-end VLA, and the world model, which is very popular in current research. However, I believe that none of these has reached the stage where we can proudly say that embodied intelligence has achieved a complete breakthrough.

Therefore, it is very likely that in the future we will see one scenario after another being solved through VLA + reinforcement learning. First, put the robots to work, accumulate more data on real machines, form a closed data loop, and finally solve the problem of generalization.

Wang Yu: I focus more on hardware, including computing power, frameworks, edge devices, and infrastructure. From my perspective, although current robot applications have made great progress, they are still largely confined to a workbench. It is still quite difficult to coordinate the "brain" and "body" to complete a slightly longer task, especially when it involves multiple modalities.

Our group often discusses to what extent embodied robots should be able to perform tasks. For example, for the task of tidying up a room, it is not just about folding clothes. The robot needs to observe the overall state of the room, figure out how it should be tidied up, and then start working bit by bit until the whole room is clean. This is a very difficult problem.

Of course, the model definitely needs to make breakthroughs. But I'm also wondering: if we want robots to complete such complex tasks, will the room itself need to change as well? I come from a hardware background, so I sometimes think that when we build a house, its design should be adapted to a future life with robots, whereas until now it has only been adapted to human life. Just as with vehicle-road collaboration, we can also have infrastructure that assists robots.

 

Host: Professor Wang mentioned that next-generation housing standards may need to incorporate the robot dimension. Since we are talking about the infrastructure level, what do you think are the advantages and disadvantages of China and Silicon Valley in the field of embodied intelligence?

Wang Yu: The United States started earlier in terms of models, data, etc., and has made some investments and breakthroughs in applications. However, when it comes to actual implementation, I still firmly believe that China can catch up quickly. Especially now, China has made stronger investments in the field of embodied intelligence than the United States.

Some people say that embodied intelligence is a bubble. Personally, I think it's a good thing that China has a greater investment intensity in this direction than the United States. Because China has a complete industrial chain and supply chain. If we open up more applications and increase investment in models and applications, it is possible to achieve faster breakthroughs in the field of embodied intelligence than the United States.

In addition, I think the interaction between academia and industry in China is gradually increasing, as my sitting here today shows. When industry encounters problems, it now engages with research institutions. It's no longer the case that professors sit in the office reading papers and doing research. I personally think this kind of interaction is gradually becoming similar to that in the United States, that is, the integration of industry, academia, and research to promote the development of embodied intelligence.

 

Host: We've noticed a phenomenon. The Super Bowl, known as the "American Spring Festival Gala", featured a lot of LLM advertising. But at our country's Spring Festival Gala, most of the performers on stage were robots. Dean Wang, do you have anything to say about this topic?

Wang Zhongyuan: I'd like to share two little stories I heard.

The first one was told to me by an investor. Investors in the field of embodied intelligence in the United States often check if there are Chinese people in the startup team. They believe that having Chinese people can ensure the success of the startup's embodied intelligence project.

The other story is that when we were doing the iteration of the embodied intelligence model, a very painful problem was that the hardware was often damaged. After the hardware was damaged, it usually took about two weeks to repair it. But we heard that in the United States, it takes three months to repair damaged robot hardware. This made us feel much better.

So on the one hand, we can see that China indeed has an advantage in manufacturing, which is also an advantage for us in the field of embodied intelligence. On the other hand, the entire industry is still in its early stage, and everyone is in the stage of rapid development and iteration. So it's far from the time to distinguish who is better or worse.

 

Host: We've talked about the "Chinese involvement" index in US embodied intelligence startups. From the perspective of the entire AI industry, an important milestone is the "ChatGPT moment". So what do you think the "ChatGPT moment for embodied intelligence" is? Mr. Jiang Daxin from Jieyue Xingchen, you should have a deeper understanding of the "ChatGPT moment".

Jiang Daxin: First, let me talk about how to define the "ChatGPT moment". I think its most iconic feature is "zero-shot". Zero-shot generalization means that when given any instruction, even one it has never seen before, the AI can answer the question. This is completely different from earlier natural language processing, which is why the "ChatGPT moment" made everyone so excited.

However, if we compare natural language with embodied intelligence, I think the "ChatGPT moment for embodied intelligence" will be more difficult.

First of all, in terms of how the problem itself is defined, the generalization of embodied intelligence can be measured along different dimensions. Because people focus on different dimensions, there is no consensus on the "ChatGPT moment for embodied intelligence".

The first dimension is scenario generalization: whether the scenario is closed, semi-closed, or fully open. The second dimension is tasks, such as navigation tasks, grasping tasks, or housework. The third dimension is the generalization of targets. Even for a simple grasping action, the objects to be grasped can be divided into rigid and flexible ones.

Secondly, from a technical point of view, embodied intelligence involves computer vision, where there is still no consensus on some very fundamental issues: for example, how to encode vision, how to perform self-supervised pre-training, and how to reason in 3D space. I think these issues may still require some breakthroughs before we can reach the ChatGPT moment.

 

Host: For the "ChatGPT moment for embodied intelligence", the definition is crucial. So how do the two guests who are specifically engaged in embodied intelligence define the "ChatGPT moment for embodied intelligence"?

Gao Jiyang: This question is really worth discussing. I think there is a more fundamental issue: although both the embodied intelligence and language model industries are built on innovations and breakthroughs in AI technology, as industries they are quite different.

The chain from the generation of embodied intelligence technology to product planning and then to commercial implementation is longer. It involves the upstream and downstream parts supply chain and data, and the data for embodied intelligence did not exist before. Then, algorithms need to be developed. After that, we will also find that the channels and terminals are different from those of large language models. The terminals of large language models are mobile phones and computers, and the channels are social media dissemination.

So you will find that in the entire industrial chain, the rarest and the only missing link for large language models is the model itself. So the model is the product. Once the model is good, the entire commercialization and industrialization chain will be in place.

For embodied intelligence, the supply chain and parts in the above-mentioned links are still very immature. Without the whole machine, there will be no good real-machine data. The terminal of embodied intelligence is the robot itself, which also involves the establishment of offline channels.

Going back to the previous question, in my opinion, from the perspective of the business line, the definition of the "ChatGPT moment for embodied intelligence" should be the moment when we really see its commercial value within certain limits.

I think 2026 will be a year of change. After two years of preparation, the whole machine and the supply chain have changed a lot. We also have a lot of data. At the model and algorithm level, the introduction of reinforcement learning in post-training, VLA in pre-training, and more recently the world model have brought many new changes to the generalization ability of pre-training and the success rate of post-training.

So I think this year is the year for applications to form a closed loop. In the first half of 2025, we clearly saw intelligence begin to develop, and in the second half of 2025 that development accelerated significantly. The number of open-source models in the open-source community can serve as a key indicator.

2026 will be a year of explosion in intelligence. That explosion will certainly spill over into applications in certain fields, and it will be accompanied by progress in the supply chain and the whole machine. Here China in particular is significantly stronger than the United States, with cycles 5 to 10 times faster and costs 5 to 10 times lower.

Tang Wenbin: I think the requirements for the "ChatGPT moment" mentioned by Jiang Daxin are quite high. It's actually the AGI moment. Today, let's think about what the biggest shock that ChatGPT brought to us was. We once regarded it as a toy, but at that moment, we considered it a tool, something that could be used.

So in my mind, the definition of the "ChatGPT moment for embodied intelligence" is the moment when it becomes useful and reliable. This also goes back to what our company's mission aims to achieve.

Our definition of "useful" is very simple. It can be in a limited scenario, but it must truly solve all problems in a closed loop and deliver a positive ROI. Only when the ROI is clear can it be deployed at scale.

When it meets this definition of "useful", we truly transform a toy or a research project into a tool. At this time, I think it is the "ChatGPT moment for embodied intelligence". I think the progress of the model's ability is indeed very significant, so this moment is not far away.

Of course, after the ChatGPT moment, there will be the DeepSeek moment, which means when it can go mainstream. Today, embodied intelligent robots can work in warehouses and factories, but ordinary people may not be able to feel their presence. Maybe the DeepSeek moment means that the whole public can feel their impact. The moment when it can move from industrial logistics to commercial use and to the consumer market will be a bit later, but I don't think it will be too far away.

 

Host: During the Megvii period, the core founding team of Yuanli Lingji experienced the AI 1.0 era. Now, in the era of embodied intelligence, you didn't release a model at the beginning, but instead released a benchmark, RoboChallenge, first. So what was your thinking behind this?

Tang Wenbin: A model is an end product, the result of progress in algorithms, architectures, and data. Currently, the entire technical infrastructure is severely lacking, whether it's data, the usable hardware mentioned by Dean Wang Zhongyuan, or the evaluation criteria.

In today's embodied intelligence industry, all of us who work on algorithms know that if you don't know how to evaluate something, you definitely can't make it progress. Currently, the evaluation benchmarks we can use are LIBERO, SimplerEnv, and RoboTwin, but their scales are very small. Many benchmarks are nearly saturated, but does a score of 99+ points represent the current real ability? Obviously not.

So we think there is a great need for real-world, large-scale, real-machine evaluations based