The "Four Berkeley Guys" made a rare appearance on the same stage. We've compiled the most high - profile embodied forum at WAIC.
Text by | Fu Chong
Edited by | Su Jianxun
During the 2025 World Artificial Intelligence Conference (WAIC), the most "eye-catching" embodied intelligence forum was undoubtedly the themed event of the "Artificial Intelligence Interdisciplinary Science Forum" hosted by the Shanghai Qi Zhi Institute.
This forum brought together the so-called "Four Berkeley Scholars" of embodied intelligence in China today: Wu Yi, Gao Yang, Xu Huazhe, and Chen Jianyu. All four graduated from the University of California, Berkeley, and all now work on embodied robots.
Among them, Chen Jianyu founded Xingdong Jiyuan; Gao Yang co-founded Qianxun Intelligence; Xu Huazhe co-founded Xinghaitu; and Wu Yi serves as chief scientist of the Reinforcement Learning Laboratory at Ant Group.
(Click on "Xingdong Jiyuan" and "Qianxun Intelligence" to view our previous reports.)
With the four sharing a stage for the first time, their talks naturally revolved around several core questions in embodied intelligence:
Acquiring data is the bottleneck of embodied intelligence. How can this hard problem be solved?
From simple tasks (picking things up and putting them down) to complex ones (tidying a room), how should robots improve, from the "brain" to the physical body?
And within the now-established "VLA" approach, what non-consensus methodologies exist?
Besides being founders or scientists at large companies, Wu Yi, Gao Yang, Xu Huazhe, and Chen Jianyu all serve as PIs (Principal Investigators) at the Shanghai Qi Zhi Institute.
Yao Qizhi (Andrew Chi-Chih Yao) is a Turing Award laureate and dean of the Institute for Interdisciplinary Information Sciences at Tsinghua University. In 2005 he founded Tsinghua's Computer Science Experimental Class (the "Yao Class"), which is famous for cultivating world-class computer science talent. The Shanghai Qi Zhi Institute was established in 2020, with Yao Qizhi as its dean.
Yao Qizhi, dean of the Shanghai Qi Zhi Institute and of the Institute for Interdisciplinary Information Sciences at Tsinghua University, delivering a speech; Photo: Shanghai Qi Zhi Institute
The following viewpoints are drawn from the speeches of Chen Jianyu, Gao Yang, Wu Yi, and Xu Huazhe at the "Artificial Intelligence Interdisciplinary Science Forum", as summarized and edited by "Intelligent Emergence":
Chen Jianyu: To obtain the highest-quality data, embodied intelligence needs to learn from humans
I envision a future world with robots, and I believe this vision will be realized in three stages.
In the first stage, robots will enter our production systems and make the things we need in daily life, such as mobile phones and cars. This could account for more than half of current GDP.
In the second stage, robots will become the largest category of terminal devices and will be able to manufacture themselves.
In the third stage, robots will help humans expand the boundaries of their capabilities, as in the Mars colonization plan proposed by Elon Musk. In the long-term future, robots could even spread throughout the universe.
To achieve this, I believe the shortest path is to learn directly from human experience and data, since humans are currently the only general-purpose intelligent agents in the world.
The main bottleneck of embodied intelligence lies in making data and models more efficient. Building humanoid robots makes it easier for robots to learn the way humans learn.
Chen Jianyu and the "Embodied Intelligence Data Pyramid" he shared; Photo: Shanghai Qi Zhi Institute
Embodied intelligence has a data pyramid model that lays out where its training data comes from.
The top of the pyramid is data collected through teleoperation, which amounts to less than 10,000 hours. By comparison, if we convert the data used to train language models into hours, it comes to roughly 10^9 hours. Teleoperation alone therefore cannot supply the volume of data embodied intelligence requires.
The actual data volume required to train embodied intelligence is even larger than that needed for language models. Therefore, we must use human behavior data, which is the middle layer of the embodied intelligence training data pyramid.
We can collect first-person human data through devices such as VR glasses and smart glasses.
The bottom of the pyramid is what we call "all the data generated in the human world", that is, data at internet scale, such as video websites. The total duration of all videos on YouTube is currently estimated at roughly 10^11 hours. This kind of data is readily available and extremely diverse.
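For a sense of scale, the gap between these tiers can be made concrete with a quick back-of-envelope calculation using only the rough figures quoted above (an illustrative sketch, not something from the talk itself):

```python
# Back-of-envelope comparison of the data tiers cited above, all in hours.
teleoperation_hours = 1e4      # top of the pyramid: < 10,000 hours of teleoperation data
language_model_hours = 1e9     # LLM training corpora, converted into hours
internet_video_hours = 1e11    # estimated total duration of YouTube video

print(f"language data vs. teleoperation: {language_model_hours / teleoperation_hours:.0e}x")
print(f"internet video vs. teleoperation: {internet_video_hours / teleoperation_hours:.0e}x")
```

By these figures, teleoperation data falls roughly five orders of magnitude short of language-model-scale data and seven short of internet video, which is why the lower tiers of the pyramid matter so much.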
Indeed, in many cases, we can use simulation. However, simulation has a fatal problem: there are no embodied intelligent agents like humans in the simulation to generate data.
Almost all intelligent behavior, and the data it produces, is generated by humans. If a simulation could construct such an intelligent agent, that would mean we had already built the "real thing". So this is a chicken-and-egg problem. Simulation can essentially only produce relatively passive physical-interaction data.
Therefore, we should build humanoid robots that directly match human physical capabilities. For example, the newly released Xingdong L7 from Xingdong Jiyuan is 1.7 meters tall, close to human height. It also has human-like arms, a waist, a head, and legs, which makes it better able to collect and use diverse human data.
Some people worry that bipedal robots will cost more. I don't think we need to worry much about this. For general-purpose robots, the most important factor in bringing down the price is scaling up production, not simply reducing the number of degrees of freedom.
General-purpose humanoid robots have more application scenarios, so as production scales up, their cost will drop significantly. Specialized or simply shaped robots, by contrast, have limited scenarios to expand into, which restricts their scale, so their cost reduction is also limited.
Next, let's talk about how to build the model. The current mainstream VLA (Vision-Language-Action) model has problems because, in essence, it is pure behavior cloning.
The first problem is that the model can only clone from a large amount of human behavior data and lacks the ability to extrapolate. This also leads to the second problem: it is difficult for robots to surpass human performance.
Therefore, embodied intelligence should take its cue from how humans learn.
First, model the entire world and form a cognitive understanding of the physical world, similar to what we call a "world model". When we drive up to an intersection, we slow down: even without being taught from large amounts of data, humans know to avoid hitting someone who suddenly rushes out.
Second, learn "reinforcement learning" from humans. For example, when learning table tennis, a coach's hands - on teaching is a "imitative learning" paradigm. However, this is not enough for a person to master such a high - difficulty skill. Therefore, one needs to adjust their posture based on the hitting situation during self - training to achieve the desired effect. This is "reinforcement learning".
So our approach is to combine the understanding ability of a VLM with the generative ability of a world model in a single unified model, and apply it to embodied intelligence.
Our first exploration in this direction is the PID model, which integrates a world model: the same model both makes predictions and generates behaviors. The closest existing tool is a Sora-like model based on diffusion video generation, because it can generate very detailed depictions of behaviors and environments in the physical world.
Building on the Diffusion Policy, we also have tools for generating behaviors well. In this way, embodied intelligence can make predictions about vision and other modalities. We then proposed the "Video Prediction Policy", which further expands our data: we pre-train on large amounts of internet and video data to further improve generalization.
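To make the "predict, then act" idea concrete, here is a minimal sketch of a diffusion-style action head conditioned on a predicted future, with random matrices standing in for the real video-prediction and denoising networks. It only illustrates the structure of the loop and is not Chen Jianyu's actual PID or Video Prediction Policy code:

```python
import numpy as np

# Illustrative stand-ins: a "world model" that summarizes the predicted future,
# and a denoiser that refines an action chunk conditioned on that prediction.
rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, HORIZON, N_STEPS = 64, 7, 16, 20

W_pred = rng.normal(scale=0.1, size=(OBS_DIM, OBS_DIM))               # toy prediction model
W_denoise = rng.normal(scale=0.1, size=(OBS_DIM + ACT_DIM, ACT_DIM))  # toy action denoiser

def predict_future(obs):
    """World-model role: summarize what the scene is expected to do next."""
    return np.tanh(obs @ W_pred)

def denoise_step(actions, future, step_size):
    """One denoising update of the action chunk, conditioned on the prediction."""
    cond = np.concatenate([np.broadcast_to(future, (HORIZON, OBS_DIM)), actions], axis=1)
    return actions + step_size * np.tanh(cond @ W_denoise)

obs = rng.normal(size=OBS_DIM)                 # current visual embedding (toy)
future = predict_future(obs)                   # 1) predict the future
actions = rng.normal(size=(HORIZON, ACT_DIM))  # 2) start the action chunk from noise...
for _ in range(N_STEPS):                       # ...and iteratively denoise it
    actions = denoise_step(actions, future, step_size=1.0 / N_STEPS)

print("denoised action chunk:", actions.shape)
```

The point of the sketch is only the ordering: prediction about the world comes first, and action generation is conditioned on it rather than cloned directly from demonstrations.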
Ultimately, we hope to bring this model technology and data into real life through robots of different forms. With this series of techniques, robots can perform highly dynamic whole-body movements such as dancing, and can also complete manipulation tasks such as logistics sorting.
Gao Yang: Let robots think both "fast" and "slow"
Gao Yang, co-founder of Qianxun Intelligence; Photo: Shanghai Qi Zhi Institute
The success of models such as ChatGPT rests on vast amounts of data, but the data available for robots is very scarce. The largest publicly available dataset today has fewer than one million trajectories, several orders of magnitude less than the text and image-text data on the internet.
The core problem is how to solve the data bottleneck in embodied intelligence. I believe the most important way is the "data pyramid". That is, we should use data of different qualities and from different sources to increase the data volume.
Professor Chen Jianyu also mentioned the embodied intelligence data pyramid just now. I divide embodied intelligence data into three layers: the bottom layer is large amounts of internet video; the middle layer is human operation data; and the top layer is reinforcement-learning data, collected by letting a robot that has already learned a skill interact with the environment to refine that skill toward a success rate above 99%.
What I want to stress today is that, beyond the data pyramid, once data has been acquired we also need to improve both perception at the hardware level and the model architecture.
At the perception level, current VLA models have only vision. For humans, though, touch is a very important modality. When inserting a USB flash drive, for example, a person doesn't need to keep looking at the port; a robot that has to stare at it to finish the task ends up in a very awkward posture.
The currently proposed "TactileVLA" concept adds touch to the VLA. Another example is when a robot is cleaning a blackboard. If it fails to clean it thoroughly on the first attempt, it will use the VLM to think about whether the writing on the blackboard is particularly stubborn and needs to be scrubbed again with more force.
Through the process of tactile input, tactile output, and tactile feedback, touch can be well integrated into the VLA model.
With touch, when an embodied agent picks up different objects, it can grasp them better using pre-trained knowledge: the force needed to pick up a piece of fruit, for example, is different from the force needed to pick up a block of iron.
We can also make more accurate judgments in the blackboard-cleaning task by drawing on tactile signals such as friction.
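As a rough illustration of how touch could enter such a loop, the sketch below fuses a made-up friction reading into the policy input and retries with more force while residue is detected. The readings and thresholds are invented for illustration and do not come from Tactile-VLA:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(vision_feat, tactile_feat):
    """Build one policy input from both modalities."""
    return np.concatenate([vision_feat, tactile_feat])

def residue_remains(friction_reading, threshold=0.6):
    """Toy check: high measured friction suggests stubborn chalk residue."""
    return friction_reading > threshold

force = 1.0
for attempt in range(1, 4):
    vision = rng.normal(size=64)          # stand-in visual embedding
    friction = rng.uniform(0.3, 0.9)      # stand-in tactile (friction) reading
    obs = fuse(vision, np.array([friction, force]))   # what the policy would see
    if not residue_remains(friction):
        print(f"board clean after attempt {attempt}")
        break
    force *= 1.5                          # press harder on the next pass
else:
    print("still dirty after 3 attempts")
```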
After obtaining large amounts of data through the data pyramid, we also need a good model architecture so the robot can learn the right knowledge from that data, just as the Transformer architecture does for large language models.
When we want a robot to mix a vodka cocktail while facing a large number of bottles and jars, embodied intelligence needs to decompose the task into several executable atomic actions. If we rely only on the VLA's reactive, so-called System 1 thinking (the brain's faster, more intuitive mode of processing and decision-making), the success rate will be very low.
We proposed OneTwoVLA, a model that combines System 1 and System 2 (the brain's slower, more deliberate mode of thinking). Given a task, the model decides on its own whether the situation calls for analysis or simply for completing the current action path.
Take, for example, a hot-pot robot facing a variety of ingredients. If you ask it to cook beef, it cooks the beef. If you ask it to cook vegetables, it notices that several kinds of vegetables are in front of it and stops to ask the user which one to cook. With this model, tasks can be decomposed at the architectural level for better results.
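The routing behavior can be pictured with a small toy sketch. Here the System 1 / System 2 decision is hand-coded purely for illustration, whereas OneTwoVLA is described as making this decision autonomously from the observation and the task:

```python
from dataclasses import dataclass

@dataclass
class Situation:
    instruction: str
    ambiguous: bool            # e.g. "cook vegetables" with several vegetables present
    needs_decomposition: bool  # e.g. "make a vodka cocktail" among many bottles

def route(s: Situation) -> str:
    """Hand-written stand-in for the learned System-1/System-2 switch."""
    if s.ambiguous:
        return "System 2: stop and ask the user to disambiguate"
    if s.needs_decomposition:
        return "System 2: reason, break the task into atomic actions, then act"
    return "System 1: directly execute the current action path"

for s in [
    Situation("cook beef", ambiguous=False, needs_decomposition=False),
    Situation("cook vegetables", ambiguous=True, needs_decomposition=False),
    Situation("make a vodka cocktail", ambiguous=False, needs_decomposition=True),
]:
    print(f"{s.instruction!r} -> {route(s)}")
```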
Wu Yi: The future of embodied intelligence is not a single agent, but multi-agent systems
Wu Yi, the chief scientist of the Reinforcement Learning Laboratory at Ant Group; Photo: Shanghai Qi Zhi Institute
Our ultimate goal is to let robots enter every household and perform very complex tasks.
However, even if we realize all of today's technologies, we may still not reach this vision. So what are we missing along the way?
When ChatGPT appeared in 2022, large models could only answer questions passively, following human instructions. By 2025, agents had emerged that can tackle very complex, high-level, abstract questions and proactively carry out a great deal of work. In just three years, large language models have developed extremely fast.
I think robotics will go through a similar process. One day, when I give an abstract instruction such as "clean the room", the robot will call tools on its own to complete the task. That is an embodied agent: it works like a software agent but has a physical body.
We can also take inspiration for embodied agents from how software agents are built.
An AGI agent needs three abilities: planning, memory, and tool use. We hope an embodied agent has these three abilities as well.
A software agent is a function-calling (tool-calling) agent. Similarly, an embodied agent can call different functions. Concretely, the embodied agent first reasons logically, then writes code, and then executes that code.
Imagine a four-legged robotic dog at home. We want it to turn off the light, but the switch is a bit too high for it to reach, so it needs to step onto a box to complete the action.
While interacting with the physical world, the robotic dog may find that standing on the first box still doesn't get it up to the switch. At that point, all the code after the failure is useless. The large model starts thinking again from that point, writes a new piece of code to find a box of the right height, and the robotic dog executes the new code.
In this process, a software agent does the execution while a hardware body interacts with the real world.
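The loop Wu Yi describes can be sketched roughly as follows. plan_with_llm, the skill strings, and the failure condition are all hypothetical stand-ins for illustration, not a real robot or model API:

```python
def plan_with_llm(goal, failed_step=None):
    """Stand-in for the large model: return a list of skill calls as strings."""
    if failed_step is None:
        return ["walk_to('box_small')", "climb('box_small')", "press('light_switch')"]
    # Re-plan only from the failed step onward, e.g. look for a taller box.
    return ["walk_to('box_tall')", "climb('box_tall')", "press('light_switch')"]

def execute(step):
    """Stand-in for the robot acting in the physical world."""
    return step != "climb('box_small')"   # toy failure: the small box is too low

goal = "turn off the light"
plan = plan_with_llm(goal)
completed = []
while plan:
    step = plan.pop(0)
    if execute(step):
        completed.append(step)
    else:
        print(f"failed at {step}; everything planned after this point is discarded")
        plan = plan_with_llm(goal, failed_step=step)   # think again from the failure
print("completed:", completed)
```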
In short, just as large models evolved from ChatGPT to agents, we hope embodied intelligence will evolve from robots to embodied agents.
Looking further ahead, we hope there will not be just one embodied agent but many embodied agents interacting with one another, which is the idea of multi-agent systems. Think of a robotic-dog football team, where multiple robotic dogs play football together, competing and cooperating. There can also be similar interactions between humans and robotic dogs.
Finally, looking forward to the future, I think the future