
Exclusive Interview with Zhao Xing of Xinghaitu: A Popular Demo Doesn't Equal Generalization Ability, and the Outcome of Embodied Intelligence Still Depends on Data Volume

Fu Chong · 2025-08-13 11:33
Over the past ten months, developing the G0 model has been a major task for Zhao Xing and his team. Rather than small models suited only for demonstrations, he wants to build a large embodied intelligence model with genuine generalization ability.

Text by | Fu Chong

Edited by | Su Jianxun

At the bustling 2025 WRC (World Robot Conference), there were numerous cool demos on display. Amidst the flamboyant robot performances, at the booth of Xinghaitu, an embodied intelligence company, a robot was quietly performing the task of making a bed.

Some onlookers showed confused expressions. They couldn't understand why such a seemingly simple task for humans needed to be demonstrated so elaborately.

"Making a bed is a long-horizon task that bundles several difficulties together. It tests the robot's ability to manipulate flexible objects, the model's whole-body control, and the generalization needed to tidy up all kinds of messy beds," Zhao Xing, Chief Scientist of Xinghaitu and an assistant professor at Tsinghua University's Institute for Interdisciplinary Information Sciences, told Intelligence Emergence at the exhibition site.

At this time, the staff randomly messed up the quilt, and a spectator gave the instruction to make the bed. The robot then started working.

Although it seems like a simple task, the robot mobilizes 23 degrees of freedom in its whole body and usually achieves the task in three steps: First, it moves to the optimal working position through its chassis; then it adjusts its torso up and down and tilts it to find a suitable working angle; finally, it grabs the quilt with its mechanical arm, pulls it outwards, and flattens it.

The three steps also affect each other: If the robot doesn't move to the right position at the beginning, it won't be able to grab the quilt; even if it does move to the right position, if the quilt is in the middle of the bed, the robot has to lean its torso forward to "reach" it; when grabbing, since the quilt is heavy and the robot can't just rely on its arm to pull, it also has to move its whole body to flatten the quilt.
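The three-step sequence above can be sketched as a short program. This is a purely illustrative sketch: the class, the function, and all numeric thresholds are invented for this article and are not Xinghaitu's actual control code.

```python
# Hypothetical sketch of the three-step bed-making sequence described above.
# All names (Pose, make_bed) and numbers are invented for illustration.

from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float
    pitch_deg: float = 0.0   # forward tilt of the torso

def make_bed(quilt_xy, bed_edge_x, reach=0.6):
    """Return the three sub-steps as (name, target pose) pairs."""
    steps = []
    # Step 1: drive the chassis to the best working position beside the bed.
    base = Pose(x=bed_edge_x - 0.3, y=quilt_xy[1])
    steps.append(("move_base", base))
    # Step 2: tilt the torso forward if the quilt lies deeper than arm reach.
    depth = quilt_xy[0] - base.x
    tilt = 25.0 if depth > reach else 0.0
    steps.append(("adjust_torso", Pose(base.x, base.y, pitch_deg=tilt)))
    # Step 3: grasp, pull outward, and flatten; a heavy quilt is pulled with
    # whole-body motion (the base retreats while the arm holds the grasp).
    steps.append(("grasp_and_pull", Pose(base.x - 0.2, base.y)))
    return steps

# A quilt in the middle of the bed forces the torso-tilt branch.
steps = make_bed(quilt_xy=(1.5, 0.5), bed_edge_x=1.0)
print([name for name, _ in steps])
# -> ['move_base', 'adjust_torso', 'grasp_and_pull']
```

The point of the sketch is the coupling the article describes: the grasp pose in step 3 only makes sense relative to the base pose chosen in step 1, and the torso tilt in step 2 depends on where the quilt lies.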

Behind this demonstration is G0, the newly released VLA (Vision-Language-Action) end-to-end foundation model from Xinghaitu.

Asked why they trained this model, Zhao Xing explained that the small models Xinghaitu used before were only good for demonstrations and performed poorly at scale. To obtain real generalization ability, they had to develop large models.

The robot bed-making demo displayed by Xinghaitu at the WRC. Photo provided by the interviewee.

Currently, embodied intelligence is still in a "non-consensus stage". The Scaling Law of large language models has been verified: a quantitative change in data can lead to a qualitative change in model capabilities. Whether this law can be replicated in robotics remains an open question.

This is also the reason why Zhao Xing has devoted most of his energy in the past ten months to data engineering.

Data engineering covers training and assessing data collectors, real-machine teleoperation data collection, and downstream processes such as data uploading, cleaning, and annotation. It is a typical "dirty and tiring job". Because the whole process has not yet been standardized, Zhao Xing often handles feedback from front-line data collectors himself, and his workload over the past ten months has been very high.

A person working at Xinghaitu told us, "Teacher Zhao is our overtime buddy. We can often see him working late at night."

In his opinion, a foundation model with generalization ability cannot do without solid real-machine data collection and cleaning. Rather than spending time and energy on "showy" demonstrations, it is better to confront the fundamental problems of embodied intelligence directly.

With the release of G0, Xinghaitu is also about to open-source a 500-hour real-machine dataset collected in open-world, real-life scenarios.

Zhao Xing hopes that opening up the dataset will provide the embodied intelligence industry with a high-quality benchmark dataset and evaluation standard, making it easier for different teams to compare algorithms and verify results on the same data, thereby advancing and accumulating technology.

At the same time, the open-source dataset can significantly shorten the development chain from purchasing a robot to deploying a model, reduce the cost of repeated data collection and annotation, and help universities, research institutes, and enterprises enter the experimentation and iteration stages faster.

In July 2025, Intelligence Emergence exclusively reported that Xinghaitu had successively completed strategic financings in Series A4 and A5. Since its Series A round began in 2025, Xinghaitu has raised nearly 1.5 billion RMB.

During this WRC, we conducted an exclusive interview with Zhao Xing. From both academic and industrial perspectives, he shared his views on popular issues such as the generalization of VLA and the world model. The following content is from the interview, edited by the author.

Zhao Xing, the Chief Scientist of Xinghaitu and an assistant professor at Tsinghua University's Institute for Interdisciplinary Information Sciences. Photo provided by Xinghaitu.

Large models are the foundation of embodied intelligence generalization, and high-quality data matters even more

Intelligence Emergence: During the WRC, Xinghaitu presented an embodied intelligence bed-making demo. Compared with many flamboyant on-site performances, it didn't seem that "fancy". How did you decide on this demonstration?

Zhao Xing: Actually, Xinghaitu is not a company that is very good at making demos. Compared with cool actions, we want to show the progress of intelligence more.

Specifically, Xinghaitu has trained the VLA embodied large model G0, and we are also writing some technical reports. For this, we need to collect data and tune the model in different places, all very down-to-earth work.

So, it was only one or two weeks before the WRC opened that we decided to do the bed-making demo, because making a bed combines several difficulties in one demonstration.

When demonstrating this demo, the user first gives the bed-making instruction to the model through the TV interface; after receiving the instruction, the model observes, understands, and plans the task; while it plans in language, the robot executes synchronously.

At this time, the model will control the 23 degrees of freedom of the robot's whole body, and the actions are achieved in three steps.

First, it moves the chassis; then the torso can be raised and lowered and tilted; finally, it uses the mechanical arm to operate the object.

These three actions actually affect each other. If it doesn't move to the right position at the beginning, it won't be able to grab the quilt; after moving to the right position, if the quilt is in the middle of the bed, the robot's torso has to lean forward to "reach" it; finally, when grabbing, since the quilt is usually heavy and can't be pulled just by the arm, the robot also has to move its whole body to flatten the quilt.

So, this demo was not carefully choreographed, but it is different from other demonstrations. Technically, its whole-body control and flexible-object manipulation are difficult, which shows the ability of our end-to-end VLA large model.

Intelligence Emergence: How does the G0 model perform? What problems does it solve?

Zhao Xing: Based on Xinghaitu's open-scenario dataset and our proposed three-stage VLA training framework (cross-ontology pre-training, single-ontology pre-training, and post-training), the G0 model outperforms PI 0 by about 20% on average. (Author's note: PI 0 is a robot-control VLA model developed by the US embodied intelligence company Physical Intelligence.)

In addition, we found that cross-ontology pre-training based on open-source data performs reasonably well on basic desktop tasks, but poorly on complex whole-body movement control tasks.

Xinghaitu's open dataset fills that gap: after training on it, complex whole-body movement tasks perform better. This improves the effect of cross-ontology pre-training for the industry.
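The three-stage recipe named above can be sketched as a simple staged pipeline. Everything here is an assumption for illustration: the stage order comes from the interview, but the function, the dataset names, and the stand-in "optimization step" are invented.

```python
# Illustrative sketch of the three-stage VLA training framework mentioned in
# the interview: cross-ontology pre-training, then single-ontology
# pre-training, then post-training. Dataset names are hypothetical.

def train(model, stages):
    """Run the stages in order, recording which stage each pass belonged to."""
    log = []
    for name, datasets in stages:
        for ds in datasets:
            model = f"{model}+{ds}"   # stand-in for a real optimization step
        log.append(name)
    return model, log

stages = [
    # Stage 1: pre-train across many robot bodies (ontologies) for breadth.
    ("cross_ontology_pretrain", ["open_source_cross_ontology", "open_500h_real"]),
    # Stage 2: continue pre-training on the target robot body only.
    ("single_ontology_pretrain", ["target_robot_teleop"]),
    # Stage 3: post-train on curated task demonstrations (e.g. bed-making).
    ("post_train", ["bed_making_demos"]),
]

model, log = train("G0_init", stages)
print(log)
# -> ['cross_ontology_pretrain', 'single_ontology_pretrain', 'post_train']
```

The structural point is that each stage starts from the checkpoint the previous stage produced, narrowing from broad cross-robot data down to task-specific demonstrations.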

Intelligence Emergence: What is the background of developing the G0 model?

Zhao Xing: We started training this model in October last year, about one year after Xinghaitu was founded.

From our previous R&D experience, small models can be used for demonstrations, but it is very difficult to apply them at scale. That is why we prefer to develop large models with generalization ability.

Intelligence Emergence: What are the specific difficulties in the generalization of the model?

Zhao Xing: Specifically, there are three aspects.

One is the difference in the objects to be operated. For example, when grabbing from a fruit plate, there are grapes and tomatoes. They differ in texture, color, and softness, and even within the same category, there may be differences in size.

Secondly, there are differences in scenarios and environments. Even when making the same type of milk tea, the surrounding environment is laid out differently in each store, which challenges generalization.

In addition, it also lies in specific tasks and actions. For example, when performing a grabbing action, if there is a very thin piece of paper on the table, it is difficult to grab it all at once. We need to first pick up the edge and then take it. This action is difficult to define in language.

These are problems that algorithms based on programming have not been able to solve well, and they are also bottlenecks that prevent robots from being widely applied in various scenarios.

But for humans, these actions can be achieved subconsciously. So, compared with small models, only large models can achieve such generalization ability, which is also the reason why we develop large models.

Intelligence Emergence: The Scaling Law of large language models emphasizes that a quantitative change in data can lead to a qualitative change in model capabilities. So, do you believe that it can also be replicated in embodied intelligence models?

Zhao Xing: Language models have proven that large models plus large amounts of data can achieve good generalization. I think this is the first principle of AI.

In robotics, we have likewise observed signs of generalization ability. That is why we decided to develop an embodied large model at the end of 2024.

I believe that after gathering the three elements of model structure, algorithm, and data, the embodied intelligence model will also have the same ability as language models.

Our G0 model uses a Transformer-based training method. Although people are not fully satisfied with the Transformer architecture today, and I also expect it to change in the future, its usability remains the strongest in the short term.

The algorithm has the possibility of change, which mainly depends on smart researchers. I think there is no problem in this aspect for our team. We can develop it ourselves or keep up with the latest progress.

Finally, we found that what everyone lacks is data.

Sora amazed people, yet the Diffusion Transformer algorithm and model it used already existed; only the amount of data was larger. This has convinced more people that data is what matters most.

High-quality data is important; for now, we do data engineering ourselves

Intelligence Emergence: So, in fact, in the past ten months, has your work focus been on data?

Zhao Xing: I think so. It mainly lies in pushing the collection of high-quality data forward; after all, we cannot buy ready-made robot data today.

Data collection is different from scientific research. In research, improving an algorithm needs a smart brain; sometimes you can do nothing for a week, then come up with one very good algorithm and get the result.

But data collection is a very basic job that requires perseverance.

The specific work is very complicated. Data collectors will take the robots to different scenarios to collect data, but they need to be trained and tested first to ensure that they can collect high - quality data.

During on-site data collection, a large number of problems need to be solved, such as unexpected issues with the machines and the network, and I coordinate these as well. After collection come the downstream processes: data uploading, cleaning, and annotation.
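The post-collection steps mentioned above can be sketched as a tiny pipeline. This is a minimal sketch under stated assumptions: the episode fields, thresholds, and filtering rules are invented for illustration, not Xinghaitu's actual tooling.

```python
# Minimal sketch of the post-collection pipeline (clean -> annotate).
# Field names ("seconds", "flagged", "instruction") and the 5-second
# threshold are hypothetical.

def clean(episodes, min_seconds=5.0):
    """Drop episodes that are too short or were flagged by the collector."""
    return [e for e in episodes
            if e["seconds"] >= min_seconds and not e.get("flagged", False)]

def annotate(episodes):
    """Ensure every surviving episode carries a language instruction."""
    return [{**e, "instruction": e.get("instruction", "unlabeled")}
            for e in episodes]

raw = [
    {"id": 1, "seconds": 42.0, "instruction": "make the bed"},
    {"id": 2, "seconds": 1.5},                    # too short: dropped
    {"id": 3, "seconds": 30.0, "flagged": True},  # collector flagged: dropped
]

dataset = annotate(clean(raw))
print([e["id"] for e in dataset])   # -> [1]
```

The order matters in practice: cleaning before annotation avoids paying annotation cost on episodes that will be discarded anyway.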

Intelligence Emergence: It seems that data collection work is mostly basic work, or it can be called "dirty and tiring work". Do you have to do it all by yourself?

Zhao Xing: The students on the team are very capable, but the field of embodied intelligence is so new that there is no SOP (standard operating procedure) I can simply hand to them.

As we all know, the data annotation industry in the past was quite mature. We could send all the data to an outsourcing company. After specifying the time, accuracy, etc., we could just wait for acceptance.

But robot data collection depends on hardware, scenarios, and so on, and the whole process is very long. With no ready-made experience to draw on, we can only do it ourselves.

Intelligence Emergence: What kind of dataset can be called high - quality?

Zhao Xing: It should be real and diverse.

For example, when we mess up some objects on a desktop, we have to consider whether the mess is realistic. Many teams and companies are building data collection factories, where the constructed home environment is clean and spotless. But in real environments, things are piled up randomly, which is completely different. So we decided to collect data in real environments.

Secondly, whether each mess differs from the others is the "diversity of mess". Just as training a large language model requires scraping corpora from the entire Internet, training an embodied intelligence model requires collecting all the data we can think of, rather than focusing on a single task.

So, we have defined five types of scenarios: home, hotel, factory and warehouse, supermarket, and restaurant. We choose real scenarios where humans operate a lot to find applications for robots.

Of course, this is a continuous process. There are currently different types of data, such as simulation and real-machine data, and we will invest more resources and energy to find a good "data recipe", the ideal combination ratio.
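One common way to realize a "data recipe" is to sample training batches from several sources with fixed mixture weights. The sketch below illustrates that idea only; the source names and the 0.6/0.3/0.1 split are invented, not Xinghaitu's actual ratios.

```python
# Sketch of a "data recipe": drawing training samples from several data
# sources according to mixture weights. Sources and weights are hypothetical.

import random

def sample_source(recipe, rng):
    """Pick one data source according to the recipe's mixture weights."""
    sources, weights = zip(*recipe.items())
    return rng.choices(sources, weights=weights, k=1)[0]

recipe = {
    "real_machine_teleop": 0.6,  # real-robot data: most valuable, scarcest
    "simulation": 0.3,           # cheap to generate, but has a domain gap
    "open_source": 0.1,          # cross-ontology breadth
}

rng = random.Random(0)           # fixed seed for reproducibility
counts = {s: 0 for s in recipe}
for _ in range(10_000):
    counts[sample_source(recipe, rng)] += 1
print(counts)  # counts come out roughly proportional to the weights
```

Tuning those weights against downstream task performance is exactly the "ideal combination ratio" search the interview describes.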

Intelligence Emergence: What is the significance of open-sourcing a 500-hour dataset from the data you collected?

Zhao Xing: I think there are mainly two aspects.

One is to contribute a high - standard dataset and data evaluation standard to the industry, which may help establish enterprise standards in this field.

In robotics, each R&D team uses different ontology brands and configurations, performs different tasks, and runs different algorithms, which makes cross-comparison very difficult.

So an open-source dataset controls the other variables and makes it convenient for everyone to run different algorithms. That both gives us feedback on the dataset and jointly advances the field.

For example, if another team has its own algorithm, after running it on their own data and verifying its ability, they can also run it on our open-source data and see the algorithm's best practice from the results.

Second, Xinghaitu hopes to build a community of co-developers. Universities, research institutes, enterprises, etc., can all use our data to do experiments.

In terms of the company's model, we do both the whole machine and the intelligence. We found that after users buy a new robot, they usually have to go through a whole chain of work: parameter setting, data collection, data cleaning, and model training.

We hope that open-sourcing the data shortens this development process for everyone. It is equivalent to providing a set of development tools for users who buy our robots, reducing the work that used to be required.