
From Chaos to Order: The Data Supply Revolution and Skill Structuring Practice of Embodied Intelligence | 2026 AI Partner·Beijing Yizhuang AI + Industry Conference

未来一氪 | 2026-05-22 11:10
The implementation of embodied intelligence requires high-quality structured data. A five-layer data compilation pipeline is established to build an ecosystem.

Large language models can exploit the Scaling Law by piling up data. Robots, however, face a physical world that is dynamic, multi-modal, and strongly correlated over time. Simply piling messy data together will not train a reliable model. On the industrial path from chaos to order, quality matters more than quantity.

When robots enter factories and real scenarios, the real challenge lies not in the model itself but in the data. Xu Liangwei pointed out that the data for embodied intelligence is not simply pictures or videos, but multi-modal assets in which time, space, and task intention are tightly coupled. Zhiyu Jishi proposed a five-layer data compilation pipeline model, with clear quality indicators for each layer. Only by building a data base ecosystem, in which the ontology side (the makers of the robot body), the model side, and the industry side each perform their respective duties, can high-quality data from the physical world truly circulate and support the large-scale implementation of embodied intelligence.

The following is the content of the speech, compiled and edited by 36Kr:

Xu Liangwei | CTO of Zhiyu Jishi

Hello everyone. I'm the co-founder and CTO of Zhiyu Jishi. Today, I'd like to share with you the data supply revolution and skill structuring practice of embodied intelligence. The title is "From Chaos to Order". Why "From Chaos to Order"? The arrival of embodied intelligence has made us realize that the data practices of large language models, autonomous driving, and other past AI applications are not sufficient for embodied intelligence. Today, I'll mainly talk about what Zhiyu Jishi has done in this regard. We'll discuss two topics. The first is how to carry out standardized industrial practice for embodied intelligence data. The second is how to combine data with the model, the ontology (the robot body), the industry, and scenarios to form an ecosystem, rather than dealing with data alone.

Let's go straight to the implementation of robots. In 2026, we can see that some robots have gradually moved from small-scale samples to industrial deployment. Previously, we only considered how to present algorithms in the laboratory through videos or on-site demonstrations. Now the situation is completely different: we've moved the robots from the laboratory into real scenarios. Previously, we only needed to make the robots move and complete specified tasks. Now we need to consider how to enable robots to face uncertain, dynamic, and multi-modal scenario inputs and still interact with the physical world continuously and stably. This is where we need to consider how to build a stable supply of data.

There's an old saying that's quite right: the model determines the upper limit of a robot's capabilities. It determines what a robot can do, but it can hardly determine what the robot can do in the worst environment, because even humans may not handle things well in new scenarios. So we need to consider how to feed data from real scenarios, which may include ontology data, environmental data perceived by the robot, and even the robot's tasks and logs, into the closed loop of robot training. Only in this way can we turn what was previously at the small-scale sample level into something that can truly be implemented in the industry.

Previously, when people worked on language models, they talked about the Scaling Law, hoping that more and more data would make the model better. There's nothing wrong with this in itself. However, embodied intelligence is different from the structured data used previously: its data is multi-modal, continuous, and strongly correlated. If we simply pile up data, for example a large amount of Internet data mixed with robot operation data and any kind of simulation data, can we train a better model? It's possible, but at present we can hardly say that piling up messy, unregulated data will train a better model. We need to consider not only quantity but also quality. Quality shows up in two places. One is data collection. The other is the entire process that follows: quality inspection, pre-labeling, the human-in-the-loop stage, data post-processing, export, and finally model training, which closes the loop between model and data. Every link requires quality. If one link has a problem, the model can still be trained. But once the model is deployed on the ontology and enters a scenario, and something goes wrong there, how do we trace back to the part of the data or of the original closed loop that caused it? This is our requirement for data. Quantity is important, but quality matters in every link as well.

There are many approaches. The commonly mentioned VLA mainly relies on imitation learning, with visual input, language instructions, and the robot's actions: the robot sees a scenario, receives an instruction, and outputs a corresponding action. Its data is trajectory-based. Another frequently mentioned approach is the world model. Here the action becomes an input that ultimately acts on the physical world: we see a scenario, apply an action, and then observe what the physical world becomes. What we are considering at this point is the causal relationship. Although VLA and the world model differ as models, they require the same underlying asset: structured, high-quality data from the real world. We define data that is reasonable and suitable for the final model task, digitize the information in the physical world through certain means, and then, through a structuring process, turn it into something that can be input into the model. The original data is the same; only the intermediate processing differs slightly, and both build on the same data base.
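The point that VLA and world-model training draw on the same recorded data can be sketched in Python. This is a minimal illustration, not Zhiyu Jishi's schema: the `Step` and `Episode` classes, the two conversion functions, and the use of short strings in place of real images and actions are all assumptions made for readability.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One time step of a recorded episode (hypothetical schema)."""
    observation: str  # stand-in for camera frames or other sensor readings
    action: str       # stand-in for the action taken at this step


@dataclass
class Episode:
    """One recorded task execution with its language instruction."""
    instruction: str
    steps: List[Step]


def to_vla_samples(ep: Episode) -> List[Tuple[Tuple[str, str], str]]:
    """Imitation-learning view: (observation, instruction) -> action."""
    return [((s.observation, ep.instruction), s.action) for s in ep.steps]


def to_world_model_samples(ep: Episode) -> List[Tuple[Tuple[str, str], str]]:
    """Causal view: (observation, action) -> next observation."""
    pairs = zip(ep.steps, ep.steps[1:])
    return [((cur.observation, cur.action), nxt.observation) for cur, nxt in pairs]


episode = Episode(
    instruction="put the cup on the shelf",
    steps=[
        Step("cup on table", "grasp cup"),
        Step("cup in gripper", "move to shelf"),
        Step("cup above shelf", "release"),
    ],
)

vla = to_vla_samples(episode)
wm = to_world_model_samples(episode)
print(vla[0])  # (('cup on table', 'put the cup on the shelf'), 'grasp cup')
print(wm[0])   # (('cup on table', 'grasp cup'), 'cup in gripper')
```

Both views are derived from one `Episode`; only the way inputs and targets are paired changes, which is the sense in which the intermediate processing differs while the underlying record stays the same.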

The data base is a complete set of records of real scenarios, real tasks, real successes and failures, and real interactions with the environment, which can be fed into the model so that the model closes the loop with the real world. This data may come from the robot ontology. You can see many data collection factories and data training factories, where people operate robots to obtain robot data that can be used directly for pre-training and post-training. There are now also more advanced methods, such as having people record their work from a first-person perspective, digitizing human labor, and then training either a VLA or a world model so that the robot learns human skills. In essence, all of these turn the interaction between a human or a robot ontology and the environment from a physical concept into a digital one. Zhiyu Jishi has developed such a data base. No matter what kind of data flows in from the front end, we can process it through the data compilation pipeline into data that the model can use, and finally complete the closed loop from data to the ontology, back to the scenario, and then back to the data.

How do we turn the original record of a task into data that the model can use? The first step is to define the task well. We need to know what kind of data to collect: what the robot sees, what actions it takes, and even what it hears. We also need to pay attention to cause and effect. What was the previous scenario? After seeing it, what decision did I make? What actions did I take? After taking this action, what do I think will happen next, and how does the real world change? On one hand, we record all the sensor data from the real world. On the other hand, the task record is not simply read from the sensors; it comes from pre-planning or post-hoc deduction. By organizing the on-site records and the tasks, we can turn them into the data assets needed by robots and embodied intelligence. This involves how to collect the data, how to extract the key factors, and finally how to consolidate them into assets. It also involves handling success and failure: how the robot retries after a failure, what the retry strategy is, and what the result of the retry is. These are all important steps in turning the original data into training samples.
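A task record of the kind described here, with the scene before and after, the attempts, and the retry handling, might look like the following Python sketch. Every field name, the `recordings/episode_0001` path, and the example values are hypothetical; the speech does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attempt:
    """One try at the task, including why it failed and how the retry differs."""
    action_plan: str
    succeeded: bool
    failure_reason: Optional[str] = None
    retry_strategy: Optional[str] = None  # what changes on the next attempt


@dataclass
class TaskRecord:
    """Task-level record that sits alongside the raw sensor recording."""
    task: str
    scene_before: str
    sensor_log: str  # hypothetical pointer to the raw sensor recording
    attempts: List[Attempt] = field(default_factory=list)
    scene_after: Optional[str] = None

    @property
    def succeeded(self) -> bool:
        # The task counts as successful if the last attempt succeeded.
        return bool(self.attempts) and self.attempts[-1].succeeded

    @property
    def retries(self) -> int:
        # Every attempt after the first one is a retry.
        return max(len(self.attempts) - 1, 0)


record = TaskRecord(
    task="pick up the bolt",
    scene_before="bolt lying on tray",
    sensor_log="recordings/episode_0001",  # hypothetical path
    attempts=[
        Attempt("top grasp", succeeded=False,
                failure_reason="bolt slipped",
                retry_strategy="approach from the side"),
        Attempt("side grasp", succeeded=True),
    ],
    scene_after="bolt in gripper",
)

print(record.succeeded, record.retries)  # True 1
```

Keeping failed attempts and their retry strategy in the record, rather than only the final success, preserves the cause-and-effect information that the paragraph above treats as part of the training sample.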

This is the five-layer data compilation pipeline model proposed by Zhiyu Jishi. We realized that raw data cannot be fed into the model for training just because it has been collected and stored on a hard drive. There are many processes in between, and each has its key indicators. Only by doing each step well does the result become a true data asset rather than a mere data archive. These data assets can then enter the scenario and the model, and be combined with the ontology, to be truly used.

The first process is data quality inspection. First, we collect the data: collection turns the analog signals of the real physical environment into digital signals and stores them in digital form. The raw data is messy, unregulated, and unstructured. We don't know whether it is good, or whether it can enter the subsequent processing flow. So the first step is data quality inspection, to check whether the data meets the basic requirements for processing.
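As one example of what a basic inspection could check, the Python sketch below tests a single sensor stream's timestamps for ordering and dropped frames. The function name, the `max_gap_factor` threshold, and the choice of checks are assumptions for illustration; the speech does not list the indicators Zhiyu Jishi actually uses.

```python
from typing import Dict, List


def inspect_stream(timestamps: List[float], expected_hz: float,
                   max_gap_factor: float = 1.5) -> Dict[str, object]:
    """Run basic checks on one sensor stream's timestamps (in seconds)."""
    issues: List[str] = []
    if len(timestamps) < 2:
        return {"passed": False, "issues": ["too_few_samples"]}

    period = 1.0 / expected_hz
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]

    # Timestamps must strictly increase, otherwise frames cannot be ordered.
    if any(g <= 0 for g in gaps):
        issues.append("non_monotonic_timestamps")
    # A gap much longer than the nominal period suggests dropped frames.
    if any(g > max_gap_factor * period for g in gaps):
        issues.append("dropped_frames")

    return {"passed": not issues, "issues": issues}


good = inspect_stream([0.0, 0.1, 0.2, 0.3], expected_hz=10)
bad = inspect_stream([0.0, 0.1, 0.5, 0.4], expected_hz=10)
print(good)  # {'passed': True, 'issues': []}
print(bad)   # {'passed': False, 'issues': ['non_monotonic_timestamps', 'dropped_frames']}
```

A stream that fails such checks would be held back before alignment rather than silently passed on to later stages.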

After the data meets the requirements, it enters the data processing pipeline. The next step is data alignment. The data for robots or embodied intelligence is not simply pictures or videos; it tightly combines multiple modalities with time series. We need to align it in space and time and give it a spatio-temporal structure, so that it is no longer messy data but data that processing algorithms and machines can at least understand. Each frame of data can be indexed horizontally and vertically.

After that comes the third step, the level at which the data becomes usable by the model. We need to extract the semantic and causal relationships from the structured data: how the data interacts with the environment in space, how it aligns with the intention, and the causality, such as what happened before, what the scenario was, and what will happen later.

By this point, the data can be used by the model, but it is still far from real model generalization. The fourth step is large-scale data processing. Large-scale data has existed in many industries before, and all industries now talk about big data. It is different in embodied intelligence, however, because time, space, and task intention are all closely connected in this type of data. We need to be able to quickly retrieve the data needed by a certain type of model from hundreds of billions or even trillions of hours of large-scale data. This is also a very difficult task.
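The time-alignment part of the second step can be illustrated with nearest-timestamp matching between two streams. This Python sketch is one simple technique chosen for illustration, not a description of Zhiyu Jishi's pipeline; the stream names `camera_ts` and `joint_ts` and the `tolerance_s` parameter are assumptions, and `joint_ts` is assumed to be sorted and non-empty.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def nearest_index(sorted_ts: List[float], t: float) -> int:
    """Index of the timestamp in sorted_ts closest to t."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    # Choose the closer of the two neighbours around t.
    return i if sorted_ts[i] - t < t - sorted_ts[i - 1] else i - 1


def align(camera_ts: List[float], joint_ts: List[float],
          tolerance_s: float) -> List[Tuple[int, Optional[int]]]:
    """Pair each camera frame with the nearest joint-state sample.

    A frame with no sample within tolerance_s is paired with None,
    so that unmatched frames stay visible instead of being dropped.
    """
    pairs: List[Tuple[int, Optional[int]]] = []
    for ci, t in enumerate(camera_ts):
        ji = nearest_index(joint_ts, t)
        matched = ji if abs(joint_ts[ji] - t) <= tolerance_s else None
        pairs.append((ci, matched))
    return pairs


camera_ts = [0.00, 0.10, 0.20]
joint_ts = [0.01, 0.05, 0.11, 0.15, 0.40]
aligned = align(camera_ts, joint_ts, tolerance_s=0.02)
print(aligned)  # [(0, 0), (1, 2), (2, None)]
```

Once every frame carries an index into the other streams, a record can be looked up both along the time axis and across modalities, which is one reading of indexing the data "horizontally and vertically".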

After the first four steps, it becomes relatively simple. We process and align the data well, extract all the content, find the data needed by the model, and finally deliver it to the customer. This is the last step, delivery.

Technically, we've completed the closed loop from data to pre-training, but the full closed loop of the data is far from over. The data must be used by the model company, and the model company's model needs to be deployed on the ontology. It must not only complete the small-scale sample but also be implemented in the industry. The data has to follow the model as it is deployed to the ontology and implemented in the industry, and feedback from the industry then has to return to the data side. Only then does the data truly circulate, and intelligence is deployed not just at a single point but across the entire system. The data side plays a very core role here: it needs to connect with the ontology, the model, and the industry.

In many data businesses today, people still work on a project basis. The models have not converged, the ontologies are diverse, and industries are only gradually entering the embodied intelligence field. What Zhiyu Jishi does with data is not just a data project. We've built an entire system. By connecting with the ontology, the model, and the industry, we've turned project-based delivery capability into data infrastructure that can be used across the entire embodied intelligence field. We can then not only deliver a set of data but also support the development of embodied intelligence as a whole. In the future, all industries, ontologies, and models can obtain what they want from the data side.

We hope to establish a new division of labor for data. If the ontology company, the model company, or the industry side each handles data on its own, that cannot support the development of the entire industry. Only by building such an ecosystem can high-quality data from the physical world enter the entire ecosystem and promote the development of the embodied intelligence industry.

That's the end of my sharing. Thank you.