
The Data Battlefield of Embodied AI: Qunhe Builds the Training Arena, Baidu Lays the Data Pipeline, and JD.com Sets the Stage

奇点研究社 · 2026-04-24 19:29
When the Scaling Law Meets the Physical World

On the surface this is a race for data; in reality, it is a contest over the rules that govern how data is connected and allowed to flow.

The data battle for Embodied Intelligence is currently in full swing.

First, Tencent released Tairos, its open platform for Embodied Intelligence; then JD.com launched an Embodied Intelligence data trading platform and announced plans to mobilize 600,000 people to collect 10 million hours of data.

Recently, Baidu also launched a data marketplace for Embodied Intelligence to address long-standing industry problems such as uneven data quality, inconsistent format standards, and high usage costs.

The humanoid-robot marathon in Yizhuang last week pushed the popularity of Embodied Intelligence to a new peak.

Honor's robot "Blitz" completed the 21-kilometer course in 50 minutes and 26 seconds, breaking the men's world record for the half-marathon. The comment sections erupted: "a historic moment", "the year of deployment has begun!"

But on closer inspection, this is more a breakthrough in "mechanical ability" than in "AI ability". "Blitz" owes its performance to its 0.95-meter-long legs, its self-developed liquid-cooling system, and motor torque raised from 420 Nm to 600 Nm.

All of this is the result of accumulated engineering competence. Honor has carried over decades of expertise in lightweight construction and structural design from consumer electronics into robotics. Fed into a different robot body, the same algorithm would be very unlikely to match this performance.

The problem is not the algorithm, but that the concept of "Embodied Intelligence" bundles too many different things together.

Running 21 kilometers without stopping is one thing; helping you with everyday tasks is another; and working an uninterrupted 8-hour shift on a production line is yet another thing entirely.

And these three things correspond to three completely different data requirements.

For three years people have been saying "there is a lack of data", yet no one can say precisely what is lacking.

"The entire Internet holds only so many terabytes of trainable data, and that supply is nearly exhausted," said the founder of a leading Chinese large-model company in an interview. "People now lean more on retrieval augmentation to build B2B applications. For B2C, a breakthrough will require further progress in the base model itself."

This is the real concern in the large-language-model (LLM) industry.

Today, the LLM world's "data anxiety" has spread to Embodied Intelligence. On virtually every robotics forum, the lack of data is named as the biggest bottleneck.

But when asked more precisely which data is lacking, the answers vary widely.

The reason LLMs can exploit the Scaling Law rests on a prerequisite that cannot be ignored: Internet text is itself a "closed system".

A single sentence simultaneously carries intention, semantics, and even implicit inference paths. The model only needs to keep extracting patterns from this closed system.

Therefore, you just need to "feed" the model more. The more it "understands", the more naturally its capabilities will emerge.

But Embodied Intelligence does not have such a closed system.

You can collect a million hours of video of human life, but it contains no information about how a robot should control its joints; you can build ten million simulation scenes, but they typically lack the noise and long-tail distribution of the real world; you can teleoperate robots to collect 100,000 task episodes, but change the robot body and the transfer effectiveness drops sharply.

The data of Embodied Intelligence is not "collected", but "produced" in the physical world.

Moreover, different data types respond very differently to scale. Simply transplanting LLM logic onto Embodied Intelligence is therefore a misjudgment.

Dividing Embodied Intelligence data into categories makes this clearer. It falls roughly into three: motion control, scene understanding, and task decision-making.

Motion-control data tells the robot "how to move": joint angles, torque, motion trajectories, and so on. This data is tightly bound to a specific body and is by nature not reusable at scale.

Scene-understanding data tells the robot "what it sees": visual perception, spatial perception, object recognition, and so on. Because the world humans see and the world robots see are statistically similar, this is currently the only layer where the Scaling Law may apply.

The hardest is task-decision data, which tells the robot "what to do". It is the scarcest category in the entire system, because it requires perception, judgment, and execution all at once, annotated in sync.

Of these three data types, some problems can be solved by adding quantity; for others, that is impossible. In other words, in Embodied Intelligence the Scaling Law is not "useless" but "valid at different layers".
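The three-way split above can be written down as a rough taxonomy. The sketch below is purely illustrative: the class names, fields, and labels are this article's framing turned into code, not any industry standard or published schema.

```python
from dataclasses import dataclass
from enum import Enum

class ScalingBehavior(Enum):
    """How a data layer responds to simply adding more of it (illustrative labels)."""
    SCALES = "more data keeps helping, internet-style"
    PLATEAUS = "helps up to a point, then body-specific limits dominate"
    BOTTLENECKED = "volume alone cannot replace synchronized annotation"

@dataclass
class EmbodiedDataLayer:
    name: str
    teaches: str          # what the robot learns from this layer
    examples: tuple
    body_specific: bool   # is it bound to one robot body?
    scaling: ScalingBehavior

# The three categories from the text, encoded as data.
LAYERS = [
    EmbodiedDataLayer("motion control", "how to move",
                      ("joint angles", "torque", "motion trajectories"),
                      body_specific=True, scaling=ScalingBehavior.PLATEAUS),
    EmbodiedDataLayer("scene understanding", "what it sees",
                      ("visual perception", "spatial perception", "object recognition"),
                      body_specific=False, scaling=ScalingBehavior.SCALES),
    EmbodiedDataLayer("task decision-making", "what to do",
                      ("perception + judgment + execution, annotated in sync",),
                      body_specific=True, scaling=ScalingBehavior.BOTTLENECKED),
]

# Only the layer that is not bound to a body is a candidate for internet-style scaling.
scalable = [layer.name for layer in LAYERS if layer.scaling is ScalingBehavior.SCALES]
print(scalable)  # ['scene understanding']
```

The point of the encoding is the last line: of the three layers, exactly one survives the "just add more" filter, which is the article's claim in miniature.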

Some in the industry have already framed the problem this way. In a press interview, Dai Meng Robotics described the data supply of Embodied Intelligence as a pyramid.

At the top of the pyramid is data from the robot body itself: the most accurate, but the hardest to scale. The middle layer is data collected during deployment, a compromise between accuracy and scale. The bottom layer is massive human-perspective data, the easiest to scale.

Bottom-layer data can be generated at scale and trains "cognition". Top-layer data must be adapted to the body and trains "execution"; it requires careful fine-tuning, and there is no "more is better" logic.

Therefore, it no longer makes sense to simply talk about the "amount of data". The key lies in which layer you scale.

Following this thinking, academia is also seeking new approaches. The open-source project PHYAgentOS from Sun Yat-sen University separates the cognition layer from the execution layer: the large model serves as the cognitive input, not as the end executor.

This implies a new way of partitioning data: bottom-layer data trains cognitive ability and generalizes across bodies; top-layer data trains execution ability and remains bound to a specific body.

Once this structure is established, data-utilization efficiency changes fundamentally: data from different layers is no longer forced into the same model for processing.

After "where the data comes from" is settled, there is still "how the data is processed". This is where the industry's main technical directions diverge.

VLA (Vision-Language-Action) is the most widely used and best-known approach. It compresses vision, language, and motor information into a single model that outputs control signals; representative systems are RT-2 and π0. Because it requires "image + instruction + action" to exist simultaneously, data collection is expensive, and this approach is the hardest to scale.

The second route is the hierarchical large model: an LLM does high-level planning and then calls a VLA or traditional control algorithm for execution. It sacrifices some end-to-end consistency in exchange for higher data-utilization efficiency. Typical examples are Google's Gemini Robotics, Peking University's RoboOS, and the aforementioned PHYAgentOS.
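The planner/executor split described above can be sketched in a few lines. Both functions here are toy stubs, not the APIs of any of the systems named: in a real hierarchical system the planner would be an LLM and the executor a VLA policy or classical controller.

```python
from typing import Callable, List

def llm_planner(instruction: str) -> List[str]:
    """High-level planning: decompose an instruction into subgoals.
    Stub for what an LLM would produce; the plan table is made up."""
    plans = {
        "fetch the cup": ["locate cup", "move to cup", "grasp cup", "return"],
    }
    return plans.get(instruction, [])

def low_level_executor(subgoal: str) -> str:
    """Low-level execution: a VLA model or traditional controller would
    turn each subgoal into motor commands; here we only log the call."""
    return f"executed: {subgoal}"

def run(instruction: str,
        planner: Callable[[str], List[str]],
        executor: Callable[[str], str]) -> List[str]:
    # The split is the whole point: planner and executor can be trained on
    # different data layers (web-scale data for cognition, body-specific
    # data for control) and swapped independently.
    return [executor(goal) for goal in planner(instruction)]

log = run("fetch the cup", llm_planner, low_level_executor)
print(log[0])  # executed: locate cup
```

The "sacrificed end-to-end consistency" in the text shows up here as the hard interface between the two functions: the executor only ever sees the subgoal string, never the original instruction or the planner's internal state.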

The third, and currently most watched, is the world-model direction, e.g. DreamDojo and PAR/PhysGen. It emphasizes "understanding" physical laws directly from video, via zero-action pre-training. Representative players are NVIDIA abroad and Tuoyuan Intelligence in China.

But different players understand the same direction differently. Tuoyuan Intelligence chooses world inference in latent space, rather than on the surface of the video frames.

Chen Tianshui, co-founder of Tuoyuan Intelligence, told Singularity in an interview: "NVIDIA's One Action Model mainly models actions. Tuoyuan models both actions and physical properties. Latent features (thousands of dimensions) are more efficient than video pixels (around 2 million per frame) and better support action prediction."

How physical-token autoregression works: predicting future frames together with an action sequence that evolves in sync with the real environment.

The JEPA architecture proposed by Turing Award winner Yann LeCun belongs to the same paradigm, but leans more toward "predictive learning": future states are estimated in an abstract space, where causal relationships are learned.
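The latent-space idea shared by Tuoyuan's approach and JEPA can be shown numerically. The sketch below uses untrained random linear maps as stand-ins for the learned encoder and predictor, and shrinks the dimensions (4,096 "pixels", 32 latent dimensions) far below the article's ~2 million pixels vs. thousands of latent features. It only demonstrates *where* the prediction target lives, not how either system is actually trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shrunken stand-ins for the scales quoted in the text.
PIXELS, LATENT = 4096, 32

def encode(frame: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Encoder stub: project an observation into the abstract (latent) space."""
    return W @ frame

# Random linear encoder and latent predictor; a real JEPA learns both.
W_enc = rng.standard_normal((LATENT, PIXELS)) / np.sqrt(PIXELS)
W_pred = rng.standard_normal((LATENT, LATENT)) / np.sqrt(LATENT)

frame_t = rng.standard_normal(PIXELS)     # current observation
frame_next = rng.standard_normal(PIXELS)  # actual next observation

z_t = encode(frame_t, W_enc)
z_next = encode(frame_next, W_enc)
z_pred = W_pred @ z_t                     # predict the *latent* of the next state

# The training signal lives entirely in latent space (32 numbers here);
# the 4096-dimensional frame is never reconstructed pixel by pixel.
latent_loss = float(np.mean((z_pred - z_next) ** 2))
print(z_pred.shape, z_next.shape)  # (32,) (32,)
```

This is exactly the efficiency argument from the interview: the loss compares two short latent vectors instead of two full frames, so prediction and learning operate on orders of magnitude fewer numbers.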

At this point, we find that in the field of Embodied Intelligence, it doesn't make much sense to talk about "high - quality data" without considering the model architecture.

The words of Ma Xiaolong, co - founder of Zero Power, in an interview hit the nail on the head: "The effectiveness of data is essentially an adaptation problem. It may be useful for your model, but completely meaningless for my architecture. And for a third person in a different scenario, it may not be usable at all."

Qunhe builds the training arena, Baidu sets up the data pipeline, JD.com builds the stage

If you look at the recent data battles of the big players through this lens, you find that although everyone is "fighting for data", they are fighting for entirely different things.

The difference lies not in the "quantity", but in the "layer".

At the bottom layer sits Qunhe Technology. It controls the layer where the Scaling Law is most likely to apply: spatial data with "physical correctness".

According to its prospectus, Qunhe has accumulated 500 million 3D interior scenes and 480 million 3D models. This data was not "collected"; it is the by-product of repeated use, modification, and validation in real business activity.

Qunhe Technology's InteriorNet dataset (about 130 million images)

SpatialVerse, built on this data, is a "computable physical space": a thrown ball falls, a door offers resistance when opened, and floors have friction.

Physical correctness means independence from any particular model architecture. Whatever paradigm prevails in the future, be it Transformers, world models, or something else, robots must ultimately learn in an environment that obeys real-world physics.