Embodied AI Data Battle: Qunhe Builds a Training Ground, Baidu Lays Pipelines, and JD Builds a Stage
On the surface, it's a battle over data. In essence, it's a battle over the rules of how data is connected and flows.
Recently, the data war in embodied intelligence has been heating up.
Previously, Tencent released the Tairos Embodied Intelligence Open Platform. Later, JD.com launched an embodied intelligence data trading platform and plans to mobilize 600,000 people to collect 10 million hours of data.
Not long ago, Baidu also launched an embodied intelligence data supermarket, aiming to solve the long-standing pain points in the industry, such as uneven data quality, inconsistent format standards, and high usage costs.
Last week, the humanoid robot marathon in Yizhuang pushed the popularity of embodied intelligence to a climax.
Honor's robot "Lightning" completed the 21-kilometer course in a net time of 50 minutes and 26 seconds, breaking the world record for the men's half-marathon. Comment sections erupted: "a historic moment," "the first year of deployment has arrived!"
On closer inspection, however, this was more a breakthrough in mechanical capability than in AI capability. "Lightning" owed its result to its 0.95-meter-long legs, a self-developed liquid-cooling system, and motor torque raised from 420 Nm to 600 Nm.
These are accumulated engineering capabilities: Honor transferred a decade of lightweight and structural design expertise from consumer electronics to robots. Install the same set of algorithms on another robot, and it would very likely not achieve the same result.
The problem isn't the algorithms. It's that the term "embodied intelligence" covers too many different things.
Running 21 kilometers without stopping is one thing; helping you with work is another; and running on a production line for 8 hours without downtime is yet another thing entirely.
These three things correspond to three completely different data requirements.
People have been complaining about the "lack of data" for three years, but no one can clearly say what exactly is lacking.
"There isn't much data available for training on the entire Internet, only a few terabytes, and it's almost running out," said the founder of a leading domestic large - model manufacturer in an interview. "Currently, most people are using retrieval enhancement to implement B - end applications. For the C - end, the evolution of the base model is still needed to make a breakthrough."
This is the real anxiety in the field of large language models (LLMs).
Now, the "data anxiety" of LLMs is spreading to embodied intelligence. At any robot - related forum, almost everyone is saying that the lack of data is the biggest bottleneck.
But if you dig deeper and ask what kind of data is actually lacking, the answers vary widely.
For LLMs to follow the Scaling Law, there is a major premise that cannot be ignored: Internet text is itself a "closed-loop system".
A sentence contains intent, semantics, and even implicit reasoning paths. All the model needs to do is keep extracting patterns from these closed loops.
So the more data you "feed" the model, the more it "understands", and its capabilities will naturally emerge.
However, embodied intelligence has no such closed loop.
You can collect 1 million hours of human life videos, but they contain no information on how a robot should control its joints; you can build 10 million simulation scenarios, but they often lack the noise and long-tail distributions of the real world; you can accumulate 100,000 task records through teleoperation, but swap the robot body and transfer performance drops sharply.
The data for embodied intelligence is not "collected" but "created" in the physical world.
Moreover, different types of data respond differently to "scale". Therefore, applying the logic of LLMs directly is a misjudgment.
Breaking down the data of embodied intelligence makes this clearer. It can be roughly divided into three categories: motion control, scene understanding, and task decision-making.
Motion control data tells the robot "how to move", such as joint angles, torques, and motion trajectories. This type of data is highly bound to specific bodies and naturally lacks the ability to be reused on a large scale.
Scene understanding data tells the robot "what it sees", such as vision, space, and object recognition. Since the world seen by humans and the world seen by robots are statistically similar, this type of data is currently the only level that may follow the Scaling Law.
The hardest is task decision-making data, which tells the robot "what to do". It is the rarest type in the entire system because it requires perception, judgment, and execution to happen simultaneously and to be labeled in sync.
For these three types of data, some problems can be solved by increasing the quantity, while others cannot. In other words, in the field of embodied intelligence, the Scaling Law doesn't "fail"; instead, it "holds in different layers".
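The three-way split above can be sketched as data schemas. This is a minimal illustration in Python; all class and field names are invented for the sketch, not taken from any real dataset:

```python
from dataclasses import dataclass
from typing import List

# Illustrative record types for the three data categories described above.
# Every name here (MotionControlSample, embodiment_id, ...) is hypothetical.

@dataclass
class MotionControlSample:
    """'How to move': tightly bound to one specific robot body."""
    embodiment_id: str          # identifies the joint layout / actuators of one body
    joint_angles: List[float]
    joint_torques: List[float]

@dataclass
class SceneSample:
    """'What it sees': statistically similar across bodies, so it can scale."""
    rgb_path: str
    object_labels: List[str]

@dataclass
class TaskSample:
    """'What to do': perception, judgment, and execution labeled together."""
    scene: SceneSample
    instruction: str
    action_trace: List[MotionControlSample]

def reusable_across_bodies(sample) -> bool:
    """Scene data transfers freely; anything carrying an embodiment_id does not."""
    return not hasattr(sample, "embodiment_id")
```

The point of the sketch is the dependency structure: any record that references an `embodiment_id` is tied to one body, which is exactly why motion control data resists large-scale reuse while scene data does not.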
People in the industry have described this problem concretely. In a media interview, Daimeng Robotics said that the data supply of embodied intelligence has a pyramid structure.
The top layer is robot body data, the most accurate but the hardest to expand; the middle layer is deployable collection data, which balances accuracy and scale; the bottom layer is large-scale data from the human perspective, the easiest to grow in quantity.
The bottom-layer data can be accumulated by "scaling up" and trains "cognition". The top-layer data must be refined for a specific body and trains "execution". "More" is not automatically "better".
This is why simply talking about "data scale" no longer makes sense. The key lies in "which layer you are expanding".
Following this idea, academia has begun to offer new solutions. PHYAgentOS, an open-source project from Sun Yat-sen University, decouples the cognitive layer from the execution layer: the large model serves as the cognitive entry point rather than the final executor.
Behind this is a new division of labor for data: bottom-layer data trains cognitive abilities and generalizes across bodies; top-layer data trains execution abilities and stays bound to a specific body.
Once this structure is established, data-utilization efficiency changes qualitatively: data from different layers is no longer forced through the same model for digestion.
After solving the problem of "where the data comes from", we also need to see how the data is "digested", which involves several mainstream technical routes in the current industry.
VLA is the most common and mainstream route: it combines vision, language, and action in one model and outputs control signals directly. Representative systems are RT-2 and π0. This route requires data containing images, instructions, and actions simultaneously; none of the three can be missing. Collection costs are high, and it is the hardest route to scale.
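The "images + instructions + actions" constraint can be made concrete with a tiny validator. The field names below are hypothetical, not from RT-2 or π0:

```python
# Minimal sketch of the VLA data constraint: a training sample must carry all
# three modalities at once. Field names ("image", "instruction", "action")
# are invented for this illustration.

def is_valid_vla_sample(sample: dict) -> bool:
    """Reject any sample missing the image, the language instruction, or the action."""
    required = ("image", "instruction", "action")
    return all(sample.get(key) not in (None, "", []) for key in required)

# A web video frame has pixels but no aligned actions, so it cannot train a VLA:
web_video_frame = {"image": "frame_0001.png", "instruction": "", "action": []}

# A teleoperation record carries all three and is usable:
teleop_record = {
    "image": "frame_0001.png",
    "instruction": "pick up the red cup",
    "action": [0.10, -0.32, 0.05],  # e.g. an end-effector displacement
}
```

This is why abundant web video does not relieve the VLA data shortage: most of it fails the check on the action field, and filling that field is exactly the expensive part.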
The second route is the hierarchical large-model approach: an LLM handles high-level planning and then calls a VLA or traditional control algorithms for execution. It sacrifices some end-to-end consistency but gains higher data-utilization efficiency. Typical representatives include Google's Gemini Robotics, Peking University's RoboOS, and the aforementioned PHYAgentOS.
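The division of labor in the hierarchical route can be sketched as a two-layer loop. The planner below is a canned stub standing in for the LLM, and every name and subtask string is illustrative:

```python
# Sketch of the hierarchical route: a high-level planner (an LLM in real
# systems; a canned stub here) emits subtasks, and a separate low-level
# controller executes each one.

def plan(instruction: str) -> list:
    """Stub for the LLM planner: natural-language instruction -> ordered subtasks."""
    canned = {
        "tidy the desk": ["locate(cup)", "grasp(cup)", "place(cup, shelf)"],
    }
    return canned.get(instruction, [])

def execute(subtask: str) -> bool:
    """Stub for the low-level VLA / classical controller."""
    # In practice this would stream control signals to the body; here it
    # always reports success.
    return True

def run(instruction: str) -> int:
    """Run the two-layer loop; return how many subtasks completed."""
    done = 0
    for step in plan(instruction):
        if not execute(step):
            break
        done += 1
    return done
```

The data-efficiency gain comes from this seam: the planner can be trained on abundant text and human-perspective data, while only `execute` needs scarce body-specific data.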
The third is the world-model route, currently the most attention-grabbing, with examples such as DreamDojo and PAR/PhysGen. It emphasizes learning physical laws directly from video and zero-action pre-training. Representatives include NVIDIA abroad and Tuoyuan Intelligence in China.
However, different players have different understandings of the same route. Tuoyuan Intelligence chooses to conduct world - state inference in the latent space (rather than in the video frames).
Chen Tianshui, the co - founder of Tuoyuan Intelligence, mentioned in an interview with Singularity: "NVIDIA's One Action Model mainly focuses on modeling actions. Tuoyuan models both actions and physical states. The latent features (thousands of dimensions) are more efficient than video pixels (2 million pixels) and can better support action prediction."
(Figure: the operation mode of physical-token autoregression, predicting combinations of future frames and actions while evolving in sync with the real environment)
JEPA proposed by Turing Award winner Yann LeCun also belongs to this paradigm, but it leans more towards "predictive learning", that is, inferring future states in the abstract space and learning causal relationships.
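The latent-space idea shared by Tuoyuan and JEPA can be shown with a toy numerical sketch: predict the next latent state rather than the next frame, and compute the loss in latent space. Sizes are scaled down for the demo, and the encoder and dynamics are random/linear placeholders, not any company's model:

```python
import numpy as np

# Toy sketch of latent-space prediction: compare predicted and actual *latents*,
# never pixels. All sizes and maps are placeholders.

rng = np.random.default_rng(0)

FRAME = 64 * 64   # stand-in for a ~2-million-pixel frame
LATENT = 32       # stand-in for a few-thousand-dimensional latent

ENCODER = rng.standard_normal((LATENT, FRAME)) / np.sqrt(FRAME)  # fixed "encoder"
DYNAMICS = np.eye(LATENT) * 0.9                                  # toy latent dynamics

def encode(frame):
    """Project a flattened frame down to a compact latent state."""
    return ENCODER @ frame

def predict_next_latent(latent, action):
    """Roll the latent state forward, conditioned on the action."""
    return DYNAMICS @ latent + 0.1 * action

def latent_prediction_loss(frame_t, action_t, frame_t1):
    """Error measured in LATENT dimensions, not FRAME dimensions."""
    pred = predict_next_latent(encode(frame_t), action_t)
    return float(np.mean((pred - encode(frame_t1)) ** 2))
```

The efficiency argument in the quote above lives in the last function: the loss is computed over `LATENT` numbers instead of `FRAME` pixels, so prediction and learning both operate in the cheap, abstract space.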
At this point, we can see that in the field of embodied intelligence, talking about "high - quality data" without considering the model architecture doesn't make much sense.
Ma Xiaolong, co-founder of Zero Power, put the essence precisely in an interview: "The effectiveness of data is essentially a matching problem. What is useful for your model may be meaningless for my architecture, and completely useless for a third party in a different scenario."
Qunhe builds a training ground, Baidu lays pipelines, and JD.com sets up a stage
Looking at the recent data competition among large companies with this perspective, we can find that although they are all "grabbing data", they are actually grabbing different things.
The difference lies not in "quantity" but in "layer".
At the bottom layer is Qunhe Technology. Qunhe occupies the layer where the Scaling Law is most likely to hold: "physically correct" spatial data.
According to the prospectus, Qunhe has accumulated 500 million 3D indoor scenes and 480 million 3D models. These data were not "collected" but are the results of repeated calls, modifications, and verifications in real-world business use.
(Figure: the InteriorNet dataset launched by Qunhe Technology, containing approximately 130 million images)
Built on this data, SpatialVerse is a "computable physical space": a thrown ball falls, a pushed door offers resistance, and the floor has friction.
Physical correctness means that it doesn't depend on the evolution of any specific model architecture. Whether it's the Transformer, the world model, or other paradigms in the future, robots must ultimately learn and make decisions in an environment that conforms to real - world physical laws.
This means that once the Scaling Law for the bottom - layer data holds, the value of Qunhe will be exponentially magnified. It doesn't need to bet on "which model will win". Instead, it bets that all models must enter the "training ground".
If Qunhe solves the problem of "where the data comes from", then the next layer up is what Baidu is doing: answering "how the data flows".
Baidu's Embodied Intelligence Data Supermarket is a neutral data-circulation platform. It doesn't build robot bodies and doesn't produce data directly; instead, it tries to "organize" the data scattered across different enterprises and scenarios.
According to official disclosure, the Embodied Intelligence Data Supermarket has currently connected the data of more than ten embodied intelligence enterprises, with a total of over 10 million records. At the same time, it has launched the "Starry Sky Plan", planning to recruit approximately 100 scenario providers to open up real - world spaces.
More notable is its "heavy-service model". "The data on Baidu's Data Supermarket needs professional processing, and there is currently no free-upload mechanism. We have a high-end engineering team that supports customers for free and charges only for computing power and storage," said Xu Liang, sales director of Baidu Smart Cloud's general technology innovation industry, in an interview.
This means that it is not a simple matching platform but more like a "data processing factory" with strong processing capabilities: data needs to be cleaned, labeled, and structured before it can be used.
Meanwhile, Baidu is also building out a more fundamental piece of infrastructure: trusted data circulation. This includes a cloud-network-terminal security system and compliance capabilities for overseas markets. "The cloud-network-terminal security solution jointly developed by Baidu and leading customers has already been applied to products exported to Europe," Xu Liang added.
If we use a more intuitive analogy, Baidu is more like the "Visa" in the era of embodied intelligence: it doesn't directly participate in transactions but determines whether and how the "transaction" of data can occur.
Going up one more layer is JD.com.
In fact, JD.com's value has been seriously underestimated. It launched an embodied intelligence data trading platform and mobilized 600,000 people to collect 10 million hours of video of real-world human scenarios. At the Yizhuang robot marathon, JD.com, as an AI technology strategic partner, provided full-cycle support: transportation, rescue, battery swaps, and maintenance.
The event directly doubled the sales of more than 20 robot brands, and the relevant search volume increased by 300