
In 2026, selling data will make money faster than selling robots.

Digital Intelligence Frontline, 2026-05-11 20:49
The Dilemma of Data and the Battle for Billions in Mining

Behind the 99% data gap in embodied intelligence, who is "selling shovels"?

Humanoid robots can dance on the Spring Festival Gala and run marathons, but they can't unscrew an unfamiliar bottle cap.

The reason is that the data is not "well-traveled" enough. In 2026, as a capital frenzy sweeps through the embodied intelligence sector, a hard truth is emerging: high-quality embodied data has become the biggest constraint on the evolution of embodied intelligence. Facing a data gap of up to 99%, players in the sector are going all out to build data infrastructure. As a result, this year is being called the "Year of Data Scaling" for embodied intelligence.

"By 'Year' we do not mean that the problem has been solved, but that the industry has, for the first time, moved from 'creating demos' to 'building a large-scale data system'," a representative of Guanglun Intelligence told Digital Intelligence Frontline. The company sees three things happening. First, millions of hours of effective, high-quality data have become the entry threshold for leading teams. Second, data investment has moved from a marginal budget item to a core one. Third, more and more real industrial scenarios are starting to pay for the data infrastructure required to train, evaluate, and deploy embodied intelligence.

There has long been an iron rule in AI circles: those who "sell shovels" are the first to make money. In 2026, a data business built around embodied intelligence is quietly heating up.

Embodied Data: Demand Explosion

In early 2026, demand for data in embodied intelligence is heating up rapidly.

"I think the demand is a hundred times that of last year," said Yang Haibo, co-founder and president of Guanglun Intelligence. Guanglun Intelligence, a unicorn in embodied data, has seen its data business climb sharply. In the first quarter of 2026 it won orders worth 550 million yuan, exceeding its total order value for all of 2025 and setting a new industry record. More than 80% of the simulation assets and synthetic data used by major international embodied intelligence teams come from the company.

Behind this, data has been elevated to an unprecedented strategic position. "Since this year, embodied data has shifted from an ancillary investment to a core budget item and has become one of the fastest-growing segments in customers' budgets," a representative of Guanglun Intelligence told Digital Intelligence Frontline. The industry now recognizes that the ceiling on model capability and the speed of real-world deployment are determined not only by algorithms and the robot body (the "ontology") but also by whether there is a continuous, iterable, and evaluable data supply system.

Yao Maoqing, chairman and CEO of Mifeng Technology under Zhiyuan Robotics, has also felt this wave of demand. He said in April that demand is currently concentrated among leading large-model teams, domestic and foreign embodied intelligence giants, and startups. "Buyers are generally in a state of 'I'll buy as much as you have, and I'll take it as soon as you have it'."

In his view, data will become a basic factor of production like computing power, with investment attributes and a payback cycle. "Those who 'sell shovels' are the first to make money," Yao Maoqing said. At this stage, the industry needs large amounts of data for R&D and verification, which in turn gives rise to applications. By the logic that infrastructure comes first, data will pay back faster than robot hardware or industry-specific solutions.

The industry generally believes that there are three main driving factors behind this wave of explosive demand:

First, the evolution of the "brain" creates a need for data "rations". The core bottleneck holding back large-scale robot deployment has shifted from hardware and low-level motion control to the "brain", that is, the shortcomings of the embodied intelligence model itself. Embodied VLA (vision-language-action) models and world models are making rapid breakthroughs and entering more complex task spaces. In the process, they must be fed large amounts of data.

Second, faster industrial deployment has shifted data demand from the laboratory level to the deployment level. As robots enter real scenarios such as factories, logistics, and commerce, the scale of data they require has increased significantly. "A robot may need on the order of a thousand hours of training data to complete a single task, and more for complex tasks," Yang Haibo said.

Third, the value of non-ontology data has been verified, and collection efficiency has soared. In the past, embodied data collection relied mainly on manual operation in the laboratory, and only dozens of hours of data could be collected in a day, far short of the scale required for model training and industrial deployment. Now, technologies such as VR teleoperation, exoskeletons, UMI, and Ego are maturing, and data collection has moved from a small-scale, low-efficiency "handicraft workshop" to a larger-scale, higher-efficiency production stage.

However, in sharp contrast to the explosive demand is a serious "data desert". The industry consensus is that training an embodied model with general-purpose generalization ability requires at least tens of millions of hours of data. But as of early 2026, the global stock of high-quality real physical interaction data is only about 500,000 hours, less than one-twentieth of the data used for large language model training. CSDN data also shows that embodied intelligence requires physical interaction data on the scale of hundreds of petabytes, and that the current gap is over 99%.
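The hour-based figures above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the function name `data_gap` is hypothetical, and the 10 million and 50 million hour requirements are assumed sample points for "tens of millions of hours", not figures from the article. The 99% gap attributed to CSDN is stated in terms of data volume, so the hour-based result is only a rough cross-check.

```python
def data_gap(available_hours: float, required_hours: float) -> float:
    """Return the fraction of required data that is still missing (0.0 to 1.0)."""
    if required_hours <= 0:
        raise ValueError("required_hours must be positive")
    missing = max(required_hours - available_hours, 0.0)
    return missing / required_hours


AVAILABLE_HOURS = 500_000  # global high-quality real interaction data cited in the article

# Assumed sample points for "at least tens of millions of hours" (not from the article).
for required in (10_000_000, 50_000_000):
    gap = data_gap(AVAILABLE_HOURS, required)
    print(f"required={required:,} h -> gap={gap:.1%}")
```

Under these assumptions the gap is 95% at a requirement of 10 million hours and reaches 99% at 50 million hours, so the hour-based numbers are broadly consistent with a gap in the high nineties.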

With both opportunities and challenges, a battle for embodied data has begun.

The Data Pyramid: Players' Positioning

Facing the 99% data gap, the supply side has moved past scattered trials and quickly set off a rush to build data infrastructure.

"Millions of hours" has become the standard entry threshold. Lingchu Intelligence, Luming Robotics, Xinghaitu, and others are all racing to collect millions of hours of effective data. JD.com has proposed collecting 1 million hours of robot ontology data plus 10 million hours of real-world human scenario video data within two years. Mifeng Technology has officially announced that it will reach a data production capacity of tens of millions of hours in 2026.

Behind the industry's large-scale expansion is a broad consensus on the "data pyramid". The top layer is real-machine data, which has the highest accuracy and best matches real scenarios but is costly and in short supply. The middle layer is simulated synthetic data, which is low-cost and easy to mass-produce but faces the "Sim-to-Real" transfer problem. The bottom layer is Internet video and human behavior data, which generalizes well but has low accuracy and requires extensive cleaning and action alignment. All three types are indispensable, and industry players are positioning themselves across the whole pyramid.

The supply side focused first on real-machine data at the top of the pyramid. Within it, mainstream teleoperation data is regarded as "golden data": professional operators remotely control real robots through master-slave control or VR devices to complete delicate actions. According to third-party data, as of early April 2026, 64 embodied intelligence data collection centers, innovation centers, and training fields had been planned or were to be built nationwide, covering at least 27 cities.

Leading enterprises are the main builders. Zhiyuan has set up data collection centers in Shanghai, Chengdu, and other cities; Luming Robotics has built 3 standardized data collection fields. Pacini completed its Tianjin data collection factory in April last year and announced this year that it will build 4 more in Suqian, Wuhan, Ganzhou, and other cities. JD.com plans to launch a crowdsourced data collection project involving 600,000 people. Among local governments, Shanghai Zhangjiang has built the country's first heterogeneous humanoid robot training field, with the goal of collecting 5 million pieces of real-machine data within the year.

However, limited by collection cost and efficiency, real-machine data is hard to scale up quickly. The industry is accelerating its shift to a hybrid strategy of "strengthening the middle-layer simulation data + consolidating the bottom-layer human data" to reduce its absolute dependence on expensive real-machine data.

Simulated synthetic data is currently the mainstream route to large-scale data production. Guanglun Intelligence believes that in the future, simulation data will carry large-scale pre-training, evaluation, and reinforcement learning, human video data will provide behavioral priors, and real-machine data will be used mainly for scenario alignment and the final 1% of fine-tuning. To that end, Guanglun Intelligence has developed its own physics simulation engine to reproduce how real-world objects move and deform, and has built a technical system around a three-layer "world-behavior-evaluation" architecture that covers simulated world generation, large-scale data production, and model capability evaluation.

In addition to real-machine and simulation data, non-ontology data, represented by UMI and egocentric data (first-person human video data, hereinafter Ego data), is emerging. This type of data records operation trajectories simply by having the collector wear a device, and it combines high efficiency, low cost, and strong generalization. Yao Maoqing said that the domestic market price of real-machine data is about 500 to 1,000 yuan per hour, and that non-ontology data can be collected about two to three times as efficiently as real-machine data. Although some quotes have been higher because of insufficient scale, the price is expected to converge eventually to one-third to one-half that of real-machine data.
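The price ranges Yao Maoqing describes translate into a rough cost comparison. What follows is a minimal sketch, assuming hypothetical helpers `cost_range` and `non_ontology_price_range` and an illustrative order of 1 million hours (the order size is an assumption, not a figure tied to any quoted price). The per-hour prices and the one-third to one-half ratio come from the paragraph above.

```python
from fractions import Fraction

REAL_MACHINE_YUAN_PER_HOUR = (500, 1000)  # quoted market price range
NON_ONTOLOGY_RATIO = (Fraction(1, 3), Fraction(1, 2))  # expected eventual price ratio


def cost_range(hours: int, price_range: tuple) -> tuple:
    """Return (low, high) total cost in yuan for the given number of hours."""
    low, high = price_range
    return hours * low, hours * high


def non_ontology_price_range() -> tuple:
    """Cheapest case: low price x low ratio; most expensive: high price x high ratio."""
    low = REAL_MACHINE_YUAN_PER_HOUR[0] * NON_ONTOLOGY_RATIO[0]
    high = REAL_MACHINE_YUAN_PER_HOUR[1] * NON_ONTOLOGY_RATIO[1]
    return low, high


HOURS = 1_000_000  # illustrative order size (assumption)
real_low, real_high = cost_range(HOURS, REAL_MACHINE_YUAN_PER_HOUR)
ego_low, ego_high = cost_range(HOURS, non_ontology_price_range())
print(f"real-machine: {real_low / 1e6:.0f} to {real_high / 1e6:.0f} million yuan")
print(f"non-ontology: {float(ego_low) / 1e6:.0f} to {float(ego_high) / 1e6:.0f} million yuan")
```

Under these assumptions, 1 million hours would cost roughly 500 million to 1 billion yuan as real-machine data, versus roughly 167 million to 500 million yuan as non-ontology data once prices converge.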

Of these, the UMI solution has a person demonstrate operations with a handheld gripper while a camera records the whole process. As long as the gripper's appearance and the camera parameters are the same, the data can be used across different robotic arms, which supports cross-ontology data reuse. Ego data captures first-person view and action information through head-worn and wrist-worn devices. Both approaches lend themselves to crowdsourced collection.

Luming Robotics has released a "full package" of FastUMI non-ontology data collection products and plans to build UMI data production capacity of over 1 million hours in 2026. JD.com has launched JoyEgoCam, its self-developed ultra-high-definition collection terminal, which suits scenarios such as warehousing, retail, and housekeeping. Mifeng Technology has released the MEgo series of non-ontology data collection devices, and 60% to 70% of its planned capacity of tens of millions of hours this year will come from non-ontology collection.

The embodied data market is growing explosively, but millions of hours are far from the finish line. The industry's real bottleneck is not any single data source but the lack of a unified, sustainable data infrastructure in which data can circulate. JD.com has launched end-to-end infrastructure for embodied intelligence data along with a data trading platform. Leju has teamed up with China Mobile, Huawei, Alibaba Cloud, and others to build a data ecosystem. Mifeng Technology positions itself as a one-stop physical AI data service platform. Guanglun Intelligence is also continuing to improve its embodied data engine, building a simulation ecosystem and a closed evaluation loop, and plans to produce 10 million hours of embodied data this year in cooperation with more than 1,000 scenario providers.

What kind of data can feed embodied intelligence?

As the battle for embodied intelligence data begins, key questions emerge: What do data buyers care about most when purchasing? What kind of "good data" does the industry need most urgently right now?

A representative of Guanglun Intelligence told Digital Intelligence Frontline that what customers care about most when buying embodied data today is not "whether the quantity is large" or "whether the unit price is high", but whether the batch of data can actually be turned into better model capabilities. What they are buying is not just "data volume" but "the systematic ability to support the closed loop of training, evaluation, and deployment".

Cao Yu, head of embodied data solutions at Kupasi, also said that, after talks with leading companies, the general feedback is that what algorithm teams need most right now is not another batch of data but a method for feeding data directly into the model and getting it to run. That means working from the final commercial application scenario: how the data should be collected, labeled, trained on, and evaluated, and whether the results can be clearly explained. The industry is pursuing an "AI ready" state.

A representative of JD.com's embodied intelligence department pointed out: "Customers first pay attention to the type of data and will ask whether it comes from teleoperation or head-worn devices. Next, they care about whether the data has been processed and labeled, and which dimensions are labeled, such as hand key points, position, and text description, and whether the accuracy is at the millimeter or centimeter level." All of these become important references when embodied intelligence companies decide whether to use the data.
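The questions buyers ask can be pictured as fields on a dataset descriptor. The sketch below is purely illustrative: `DatasetCard`, its field names, and the `unanswered_questions` helper are hypothetical and do not reflect any vendor's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetCard:
    """Hypothetical descriptor covering the questions buyers ask first."""
    source_type: Optional[str] = None  # e.g. "teleoperation" or "head-worn"
    labeled_dimensions: List[str] = field(default_factory=list)  # e.g. "hand_keypoints"
    accuracy_mm: Optional[float] = None  # spatial accuracy in millimeters


def unanswered_questions(card: DatasetCard) -> List[str]:
    """List the buyer questions this card leaves open."""
    missing = []
    if card.source_type is None:
        missing.append("data type (teleoperation or head-worn?)")
    if not card.labeled_dimensions:
        missing.append("labeled dimensions")
    if card.accuracy_mm is None:
        missing.append("accuracy (millimeter or centimeter level?)")
    return missing


card = DatasetCard(
    source_type="head-worn",
    labeled_dimensions=["hand_keypoints", "position", "text_description"],
)
print(unanswered_questions(card))
```

In this example the card states the data type and the labeled dimensions but not the accuracy, so the accuracy question is the only one reported as open.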

The industry's observation is that truly high-quality embodied data usually meets four conditions at once:

First, physical realism. This is the bottom line. Unlike Internet image-and-text data, embodied data must not only look real but also accurately capture key physical information such as contact, force, and state change. If the data lacks physical realism, robots trained on it are prone to problems such as failed grasps and unstable operation in the real world.

Second, scalability. The data should be able to support pre-training and continuous iteration, rather than just being enough for a few demos. Xie Chen, founder and CEO of Guanglun Intelligence, emphasized that only data that is both sufficiently scalable and able to support lifelong learning counts as good data.

Third, sufficient diversity. The model needs to see the whole picture of the world, so the scenarios, tasks, execution paths, and operating habits covered by the data must be diverse enough, and in particular must not be limited to perfect success trajectories. Yang Haibo of Guanglun Intelligence emphasized that data containing failures and flaws is also extremely valuable. "We once had a customer who bought this kind of 'not-so-successful' case data at 1.5 times the price." Yao Maoqing of Mifeng Technology also said that his team deliberately captures failures and recoveries during collection.

The logic is that in the pre-training stage, the "diversity" of data matters more than its "correctness". Just as a baby learns to walk through trial and error, embodied intelligence needs to learn physical laws and causal logic on its own from data that mixes correct and incorrect examples. The real world has no perfectly standard actions, and data that contains the full "failure-correction-success" process is more valuable because it is closer to how learning happens in the real world.

Accordingly, Zhu Zheng, co-founder and chief scientist of Jijia, pointed out: "Some work in the industry defines only the final goal without strictly defining the collection process, and lets collectors gather data based on their own understanding as far as possible. I think this is a good start."

Huang Yongtao, chief data scientist of Ant Lingbo Technology, added that although data from fixed-station factory assembly lines is large in volume and highly standardized in its actions, it is highly homogeneous, and its marginal value for improving model capabilities is relatively low. High-quality embodied data values diversity over simple regularity.

Fourth, end-to-end usability. Zhu Zheng of Jijia pointed out that embodied data is currently labeled far too briefly. For traditional multimodal image-text models, a single image can come with thousands of words of detailed labels that fully capture the scene background, image details, and multiple perspectives of understanding. By contrast, most embodied video data today has only basic action labels and lacks detailed descriptions of environmental semantics and task processes, which falls far short of what high-quality model training requires.

Beyond these four dimensions, the industry has also proposed a deeper standard: behavior alignment. Yao Guocai, head of Embodied Infra & Data at the Institute of Intelligence, believes that the mission of embodied data is to better represent human behavior and to align the model with it. Truly valuable data should capture real human behavior patterns with high fidelity and diversity, including unconscious and hidden behaviors, such as judging whether a water cup is clean before picking it up. These are details that most current models and data systems have not considered.

Take Ego data, which is currently attracting a lot of attention. One of its core values lies in collection "in the wild" (in natural, real scenarios), which captures the varied behavior patterns of daily life. However, many data collection vendors still design tasks artificially and have collectors repeat them, which throws away exactly what matters most about this kind of data: the capture of natural behavior in the wild. In addition, data modalities closely related to human intentions, such as