The viral robots are still trapped in the "data assembly line".
Text | Zhang Bingbing
Editor | A Zhi
Backflips, dancing, boxing, kicking a watermelon to pieces... Over the past year, these visually striking robot action clips have been trending constantly. The industry is excited, capital is pouring in at an accelerating pace, and public expectations have reached an all-time high: mature robot products seem to be stepping quickly from the laboratory into reality.
However, in the data training centers known as the robots' "schools", the scene is much quieter: data collectors hold operating devices and guide the robots beside them through seemingly simple tasks, such as picking up parts from the table, putting them into a toolbox, and closing the lid. The movements are slow, with occasional pauses.
Beijing Humanoid Robot Data Training Center
This is just the first step of "learning". Every time a robot completes a set of actions, a structured data record is generated. Feed these records into a large model for training, and with a sufficiently large amount of data, the robot may develop a "brain", breaking away from passive programmed control and moving toward active understanding and decision-making. In the words of industry insiders, this will be the "difference between a monkey and a human".
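To make "a structured data record" concrete, here is a minimal sketch of what one training episode might contain. The field names and shapes are illustrative assumptions on our part, not any data center's actual schema (as the article notes later, the industry has no unified data standard):

```python
from dataclasses import dataclass, field

# Hypothetical schema for one robot training episode.
# Field names and units are illustrative assumptions only.
@dataclass
class EpisodeStep:
    timestamp_s: float            # wall-clock time of this step
    joint_positions: list[float]  # one value per joint, in radians
    joint_torques: list[float]    # measured torque per joint, in N*m
    gripper_open: float           # 0.0 (closed) .. 1.0 (open)
    rgb_frame_path: str           # pointer to the synced camera image

@dataclass
class Episode:
    task: str                     # e.g. "place pliers into toolbox"
    robot_model: str              # hardware identity matters for reuse
    steps: list[EpisodeStep] = field(default_factory=list)
    success: bool = False         # label used to filter training data

ep = Episode(task="place pliers into toolbox", robot_model="demo-arm")
ep.steps.append(EpisodeStep(0.0, [0.1, -0.4], [0.02, 0.15], 1.0, "frame_0000.png"))
ep.success = True
```

A real record would carry far more modalities (force, torque, multi-camera views, depth), but the point is the structure: time-aligned sensor state plus a task label and outcome.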
The logic of "data + computing power + algorithms" is familiar. The large language models that have swept the world, represented by ChatGPT and DeepSeek, have verified its feasibility and established a relatively mature system of computing resources and algorithms. The challenge for robots, however, is that this time intelligence has moved from the digital world into the physical world, and data has become the highest barrier.
The language, image, and other data used by large language models essentially exist in the two-dimensional digital world and are easy to obtain and replicate. The three-dimensional physical world that robots face, by contrast, is a high-dimensional, continuous, multimodal spatio-temporal flow that includes sensor signals such as vision, sound, force, torque, and body posture. The difficulty of data processing increases exponentially.
If the data accumulated over the years on the Internet provided ample ammunition for the development of large language models, then the collection and accumulation of physical-world data must start almost from scratch.
Against this backdrop, "real-machine data" — raw operation data collected from robots in real physical environments — has become an industry consensus for its scarcity and value. In the past year, robot data collection centers positioned as "infrastructure" have sprung up everywhere, and dull but crucial data collection scenes like the one at the beginning of this article are playing out across the country.
However, real-machine training means a huge investment of time and capital, and once construction of a data center starts, there is no turning back. Amid the boom, more sober thinking is needed: what kind of data counts as "high-quality"? How can trained data circulate and be reused efficiently? Before the data gap is filled, how can the industry make practical progress?
Before the full-scale launch of the "new infrastructure" of the robot era, exploring and answering these questions will determine whether "embodied intelligence" is a solid industrial upgrade or just another over-hyped concept.
I. Data Collection: Meticulous Work Yields Fine Results
In the centralized training area of the Beijing Humanoid Robot Data Training Center, visitors can see directly through the transparent glass how robots "learn". A data collector puts on gloves connected to the collection device, and the hand movements are transmitted to the robot beside them, making it pick up the pliers on the table, put them into the toolbox, take them out, and put them in again, over and over.
Simple tasks such as grasping, picking, taking, and placing are trained in such small desktop scenarios. Farther away, the view is blocked by white screens: to prevent data contamination, each operation area is separated into its own compartment, physically isolating interference and keeping the data clean.
In the scene training area on the other side, the picture becomes more complex. An unmanned supermarket is stocked with goods, books are scattered in the living room, and clothes and towels are piled up in the bedroom and bathroom. In these highly realistic scenes, where people can move freely, robots must complete tasks such as tidying items and folding clothes in a more complex but more lifelike environment.
Scene Training Area of Beijing Humanoid Robot Data Training Center
From monotonous basic action training to complex real-scene recreation, there is only one goal: to collect high-quality real-machine robot data in batches.
This is also the core goal of all data centers.
However, the robot industry has not yet formed a unified data standard. Different collection centers often have their own data representations and format requirements, and their paths toward the goal can diverge from the moment a data center is built.
The operator of the Beijing Humanoid Robot Data Training Center is Reeman Intelligent Technology (Beijing) Co., Ltd. As a robot company focused on robotic-arm research and development, Reeman sets particularly high requirements for hardware among all the dimensions by which data is evaluated.
According to a person in charge at Reeman, the data center requires high-precision calibration for each robot body, covering absolute motion accuracy and camera parameters. All robots are equipped with high-precision sensors that can collect state data in up to 57 dimensions.
Another major hardware challenge is spatio-temporal alignment. The cameras used in data collection typically sample at 30 Hz — 30 images per second, about 33 milliseconds between frames. Without time alignment, that 33-millisecond difference means the joint encoders, cameras, and force sensors each capture "fragments of the world" at different moments.
Model training relies on strict causal relationships, and even millisecond-level asynchrony can cause serious misalignment. Reeman therefore adopts a hardware-based synchronous alignment strategy during collection, ensuring that all sensor and camera data are stamped against real physical time at the hardware level, with an error within 1 millisecond.
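The alignment problem can be illustrated in a few lines. This is not Reeman's actual pipeline — their synchronization happens in hardware — but a toy sketch of the software-side fallback: pairing each 30 Hz camera frame with the nearest higher-rate encoder sample and measuring the residual time skew, which with this method is bounded only by half the encoder's sampling period:

```python
# Toy illustration (not Reeman's hardware sync): nearest-neighbor
# matching of camera frames to joint-encoder samples in time.
def nearest_sample(frame_t, sample_times):
    """Return the closest sample time and the residual skew."""
    best = min(sample_times, key=lambda t: abs(t - frame_t))
    return best, abs(best - frame_t)

camera_hz, encoder_hz = 30, 500
frames = [i / camera_hz for i in range(10)]     # ~33 ms apart
samples = [i / encoder_hz for i in range(200)]  # 2 ms apart

worst_skew = max(nearest_sample(t, samples)[1] for t in frames)
print(f"worst frame-to-sample skew: {worst_skew * 1000:.2f} ms")
```

Even in this idealized setting the skew only approaches zero as the encoder rate rises; with clock drift between unsynchronized devices, software matching degrades further, which is why hardware-level triggering against a shared physical clock is needed to guarantee sub-millisecond error.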
On top of high-precision hardware calibration and strict spatio-temporal alignment, a diversity matrix system varies the scene items and the robot's positions and postures, so that the model does not degrade from over-fitting to the data. Only after strict credibility verification is a high-quality real-machine data record considered collected.
The Reeman representative said that a robot that can truly enter households should have stable, reliable physical joints, be easy to use, and deliver maximum load capacity in minimum volume. At the AI level, data dimensions are crucial. "We believe real-machine data is the last threshold for robots to enter households, so we work firmly backward from that end goal and provide such data assets."
Currently, the Beijing Humanoid Robot Data Training Center has reached large-scale production, generating about 60,000 data records per day and covering 16 sub-scenarios across four major fields: industrial manufacturing, smart home, health care services, and 5G integration.
II. The Gap between Data Shortage and Data Heterogeneity
Data from the technology market research firm Interact Analysis shows that by the end of 2025, more than 50 national, provincial, or municipal humanoid robot data collection and training centers in China will be in use or under planning and construction, and more than half of them officially went into use in 2025.
Taking the Beijing Humanoid Robot Data Training Center as a reference, its annual output of real-machine data is in the tens of millions of records. A rough calculation suggests that if all current data centers ran at full capacity, annual robot data collection could reach billions of records.
Yet this seemingly large supply is still a drop in the bucket next to the "intelligence" robots require.
According to a conservative estimate by the robot data service provider Miter Technology — assuming an embodied-intelligence foundation model that is good enough and data quality that is high enough — training a robot to learn one action takes about 1,000-5,000 data records; a task composed of multiple actions, about 10,000-20,000 records; completing 80% of human work in one vertical industry, at least 100 million records; and generalizing embodied intelligence to all industries, at least hundreds of billions of records. The data gap is 4-5 orders of magnitude.
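The figures above can be cross-checked with back-of-envelope arithmetic using only numbers quoted in the article (all of them estimates, not measurements):

```python
# Back-of-envelope check of the article's figures.
per_center_daily = 60_000                  # Beijing center's stated daily output
annual_one_center = per_center_daily * 365 # matches "tens of millions" per year
centers = 50                               # centers in use or planned
annual_all = annual_one_center * centers   # if every center ran at the same rate

needed_general = 100e9                     # "hundreds of billions" lower bound
years_at_current_rate = needed_general / annual_all

print(f"one center per year:  {annual_one_center:,}")
print(f"all centers per year: {annual_all:,}")
print(f"years to reach 1e11 at that rate: {years_at_current_rate:.0f}")
```

One center produces roughly 22 million records a year, fifty of them about 1.1 billion — consistent with the article's "tens of millions" and "billions". Even at that combined rate, reaching the lower bound for cross-industry generalization would take on the order of ninety years, which is the quantitative shape of the gap the article describes.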
A bigger gap lies in data heterogeneity. Because robots from different manufacturers, in different forms, differ in hardware design, sensor configuration, and software protocols, the action, force, and visual data they produce are mutually incompatible: data collected on one robot may not work on another.
This makes it difficult for results trained at different data centers to add up.
Before a unified industry - wide standard emerges, data centers are exploring various solutions.
One is to "shield differences": training with robotic arms or robot models that hold a high market share. This avoids compatibility issues at the hardware root and pursues broader applicability of the data — the approach of the Beijing Humanoid Robot Data Training Center described above.
Another is to "embrace differences" and actively train across heterogeneous hardware. In Zhangjiang, Shanghai, the Embodied Intelligence Training Ground of the National-Local Joint Humanoid Robot Innovation Center (hereinafter the "National-Local Center") has pioneered a method for building an embodied-intelligence dataset from heterogeneous humanoid robots, aiming to create the largest such dataset.
Here, robots from different manufacturers operate collaboratively in the same physical space. Jiang Lei, the chief scientist of the National - Local Center, once said in an interview with the media, "Putting heterogeneous robots from different manufacturers in the same space allows AI to realize that it lives in a diverse physical world, thereby establishing an objective perception and developing the ability to distinguish right from wrong."
The third path is to "bypass differences" altogether and look for broader, more general data. Unlike data collected by joint sensors and other onboard hardware, human video data is relatively general for robots: body postures can be extracted from video and mapped to robot motion trajectories, sidestepping the embodiment barrier when training the large model.
Visual Action Capture Project at Beijing Humanoid Robot Data Training Center
A more radical option abandons the physical body entirely for the simulation world. In a virtual environment, physics engines and programmatic simulation can generate large amounts of data at low cost, which is then applied to real machines — Sim2Real. However, the extreme complexity of the physical world fundamentally limits how closely simulated data can approach the ideal in accuracy and generalization.
"We hope to find a balance between reality and simulation and take advantage of both." The CEO of Miter Technology described its Real2Sim2Real data collection model, which places "Human Doing Video" in front of the virtual environment as a specimen and paradigm for robot learning: "We perform 3D reconstruction on 2D video of human operations from the real world, restore the 3D posture of the human body in simulation, and retarget that 3D posture onto the robot — hence Real2Sim2Real."
Reportedly, with this method Miter Technology aims to cut the cost of a single data record from the tens of yuan of today's real-machine data to a few cents, and to distribute low-cost collection devices rapidly across industries to gather data at scale.
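The pipeline described in the quote can be expressed as two composed stages. Both are stubbed with placeholder math here; the function names are ours, not Miter Technology's actual API, and real implementations of each stage are substantial systems in their own right:

```python
# Hedged sketch of the Real2Sim2Real flow as the article describes it:
# 2D human video -> reconstructed 3D pose -> robot trajectory.
# Both stages below are placeholder stand-ins, not real algorithms.
def reconstruct_3d(frames_2d):
    # Stand-in for monocular 3D pose reconstruction: lift each
    # 2D keypoint (x, y) to (x, y, z) using a dummy fixed depth.
    return [[(x, y, 0.5) for (x, y) in frame] for frame in frames_2d]

def retarget_to_robot(poses_3d):
    # Stand-in for kinematic retargeting: map each frame's 3D
    # keypoints to a flat joint-command vector.
    return [[c for kp in frame for c in kp] for frame in poses_3d]

# Two video frames, two tracked keypoints each (invented values).
video = [[(0.1, 0.2), (0.4, 0.2)], [(0.1, 0.25), (0.4, 0.2)]]
trajectory = retarget_to_robot(reconstruct_3d(video))
print(len(trajectory), len(trajectory[0]))  # frames, values per frame
```

The structural point survives the stubbing: cheap 2D video enters on the left, and robot-executable trajectories exit on the right, with all the hard (and error-prone) inference concentrated in the two middle stages.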
III. Optimize while "Working"
Although technical paths such as blending the virtual and the real are still being explored, one thing is certain: real-machine data, whatever its eventual proportion, is the "last mile" for aligning robots with the physical world. The core proposition for data training centers, then, is not only scale but the precise production of high-quality data that meets current industrial application needs.
In Wuxi, this logic is being concretized.
The Jiangsu Provincial Embodied Intelligent Robot Industrial Data Collection and Training Center, led by Tianqi Automation Engineering Co., Ltd., has abandoned the "showroom" model and faithfully recreated seven major training scenarios, including automobile manufacturing lines, new-energy production line applications, and industrial logistics handling.
"Automobile assembly is Tianqi's traditional business. We have a large customer base and a deep understanding of automobile production line scenarios." Tong Suibing, chief algorithm scientist at Tianqi, said there is strong demand for robots to replace humans in the automobile painting process.
Jiangsu Provincial Embodied Intelligent Robot Industrial Data Collection and Training Center
In automobile manufacturing, painting is one of the core processes. After the electrophoretic primer is applied to the body, the top coat follows, and the uniformity and integrity of the paint surface directly affect vehicle quality. Traditionally, quality inspection here depends heavily on the human eye, yet the painting workshop is filled with volatile chemicals that pose health risks over long-term exposure. Having robots perform automated inspection and defect identification in such an environment not only frees workers from harmful exposure but also makes more stable, traceable quality inspection possible.
Tong Suibing argues that for embodied intelligent robots, the more reasonable path to deployment is not one general-purpose robot for every industry and job, but robots designed to fit specific needs.
On this basis, the Jiangsu Provincial Embodied Intelligent Robot Industrial Data Collection and Training Center has built a closed loop of "scene-data-model-application": focus on existing business scenarios, precisely collect robot data within them, use that data to train a self-developed embodied-intelligence model, deploy the trained model back into the corresponding production environment, and finally verify and iterate in the real scenario.
The real scenario is not only a "touchstone" for the effectiveness of data and large - scale models but also a potential source of high - quality data.
At CES 2026, Reeman staged a transoceanic real-time teleoperation demonstration from Beijing to Las Vegas. Over a remote labor network it built, embodied trainers in Beijing remotely controlled the RealBOT wheeled folding robot at the CES booth to perform real-world tasks such as "delivering items" and "passing fruit".
This not only addresses labor demand in specific scenarios; more importantly, it lets robots accumulate data directly in real operations. Every remote operation synchronously generates data on environmental interaction, human decision-making, and task results — "working is data collection". In the future, data factories may not need to fully replicate scenarios; they can connect directly to production lines and service terminals worldwide and let data accumulate naturally in real operations.