
An embodied AI company raises tens of millions of yuan by extracting multimodal embodied data from internet videos and cutting data extraction costs to 0.5% of the industry average | Hard Kr Exclusive

Huang Nan | 2025-10-11 09:00

A video-based data collection pipeline that serves third-party embodied models.

Autor | Huang Nan

Redakteur | Yuan Silai

Hard Kr has learned that Shutu Technology (Shenzhen) Co., Ltd. (hereinafter "Shutu Technology") recently closed an angel round worth tens of millions of yuan, jointly led by Orient Fortune Capital and Jiangu Capital. The proceeds will mainly fund continued training and iteration of the company's video-based embodied data collection pipeline and accelerate commercial data delivery to several leading embodied intelligence companies.

"Shutu Technology" is an enterprise that Hard Kr has been following for a long time. The company was founded in 2024 and focuses on the research, development and application of multi-modal embodied intelligence data collection and model technology. By integrating vision, language and environmental interaction systems, it builds a general embodied data platform that can adapt to open scenarios to promote the large-scale implementation of embodied intelligence technology in fields such as logistics, manufacturing and services.

Currently, as embodied intelligence moves into the application stage, the scale, quality, and diversity of training data have become the core bottleneck limiting model performance.

End-to-end imitation learning shows strong task-fitting ability in structured scenarios, but it depends on large amounts of high-quality demonstration data and suffers from causal confusion and weak generalization, making it hard to adapt to dynamic, open environments. Teleoperation-based data collection, for its part, yields direct human operation signals but is constrained by expensive hardware, low operating efficiency, and narrow scene coverage, so large-scale data production remains costly and difficult.

Against this backdrop, limited-scale closed datasets and high-cost simulation platforms alone can no longer satisfy next-generation embodied intelligence systems' demand for more modalities, longer time horizons, and richer interaction data. The industry urgently needs a scalable, low-cost, highly realistic data source to break through current ceilings in model generalization, adaptability, and reasoning.

Training humanoid robot actions using online videos (Source: the company)

In response to this shared pain point, several leading companies in the industry have turned to internet video as a data source: internet videos are cheap to acquire and produce, and because they come from the real physical world, they embed high-quality, high-dimensional information such as the physical parameters and natural laws of the objective world.

In early August this year, Musk revealed on X that Tesla's Optimus is gradually abandoning the teleoperation route and is expected to learn new skills independently from YouTube videos within the next few years; in mid-September, Figure announced that its Helix had been trained entirely on human first-person videos, understanding natural-language instructions and navigating autonomously in real, cluttered environments.

Unlike other companies, which process video data in innovative ways but only for their own models, "Shutu Technology" has independently developed the SynaData data pipeline, which extracts multimodal embodied data from videos and serves third-party embodied models.

The SynaData data pipeline (Source: the company)

By collecting large volumes of RGB video from the internet and making technical breakthroughs such as video data dimension elevation and cross-domain retargeting, SynaData converts videos into multimodal, high-precision embodied training data, providing a sustainable source of high-quality data for large-scale training of embodied intelligence and cutting comprehensive data collection costs to 0.5% of the industry average.
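To make the described flow concrete, here is a minimal sketch of the two stages the article names: "dimension elevation" (lifting 2D video into 3D) followed by cross-domain retargeting. All names here (VideoClip, EmbodiedSample, lift_to_3d, retarget) are hypothetical placeholders rather than Shutu's actual API, and the learned models are stubbed with dummy outputs.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VideoClip:
    frames: np.ndarray  # (T, H, W, 3) RGB frames scraped from the internet
    fps: float

@dataclass
class EmbodiedSample:
    hand_trajectory: np.ndarray    # (T, 3) 3D wrist positions per frame
    object_trajectory: np.ndarray  # (T, 3) 3D object centroid per frame
    object_mesh: np.ndarray        # (V, 3) reconstructed surface vertices

def lift_to_3d(clip: VideoClip) -> EmbodiedSample:
    """'Dimension elevation': lift 2D video into 3D trajectories and meshes.
    Stubbed with random data; in practice a reconstruction model runs here."""
    t = clip.frames.shape[0]
    return EmbodiedSample(
        hand_trajectory=np.random.rand(t, 3),
        object_trajectory=np.random.rand(t, 3),
        object_mesh=np.random.rand(512, 3),
    )

def retarget(sample: EmbodiedSample, embodiment: str) -> EmbodiedSample:
    """'Cross-domain retargeting': map human motion onto a robot body.
    Identity here; a real system solves kinematics per embodiment."""
    return sample

clip = VideoClip(frames=np.zeros((120, 224, 224, 3), dtype=np.uint8), fps=30.0)
data = retarget(lift_to_3d(clip), embodiment="humanoid")
print(data.hand_trajectory.shape)  # (120, 3): one wrist position per frame
```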

Take the task of "picking up a takeaway bag" as an example: from ordinary everyday videos of people picking up bags, SynaData can batch-extract multimodal embodied data, including hand motion trajectories, object motion paths, and 3D object surface meshes, and feed it directly into the training of robot grasping models. In tests, a model trained on this dataset raised its success rate at grasping takeaway bags to 88%, showing strong scene generalization.
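One plausible way such a batch-extracted sample could be serialized for model training is sketched below; the field names, file names, and instruction string are illustrative assumptions, not Shutu's published schema.

```python
import json

# Hypothetical training record tying together the modalities the article
# lists for the "pick up a takeaway bag" task.
record = {
    "task": "pick_up_takeaway_bag",
    "instruction": "pick up the takeaway bag from the counter",  # invented example
    "source": "internet_video",                # ordinary daily footage
    "modalities": {
        "hand_trajectory": "hand_traj.npy",    # (T, 3) wrist positions
        "object_trajectory": "bag_traj.npy",   # (T, 3) bag centroid path
        "object_mesh": "bag_mesh.obj",         # 3D surface mesh of the bag
    },
    "fps": 30,
    "num_frames": 120,
}
print(json.dumps(record, indent=2))
```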

Converting an internet video of bag carrying into data for training a robot to carry a bag (Source: the company)

To date, the SynaData system has completed full-pipeline technical verification, cumulatively processed thousands of hours of video covering a wide range of indoor and outdoor environments, and produced standardized datasets spanning more than 100 task types such as grasping, placing, and fine assembly. Part of the data has already been applied in mainstream open-source vision-language-action models such as Tsinghua's RDT, PI's π0, Zhiyuan's UniVLA, and EquiBot.

Given video data's current bottlenecks in accuracy, generalization, and standardization, "Shutu Technology" is upgrading the system along three directions: higher accuracy, broader generalization, and ecosystem co-construction. On accuracy, to address the insufficient capture of fine-grained motion in complex interaction scenarios, the company will use dynamic occlusion modeling and multi-view reconstruction to push trajectory and pose reconstruction accuracy from the centimeter level to within 2 millimeters, supplying data for fine manipulation tasks.
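For readers unfamiliar with multi-view reconstruction, the sketch below shows its textbook core: two-view linear triangulation (DLT), which recovers a 3D point from its pixel observations in two calibrated cameras. The camera matrices are made up for illustration, and the dynamic occlusion modeling the article mentions is not shown.

```python
import numpy as np

def triangulate(P1: np.ndarray, P2: np.ndarray,
                x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """DLT triangulation: P1, P2 are (3, 4) projection matrices,
    x1, x2 are (2,) pixel coordinates of the same 3D point."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)   # null vector of A is the homogeneous point
    X = vt[-1]
    return X[:3] / X[3]           # dehomogenize

# Two toy cameras: identity view and one shifted 1 unit along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 4.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))  # ~[0.3, -0.2, 4.0]
```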

On generalization, to handle differences in structure, degrees of freedom, and control schemes across robot bodies, the company plans to expand the number of supported embodiments to more than 100, covering the full hardware spectrum from humanoid robots and dexterous hands to various mobile chassis, as illustrated by the sketch below.
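The following toy example illustrates one sub-problem cross-embodiment retargeting must solve: mapping a human wrist trajectory into the reachable workspace of each target body. The embodiment names and workspace bounds are invented for illustration; a real system would additionally solve per-robot kinematics and control interfaces.

```python
import numpy as np

# Hypothetical reachable-workspace bounds (meters) per embodiment.
WORKSPACES = {
    "humanoid_arm":   (np.array([-0.6, -0.6, 0.0]), np.array([0.6, 0.6, 1.2])),
    "tabletop_arm":   (np.array([-0.4, -0.4, 0.0]), np.array([0.4, 0.4, 0.5])),
    "mobile_chassis": (np.array([-2.0, -2.0, 0.0]), np.array([2.0, 2.0, 0.0])),
}

def retarget_trajectory(human_traj: np.ndarray, embodiment: str) -> np.ndarray:
    """Affinely rescale a (T, 3) human trajectory into the target workspace."""
    lo, hi = WORKSPACES[embodiment]
    h_lo, h_hi = human_traj.min(axis=0), human_traj.max(axis=0)
    span = np.where(h_hi > h_lo, h_hi - h_lo, 1.0)  # avoid divide-by-zero
    normalized = (human_traj - h_lo) / span          # map into [0, 1]^3
    return lo + normalized * (hi - lo)

human_traj = np.cumsum(np.random.randn(120, 3) * 0.01, axis=0)  # fake wrist path
for body in WORKSPACES:
    robot_traj = retarget_trajectory(human_traj, body)
    print(body, robot_traj.min(axis=0).round(2), robot_traj.max(axis=0).round(2))
```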

SynaData embodied data extraction (Source: the company)

On ecosystem building, "Shutu Technology" expects to release the industry's first open-source embodied dataset based on real-scene videos in the fourth quarter of 2025, aiming to connect the full chain from data production and simulation training to system deployment, and to build next-generation embodied intelligence data infrastructure jointly with leading simulation-environment partners.

Lin Xiao, CTO of "Shutu Technology", told Hard Kr that data sets the upper limit and the model only approaches it. SynaData will unlock the "data treasure trove" of the internet's vast video corpus, helping embodied robots move from being taught hand-by-hand to learning by watching, efficiently acquiring interaction data from the physical world, breaking through capability ceilings, and providing core data support for robots to enter every industry.