
Embodied AI startup secures tens of millions of yuan in funding by extracting multimodal embodied data from internet videos, cutting data collection costs to 0.5% of the industry average | Exclusive from Yingke

Huang Nan | 2025-10-11 09:00
A video-based embodied data collection solution verified on third-party embodied models.

Author | Huang Nan

Editor | Yuan Silai

Yingke has learned that Shutu Technology (Shenzhen) Co., Ltd. (hereinafter "Shutu Technology") recently closed an angel round worth tens of millions of yuan, jointly led by Orient Fortune Capital and Jiangu Capital. The proceeds will mainly fund continued training and iteration of the company's video-based embodied data collection pipeline, accelerating commercial data delivery to several leading embodied intelligence companies.

Shutu Technology is a company Yingke has followed for a long time. Founded in 2024, it focuses on the R&D and application of multimodal embodied intelligence data collection and model technology. By integrating vision, language, and environment-interaction systems, it is building a general embodied data platform adaptable to open scenarios, with the aim of driving large-scale adoption of embodied intelligence in fields such as logistics, manufacturing, and services.

As embodied intelligence enters the application stage, the scale, quality, and diversity of training data have become the core bottlenecks limiting further gains in model performance.

End-to-end imitation learning shows strong task-fitting ability in structured scenarios, but it depends on large amounts of high-quality demonstration data and suffers from causal confusion and weak generalization, making it hard to adapt to dynamic, open environments. Teleoperation-based data collection, meanwhile, captures direct human operation signals but is constrained by expensive hardware, low operating efficiency, and narrow scene coverage, so costs stay high and large-scale data production remains difficult.

Against this backdrop, relying solely on limited closed datasets or costly simulation platforms can no longer satisfy next-generation embodied intelligence systems' demand for data with more modalities, longer time horizons, and richer interaction. The industry urgently needs a scalable, low-cost, highly realistic data source to break through the current ceilings on model generalization, adaptability, and reasoning.

Training humanoid robot actions from online videos (Source: the company)

To address this shared pain point, several leading companies in the industry have turned to internet videos as a data source: they are cheap to obtain and produce, and because they are captured in the real physical world, they embed high-quality, high-dimensional information such as physical parameters and the natural laws of the objective world.

In early August this year, Elon Musk revealed on X that Tesla Optimus is gradually abandoning the teleoperation route and is expected to learn new skills autonomously from YouTube videos within the next few years. Figure likewise announced in mid-September that its Helix model had been trained entirely on human first-person-view videos, understands natural-language instructions, and navigates autonomously in real, cluttered environments.

Unlike companies whose video-processing innovations serve only their own models, Shutu Technology has independently developed SynaData, a data pipeline solution that extracts multimodal embodied data from videos and serves third-party embodied models.

SynaData data pipeline solution (Source: the company)

By collecting large volumes of RGB video from the internet and achieving technical breakthroughs in areas such as video data dimension enhancement and cross-domain retargeting, Shutu's SynaData pipeline transforms video into multimodal, high-precision embodied training data, providing a sustainable source of high-quality data for large-scale embodied intelligence training and cutting overall data collection costs to 0.5% of the industry average.

For example, for the task of "picking up a takeaway bag," the SynaData system can batch-extract multimodal embodied data from everyday videos of ordinary people picking up bags, including hand motion trajectories, object movement paths, and 3D surface meshes of the objects, and feed it directly into the training of robot grasping models. Test results show that a model trained on this dataset raised its success rate at grasping takeaway bags to 88%, demonstrating strong scene generalization.
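To make the shape of such data concrete, here is a minimal sketch of one possible record layout for a multimodal embodied sample, written in Python. The field names, array shapes, and units are illustrative assumptions, not SynaData's actual schema.

```python
# A hypothetical record layout for one multimodal embodied training sample
# of the kind described above. Field names, shapes, and units are
# illustrative assumptions, not SynaData's actual schema.
from dataclasses import dataclass

import numpy as np


@dataclass
class EmbodiedSample:
    video_id: str                 # source internet video clip
    task_label: str               # e.g. "pick up takeaway bag"
    hand_trajectory: np.ndarray   # (T, 21, 3) hand keypoints per frame, meters
    object_trajectory: np.ndarray # (T, 7) object pose per frame: xyz + quaternion
    mesh_vertices: np.ndarray     # (V, 3) reconstructed object surface mesh
    mesh_faces: np.ndarray        # (F, 3) triangle vertex indices
    fps: float = 30.0

    def duration_s(self) -> float:
        """Clip length in seconds, derived from frame count and fps."""
        return self.hand_trajectory.shape[0] / self.fps


# A grasping model would consume batches of such samples, with the hand
# trajectory serving as action supervision. Dummy arrays stand in for data:
sample = EmbodiedSample(
    video_id="clip_0001",
    task_label="pick up takeaway bag",
    hand_trajectory=np.zeros((90, 21, 3)),
    object_trajectory=np.zeros((90, 7)),
    mesh_vertices=np.zeros((500, 3)),
    mesh_faces=np.zeros((996, 3), dtype=int),
)
print(sample.duration_s())  # 3.0 seconds of video at 30 fps
```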

Turning internet videos of people carrying bags into training data for robot bag-carrying (Source: the company)

The SynaData system has now completed full-pipeline technology verification, processing thousands of hours of video across diverse indoor and outdoor environments and producing standardized datasets covering more than a hundred task types, including grasping, placing, and fine assembly. Part of the data has already been applied to mainstream open-source vision-language-action models such as Tsinghua's RDT, Physical Intelligence's π0, Zhiyuan's UniVLA, and EquiBot.

Given current bottlenecks in the accuracy, generalization, and standardization of video data, Shutu Technology is upgrading the system along three lines: accuracy improvement, generalization expansion, and ecosystem co-construction. On accuracy, to address the insufficient capture of fine-grained actions in complex interaction scenarios, the company will apply dynamic occlusion modeling and multi-view reconstruction to push trajectory and pose reconstruction accuracy from the millimeter level to within 2 millimeters, providing data support for fine-manipulation tasks.
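Multi-view reconstruction of this kind generally builds on standard triangulation. The sketch below shows the generic linear (DLT) method for recovering a 3D point from two or more calibrated views; it is textbook computer-vision math offered for orientation, not Shutu's proprietary pipeline.

```python
# Generic linear (DLT) triangulation: recover a 3D point from its 2D
# observations in two or more calibrated camera views. This is standard
# computer-vision math, not Shutu's proprietary method.
import numpy as np


def triangulate_point(projections, pixels):
    """projections: list of 3x4 camera projection matrices P_i
    pixels: list of (u, v) observations of the same physical point
    Returns the 3D point minimizing the linear reprojection residual."""
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous
        # point X: u * (P[2] @ X) = P[0] @ X and v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize


# Example: two axis-aligned pinhole cameras one unit apart observe one point.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
point = np.array([0.5, 0.2, 4.0, 1.0])
px1 = (P1 @ point)[:2] / (P1 @ point)[2]
px2 = (P2 @ point)[:2] / (P2 @ point)[2]
print(triangulate_point([P1, P2], [px1, px2]))  # ~ [0.5, 0.2, 4.0]
```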

On generalization, to handle differences in structure, degrees of freedom, and control schemes across robot bodies, the company plans to expand the number of supported embodiments to more than 100, covering the full hardware spectrum from humanoid robots and dexterous hands to various mobile chassis; a simplified sketch of the cross-embodiment retargeting idea follows below.
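Cross-embodiment retargeting can take many forms. As one toy illustration, the hypothetical helper below rescales an end-effector trajectory extracted from a human video into a target robot's reachable workspace; real retargeting must also handle kinematics, degrees of freedom, and joint limits, so this sketches only the workspace-mapping step.

```python
# A toy, hypothetical sketch of one step in cross-embodiment retargeting:
# mapping an end-effector trajectory from a human's workspace into a target
# robot's smaller reachable workspace. Real systems must also account for
# kinematic structure, degrees of freedom, and joint limits.
import numpy as np


def retarget_trajectory(human_traj, human_workspace, robot_workspace):
    """Map (T, 3) end-effector positions from the human workspace box
    into the robot workspace box, axis by axis.
    Each workspace is given as (min_xyz, max_xyz)."""
    h_min, h_max = (np.asarray(b, dtype=float) for b in human_workspace)
    r_min, r_max = (np.asarray(b, dtype=float) for b in robot_workspace)
    # Normalize to [0, 1] within the human workspace, then rescale.
    normalized = (human_traj - h_min) / (h_max - h_min)
    return r_min + normalized * (r_max - r_min)


# A human reach of roughly 0.8 m mapped onto a smaller tabletop arm:
human_traj = np.array([[0.0, 0.0, 0.0], [0.4, 0.1, 0.3], [0.8, 0.2, 0.6]])
robot_traj = retarget_trajectory(
    human_traj,
    human_workspace=([0.0, -0.4, 0.0], [0.8, 0.4, 0.8]),
    robot_workspace=([0.1, -0.25, 0.05], [0.5, 0.25, 0.45]),
)
print(robot_traj)
```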

SynaData embodied data extraction (Source: the company)

On ecosystem construction, Shutu Technology expects to release the industry's first open-source embodied dataset built from real-scene videos in the fourth quarter of 2025, aiming to connect the full chain of data production, simulation training, and system deployment, and to build next-generation embodied intelligence data infrastructure together with leading simulation-environment partners.

Lin Xiao, CTO of Shutu Technology, told Yingke that data sets the ceiling and models merely approach it. SynaData, he said, will unlock the "data treasure trove" of massive internet video, helping embodied robots move from "hands-on teaching" to "learning by watching": efficiently acquiring interaction data from the physical world, breaking through capability ceilings, and providing core data support for robots to enter a wide range of industries.