Modeled after NVIDIA's EgoScale data path, Xingyi Technology, incubated at Tsinghua University, has closed its first round of financing.
Text | Ren Qian
The global competition over the embodied data layer is heating up rapidly. NVIDIA Research released the EgoScale data and training framework in 2026, training a VLA model on ego-centric videos of human operation: using 20,854 hours of first-person human video with action annotations, it observed a near-log-linear scaling law between data scale and validation loss. 1X collects first-person human and household behavior data, gathering millions of hours of household-scenario video through its Sunday project. Guanglun Intelligence takes a hybrid approach of simulated synthetic data and human video data (EgoSuite), claims to have cumulatively delivered over one million hours of data, and has seen its valuation soar to billions of US dollars.
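To make the phrase "near-log-linear scaling law" concrete, one schematic way to write such a relationship is shown below. The functional form and the constants are illustrative assumptions for exposition, not EgoScale's published fit.

```latex
% Schematic only: validation loss falls roughly linearly in the log of data scale.
% a (offset) and b (slope) would be fitted constants; D is hours of ego-centric data.
\mathcal{L}_{\mathrm{val}}(D) \;\approx\; a - b \log D
```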
Within just a few months, the industry's focus has shifted from "who can collect more data" to "who can truly turn human-centric/ego-centric data into high-freedom, high-precision, low-cost, trainable assets."

Behind this is a clear shift in the data paradigm. Over the past year, leading global players have almost simultaneously turned to human-centric data: not just larger volumes of third-person material, nor only expensive and scarce real-robot teleoperation, but data closer to the true distribution of human operation. Among these, ego-centric data, built around the first-person human perspective, real physical interaction, and multi-modal perception, is rapidly becoming the most important collection route.

The reason is that what robots ultimately need to learn is not just to understand the world, but to act correctly in the real physical world. Third-person video lacks the details of contact and control, simulation cannot cover the long tail of real-world physics, and pure teleoperation data is expensive and scarce. What is truly scarce is data that is real enough, detailed enough, and can be produced at scale and digested directly by models. At this inflection point, a company that chose to attack this hard problem through multi-modal fusion and wearable high-precision data collection is starting to emerge.
According to exclusive information obtained by "AnYong Waves", Xingyi Technology, a startup focused on ego-centric data collection, has completed a first round of financing worth tens of millions of yuan. The round was led by Shuimu Venture Capital of Tsinghua University. Quanshi Capital, the incubator, has long provided industrial and capital support to the company and joined this round. Keyzhuo Capital under Shenzhoutongyu and a group of senior industry angels also followed on. Maple Pledge Capital has long served as the company's private financing advisor.

Xingyi Technology was incubated out of the Department of Computer Science at Tsinghua University. Its founder, Song Zhiheng, previously served as product manager for the full-size bipedal humanoid robot at Zhiyuan Robotics, where he was responsible for building the data collection and teleoperation systems. Before that, he was one of the first 20 employees at Megag Robotics, where he founded the innovation application department and served as product manager. He led R&D teams through five from-scratch product developments, spanning dual-arm collaborative robots to desktop-level intelligent devices, and delivered the company's first ten-thousand-unit mass production and over 100 million yuan in revenue.
If human-centric/ego-centric data is becoming the new foundation of embodied intelligence, what makes Xingyi stand out is not merely that it bet on the right direction, but that it has gathered the hardest links of that direction inside one organization. Its core members cover embodied data, models, wearable devices, complex systems, and data engineering, forming a capability structure that integrates data, model, product, and commercialization.

The technical team comes from universities such as Tsinghua University and Beihang University and includes senior industry experts from companies such as AFT and Hikvision. Members have long-term research backgrounds in embodied intelligence, multi-modal perception, 3D hand understanding, virtual reality, human-computer interaction, and computer vision; they have published over 70 papers at top international venues such as CVPR, ICCV, ECCV, NeurIPS, and IJCAI and have undertaken several national-level research projects.

Following NVIDIA's EgoScale technical path, Xingyi has built a hardware and software system for embodied-intelligence and world-model data collection. Its differentiation: it does not follow the two-finger-gripper UMI path, but achieves high precision on top of high freedom; it collects not only vision, but fuses vision, touch, and posture; and it offers not just tools, but a complete closed loop from data collection to training.

Song Zhiheng believes that truly valuable real-robot data is not about who collects more, but about who can satisfy five conditions at once: real, precise, high-freedom, low-cost, and trainable. In his view, Xingyi's strongest current advantages are concentrated in precision and freedom, while low cost and trainability will determine whether this path can truly scale.

Not long ago, "AnYong Waves" met Song Zhiheng, along with Xingyi's self-developed multi-modal wearable collection device, in Zhongguancun, Beijing. He talked with us about the fundamental differences between data collection routes, the difficulty of millimeter-level posture annotation, and the long journey from data provider to interface to the physical world.
The following is the dialogue —
Part 01
From collecting more to collecting accurately
"AnYong": There are many companies doing data collection, and some have much larger financing volumes than yours. What is the positioning of Xingyi Technology?
Song Zhiheng: We are the physical data infrastructure for embodied intelligence. Through our self - developed high - precision wearable devices and data engine, we transform the delicate "productivity experience" of humans into "digital nutrients" that robots can learn from.
The core is just one thing: to enable robots to have the ability to perform fine operations in the real and complex world. It's not about making robots dance, but about making them hold a scalpel as steadily as a surgeon.
"AnYong": Why did you choose to start a business at this moment and from the data aspect? What did you see at Zhiyuan?
Song Zhiheng: I served as the product manager of the full - size bipedal humanoid robot at Zhiyuan and was also responsible for data collection and teleoperation. We could clearly see that the most common scenarios in the industry were still exhibition halls, commercial performances, scientific research, and data collection, and it was difficult to form a replicable productivity closed - loop.
The core bottleneck is the lack of high - quality real data: the model lacks both an effective representation of the physical world and transferable operational priors, and what we do is to fill this gap. From an external perspective, we are following the EgoScale path; from an internal perspective, this was a judgment formed early on: what embodied intelligence ultimately lacks is not just models and bodies, but the establishment of the most efficient data path. NVIDIA's public promotion of this path also indicates that it is becoming an industry consensus.
"AnYong": Why is EgoScale so popular? Why are all embodied body companies actively paying attention to this technical path? What is so special about the EgoScale framework, and what are its breakthroughs?
Song Zhiheng: The reason why EgoScale has quickly become popular is that it has verified a very attractive path: through ultra - large - scale first - person human data, it can achieve efficient transfer from human behavior to robot operation ability. This is very important for embodied intelligence because in the past, robot training has long been limited by the high cost, slow collection, and limited coverage of real - machine data, and it has been difficult to truly scale up.
The breakthrough of EgoScale lies in that it doesn't simply pile up data, but builds a more systematic training framework. Through phased training, it first learns general behavioral priors from a large amount of first - person human operation data, and then further transfers to the robot action space, significantly improving the success rate of robots in dexterous operation tasks. Such a design gives it the opportunity to break through the limitations of the traditional "small - sample, heavy - teleoperation, and strong - dependence on body data" approach.
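As a rough sketch of what such phased training can look like in code, here is a minimal two-stage loop under stated assumptions: a shared backbone with swappable action heads, random tensors standing in for real datasets, and all dimensions chosen arbitrarily. None of this is EgoScale's actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a shared backbone with swappable action heads.
# All dimensions are illustrative assumptions.
backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
human_head = nn.Linear(256, 48)   # e.g. 3D keypoints of a human hand
robot_head = nn.Linear(256, 14)   # e.g. a 14-DoF robot action vector

def train_stage(head, batches, lr):
    params = list(backbone.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for obs, target in batches:
        loss = nn.functional.mse_loss(head(backbone(obs)), target)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 1: behavioral priors from abundant first-person human data.
human_batches = [(torch.randn(64, 512), torch.randn(64, 48)) for _ in range(100)]
train_stage(human_head, human_batches, lr=1e-4)

# Stage 2: transfer to the robot action space on scarce real-robot data,
# reusing the backbone and fine-tuning at a lower learning rate.
robot_batches = [(torch.randn(16, 512), torch.randn(16, 14)) for _ in range(10)]
train_stage(robot_head, robot_batches, lr=1e-5)
```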
More importantly, this path naturally meets the embodied industry's most central needs: on one hand, human data is far easier to obtain at scale than robot data; on the other, the framework has strong generalization potential across robot bodies of different forms and degrees of freedom. For body companies, whoever obtains transferable, scalable, reusable data and training paradigms more efficiently has the better chance of leading the next stage of the capability race. That is why the whole industry is watching EgoScale so closely.
"AnYong": Is there a difference between you and EgoScale? Where is it?
Song Zhiheng: Yes, we have more modalities. Touch is essential for fine operations. At the same time, we have higher compatibility with scenarios. We are not limited to the laboratory. It's EgoScale in the wild, with almost no constraints on scenarios. We can directly wear our devices in real production scenarios for data collection, which poses higher challenges to both the algorithm and the wearing comfort.
Xingyi EgoKit multi-modal data collection kit and Xingyi HBR Engine data engine | Image source: provided by the company
"AnYong": How to understand "world - class"?
Song Zhiheng: What determines the upper limit is not just the model parameters, but the quality of the teacher signal: multi - modal collection, fine - grained hand understanding, and high - precision annotation. These are the fundamentals of high - quality embodied data. Imagine if the demonstration actions themselves have jitters, offsets, and timing errors, the model will learn errors instead of abilities.
Human posture estimation is often a centimeter - level problem, while hand posture estimation often needs to reach the millimeter - level: there are more joint points, more occlusions, and more complex hand - object contacts. The technical difficulty doesn't increase linearly, but exponentially.
That's why hand understanding is one of the most difficult aspects of embodied data, a Level L4 - L5 technology. We happen to have the world's best ability in this area, while human posture estimation is Level L2. On the basis of doing this layer deeply and thoroughly, expanding upwards to the upper limbs and even the whole body is a smoother path.
"AnYong": Why do you have to do multi - modal fusion (vision + touch + posture)? Isn't pure vision enough? Can't large models already understand the world?
Song Zhiheng: It's not that the model isn't smart enough, but that it has never truly "touched" the real world. Fine operations require at least three types of information: three - dimensional vision, body posture, and touch.
Three - dimensional vision tells you where the object is, and posture tells you how the hand and arm reach there. When it comes to the moment of contact, touch often determines success or failure: whether there is contact, whether it slips, how much force to use, and when to reduce the force. Touch provides information about the contact state, friction changes, and micro - slips. It is the end of vision and the start of force control.
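To make the micro-slip point concrete, here is a minimal sketch of one common family of approaches, not Xingyi's actual pipeline: flag slip when the high-frequency energy of a tactile pressure trace spikes while the grip pressure itself varies slowly. The sample rate, window, and threshold are all assumptions.

```python
import numpy as np

def detect_micro_slip(pressure, fs=1000.0, win=0.05, k=4.0):
    """Flag candidate micro-slip bursts in a 1-D tactile pressure trace.

    pressure: samples from one taxel (arbitrary units)
    fs:       sample rate in Hz (assumed; real sensors vary widely)
    win:      sliding-window length in seconds
    k:        sigmas above baseline that count as a slip burst
    """
    n = int(win * fs)
    # Slips show up as fast vibration riding on slowly varying grip pressure,
    # so subtract a moving average to isolate the high-frequency residual.
    smooth = np.convolve(pressure, np.ones(n) / n, mode="same")
    residual = pressure - smooth
    # Short-time energy of the residual, compared against the take's baseline.
    energy = np.convolve(residual ** 2, np.ones(n) / n, mode="same")
    return energy > np.median(energy) + k * np.std(energy)

# Synthetic demo: steady grip with an injected vibration burst at 0.5 s.
t = np.arange(0.0, 1.0, 1 / 1000.0)
trace = 5.0 + 0.01 * np.random.randn(t.size)
trace[500:520] += 0.3 * np.sin(2 * np.pi * 200 * t[500:520])
print(detect_micro_slip(trace).any())   # True: the burst is flagged
```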
"AnYong": I heard that you can do gesture recognition while wearing gloves. Is this difficult? Aren't Meta and Apple also working on it?
Song Zhiheng: It's extremely difficult. Meta uses flesh - colored gloves, which essentially let the model recognize them as "thicker human hands". We can use black gloves, and the model can recognize them as hands in the feature space and accurately analyze the posture. Apple's gesture technology is very strong, but its public route still focuses on bare - hand interaction.
Why is this important? Because the most natural carrier of touch is gloves. If we can't stably complete hand understanding while wearing gloves, we can't truly integrate vision, touch, and posture. The difficulty behind this is not just the recognition itself, but that the multi - modal system needs to balance accuracy, latency, and cost simultaneously.
"AnYong": You mentioned "millimeter - level annotation". What specific accuracy can you achieve? How does the cost compare with traditional methods?
Song Zhiheng: For high - density and highly occluded tasks like hand annotation, both traditional manual annotation and general open - source algorithms have difficulty balancing accuracy and consistency. We can stably push the annotation ability of our data engine to the millimeter - level under long - sequence and strong - contact conditions, and our annotation ability is more consistent than that of human experts.
In terms of cost, for manual annotation of one second of video (30 frames) from three perspectives, even if it costs 0.1 yuan to annotate one picture, it will cost 3 yuan per second, and 180 yuan per minute. Our powerful annotation engine costs only a few hundredths of the cost of traditional manual annotation, but with higher accuracy. This is the double - flywheel of "low - cost + high - quality".
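The arithmetic above, as a quick back-of-the-envelope script; the engine ratio below encodes "a few hundredths" as an assumed 2%, since no exact figure is given:

```python
FPS = 30                 # frames per second of video
COST_PER_IMAGE = 0.10    # yuan per manually annotated frame
VIEWS = 3                # simultaneous camera perspectives

manual_per_sec = FPS * COST_PER_IMAGE             # 3.0 yuan/s, single view
manual_per_min = manual_per_sec * 60              # 180 yuan/min, single view
manual_all_views = manual_per_min * VIEWS         # 540 yuan/min across 3 views

ENGINE_RATIO = 0.02      # assumed: "a few hundredths" of manual cost
engine_per_min = manual_all_views * ENGINE_RATIO  # ~10.8 yuan/min

print(manual_per_sec, manual_per_min, manual_all_views, engine_per_min)
```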
"AnYong": Why don't you do simulated data? Isn't NVIDIA also promoting the transfer from simulation to reality?
Song Zhiheng: Simulation is very valuable for pre - training, strategy search, and parallel trial - and - error. However, once it comes to complex contacts in the real world, the sim - to - real gap is still significant.
For example, accurately inserting a flexible cable that can bend, rebound, and slip like a noodle into a millimeter - level interface and completing the buckling at once involves contact, deformation, friction, occlusion, and continuous feedback correction, which is difficult to fully reproduce in simulation. NVIDIA's promotion of sim - to - real is definitely in the right direction, but in essence, it's not about "replacing the real with simulation", but making the simulation closer to the real, which still requires a large amount of real data for continuous alignment and calibration.
We believe that truly valuable real - machine data needs to meet five conditions simultaneously: real (physical interaction), precise (fine operation), high - freedom (generalization), low - cost (scalability), and trainable (standardized processing). All five conditions are indispensable, and simulated data fails at the "real" level.
"AnYong": What is your specific data collection process? How do you ensure low cost?
Song Zhiheng: Traditional real - machine teleoperation requires renting venues, buying equipment, and hiring people, which is extremely costly.
We have a streaming process: data collectors or workers wear our wearable kits and operate in real production lines or scenarios. The data engine captures vision, touch, position, and trajectory in real - time and aligns them at the millisecond level to form multi - modal training data that can be further tensorized. Then, our offline toolchain will automatically perform "millimeter - level annotation", filter out invalid noise, and form high - quality data that can be directly used for embodied model training.
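A minimal sketch of what "millisecond-level alignment" can mean in practice, assuming each sensor stream arrives as timestamped samples on a shared clock; the stream names, rates, and common grid rate are illustrative, not Xingyi's actual format:

```python
import numpy as np

def align_streams(streams, rate_hz=100.0):
    """Resample timestamped 1-D sensor streams onto one common clock.

    streams: dict name -> (timestamps_sec, values), timestamps ascending.
    Returns (aligned dict, grid) so every tick has a sample from every
    modality at (nearly) the same instant.
    """
    start = max(ts[0] for ts, _ in streams.values())
    end = min(ts[-1] for ts, _ in streams.values())
    grid = np.arange(start, end, 1.0 / rate_hz)   # common 10 ms ticks
    aligned = {name: np.interp(grid, ts, vals)
               for name, (ts, vals) in streams.items()}
    return aligned, grid

# Illustrative streams at different native rates on one shared clock.
t_touch = np.arange(0.0, 2.0, 0.001)   # 1 kHz tactile
t_pose = np.arange(0.0, 2.0, 0.005)    # 200 Hz posture
aligned, grid = align_streams({
    "touch": (t_touch, np.sin(t_touch)),
    "pose": (t_pose, np.cos(t_pose)),
})
print(grid.shape, aligned["touch"].shape, aligned["pose"].shape)
```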
"AnYong": The real environment is uncontrollable. How do you ensure data quality and security? Will the data be open - sourced?
Song Zhiheng: We have an embedded "quality audit engine" that automatically eliminates actions with jitters, frame drops, and illogical logic. Regarding open - sourcing, Xingyi has a clear plan: we will gradually open - source 1000 to 10000 hours of high - precision datasets this year. We believe that the prosperity of embodied intelligence cannot rely on "isolationism", and we need to promote the industry to jointly build the foundation.
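As an illustration of the kind of checks such an audit engine might run, the sketch below flags dropped frames via timestamp gaps and jitter via acceleration outliers. The rules and thresholds are assumptions for exposition, not Xingyi's actual criteria.

```python
import numpy as np

def audit_take(timestamps, positions, fps=30.0, jerk_sigma=6.0):
    """Return (ok, reasons) for one recorded take.

    timestamps: frame times in seconds, shape (N,)
    positions:  per-frame 3D wrist position, shape (N, 3)
    Both thresholds are illustrative.
    """
    reasons = []
    # Dropped frames: an inter-frame gap much larger than the nominal period.
    gaps = np.diff(timestamps)
    if np.any(gaps > 1.5 / fps):
        reasons.append("frame drops")
    # Jitter: second-difference (acceleration-like) magnitudes far outside
    # the take's own distribution.
    accel = np.linalg.norm(np.diff(positions, n=2, axis=0), axis=1)
    if accel.size and np.any(accel > accel.mean() + jerk_sigma * accel.std()):
        reasons.append("jitter / outlier motion")
    return (not reasons), reasons

# Demo: a smooth-ish random walk should pass both checks.
ts = np.arange(0.0, 3.0, 1 / 30.0)
pos = np.cumsum(0.01 * np.random.randn(ts.size, 3), axis=0)
print(audit_take(ts, pos))
```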
"AnYong": You mentioned two "pyramids" — one is the pyramid of robot capabilities, and the other is the pyramid of data. What do they mean respectively? Which layer does Xingyi Technology target?
Song Zhiheng: We do use two "pyramids" internally to understand embodied intelligence.
The first is the capability pyramid: from the bottom up, the body is the base, above it sits motion intelligence, and above that cognitive intelligence. Cognitive intelligence divides further into interaction intelligence and operation intelligence: the former solves understanding and expressing, the latter solves performing goal-oriented, constrained operation tasks in the real physical world. What truly determines the ceiling of an embodied system is the operation-intelligence layer.

The second is the data pyramid. The bottom layer is Internet data, the largest in scale, providing semantic and common-sense priors. Above it sits simulation/synthetic data, suited to pre-training, policy search, and parallel trial-and-error. Above that sits multi-modal real data, represented by first-person human data.