Former senior executive of Alibaba's robotics division starts a business, secures tens of millions in seed-round financing, and focuses on L4-level embodied intelligence applications | Exclusive report from Hard Krypton
Author | Huang Nan
Editor | Yuan Silai
Hard Krypton has learned that Hangzhou Yingshen Intelligent Technology Co., Ltd. (hereinafter "Yingshen Intelligent") recently completed seed and seed+ financing rounds totaling tens of millions of yuan. The seed round was invested by Joyuan Asia; the seed+ round was jointly invested by Joyuan Asia and Hangzhou West Lake Science and Technology Innovation Investment. The proceeds will fund the development and training of the robot's "right brain", commercialization, and team building.
Founded in 2024, "Yingshen Intelligent" focuses on the R&D and application of embodied intelligence technology. Built on its self-developed large spatial model and industrial-scenario robots, it provides enterprises with low-cost, highly reliable, modular software-hardware solutions. It is starting with flexible processes in light industry and will gradually expand to service industries and various consumer (C-end) scenarios.
Min Wei, founder and CEO, was formerly the technical lead of Alibaba's robotics team, where he built Alibaba's local-life delivery robots from scratch and deployed them in settings such as office buildings, hospitals, and hotels. Many core team members come from Tsinghua University and bring years of experience in R&D and product work in artificial intelligence and robotics.
Two qualities are central to putting embodied intelligence into practice: advancement and generalization. Advancement means that agents must exhibit increasingly sophisticated behavior in complex physical environments, moving from simple action execution to the collaborative handling of complex tasks, with stronger perception, decision-making, planning, and execution capabilities. In industrial production, for example, a robot must not only complete repetitive assembly tasks accurately but also adapt quickly to slight variations in components or temporary changes to the production process. Generalization means that a robot can transfer skills learned in one scenario to new ones; a household cleaning robot, say, should work efficiently across different homes.
However, given the constraints of physical laws, the diversity of object properties, and the complexity of dynamic environments, robots can perform reliably in complex settings only with a deep, comprehensive, and accurate understanding of the physical world. A large physical-world model plays a key role here.
By integrating massive multimodal data, especially visual information, such a model can capture the inherent regularities and complex features of real environments, simulating motion, interaction, and environmental change. This helps robots learn the laws of the physical world, reason and decide quickly in new environments, predict the outcomes of actions, and choose the best option.
The large spatio-temporal intelligence model independently developed by "Yingshen Intelligent" builds a four-dimensional model of the real world through a Real-to-Real approach. Through large-scale unsupervised pre-training, it acquires a basic ability to understand and map the physical world.
Language is a highly condensed way of expressing information. In robotics, although language models can acquire semantic understanding from large-scale text data, there is a fundamental tension between the spatio-temporal continuity of physical actions and the compressed, discrete nature of linguistic symbols.
Min Wei pointed out that humans can fill in missing information through right-brain mechanisms such as visual perception and physical common sense, whereas a vision-language-action (VLA) model can rely only on limited vision-language alignment features for inference. This easily produces action deviations: the generated instructions drift from the actual state of the real world, undermining the accuracy and reliability of the model's output.
"This means that in the era of embodied intelligence, robots that keep using human language may be constrained by that mode of expression. When our large embodied-intelligence model is smart enough, could there be a new language that is not bound by human natural language?" Min Wei said.
Based on these considerations, "Yingshen Intelligent" models video data directly within its large spatio-temporal intelligence model, treating video itself as a language and extracting the most faithful information straight from video data to understand the real physical world, with minimal human intervention. This approach not only improves the model's accuracy and efficiency but also reduces the information loss caused by abstracting the world into natural language.
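The article does not disclose how "Yingshen Intelligent" turns video into a language. A common way such systems discretize video is vector quantization: each frame (or patch) feature is mapped to its nearest entry in a learned codebook, so a clip becomes a sequence of integer tokens that a sequence model can be pre-trained on. The sketch below is a minimal, hypothetical illustration of that lookup step; the codebook values and feature dimensions are invented.

```python
import numpy as np

def tokenize_video(frames, codebook):
    """Map each frame feature to its nearest codebook entry (vector quantization).

    frames:   (T, D) array, one flattened feature vector per frame.
    codebook: (K, D) array of learned code vectors.
    Returns a (T,) array of integer token ids: the video as a discrete
    "sentence" that a sequence model can be pre-trained on.
    """
    # Pairwise squared distances between every frame and every code vector.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Toy example: 4 frames with 2-D features, 3-entry codebook (invented values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
frames = np.array([[0.1, 0.0], [0.9, 0.1], [0.0, 0.8], [0.05, 0.05]])
tokens = tokenize_video(frames, codebook)
print(tokens.tolist())  # [0, 1, 2, 0]
```

In a real system the codebook would itself be learned (e.g. by a VQ-style autoencoder) rather than fixed by hand, and the tokens would feed a large sequence model during pre-training.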
On the data side, "Yingshen Intelligent" draws on large volumes of video data collected domestically, which keeps data training costs very low. According to Min Wei, the company positions multiple cameras throughout each work scenario, for example cross-view cameras mounted above and in front of workers. These cameras capture the workers' actions from different angles, and the footage is then fully exploited for 3D spatial modeling, motion capture, and training the robots' motion-generation models.
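How cross-view footage supports 3D spatial modeling is likewise not specified in the article, but the standard building block is multi-view triangulation: given the same point (say, a worker's joint) observed by two calibrated cameras, its 3D position can be recovered by linear least squares (the DLT method). A minimal sketch, assuming known 3x4 projection matrices; the camera setup below is a toy example, not Yingshen's actual rig.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two camera views.

    P1, P2: 3x4 projection matrices of the two cameras.
    x1, x2: (u, v) normalized image coordinates of the same point.
    Returns the 3D point as a length-3 array.
    """
    # Each observation contributes two homogeneous linear constraints.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Toy check: two cameras one unit apart along x, observing a point at (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                 # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])]) # shifted camera
X = triangulate(P1, P2, (0.0, 0.0), (-0.2, 0.0))
print(X)
```

With many cameras and many joints, the same least-squares idea extends to full skeleton reconstruction frame by frame.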
One advantage of this training method is that no additional complex equipment needs to be purchased. This greatly simplifies the training process and avoids disrupting the factory's normal production order, so production and training can proceed in parallel.
During this process, the large spatio-temporal intelligence model produces two kinds of data: the positions and poses of workers' joints, captured via motion-capture techniques and mapped onto the robot's joints; and simulated video from the workers' point of view, which yields training data similar to that from traditional teleoperation. These data are ultimately used to train a small on-device model, which is deployed on a unified hardware body and applied to task-performing robots in specific scenarios.
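Mapping captured human joint poses onto a robot's joints is commonly called retargeting. The article does not describe Yingshen's scheme; a minimal, hypothetical version simply rescales each human joint angle into the corresponding robot joint's range and clamps it to the robot's limits. The joint names and ranges below are invented for illustration.

```python
# Hypothetical joint ranges in radians; real values come from the human
# skeleton model and the robot's URDF/spec, which the article does not give.
HUMAN_RANGE = {"elbow": (0.0, 2.6), "shoulder": (-1.0, 3.0)}
ROBOT_RANGE = {"elbow": (0.0, 2.2), "shoulder": (-0.5, 2.5)}

def retarget(human_angles):
    """Linearly rescale human joint angles into the robot's joint ranges."""
    robot_angles = {}
    for joint, theta in human_angles.items():
        h_lo, h_hi = HUMAN_RANGE[joint]
        r_lo, r_hi = ROBOT_RANGE[joint]
        t = (theta - h_lo) / (h_hi - h_lo)      # normalize to [0, 1]
        t = max(0.0, min(1.0, t))               # clamp to robot limits
        robot_angles[joint] = r_lo + t * (r_hi - r_lo)
    return robot_angles

print(retarget({"elbow": 1.3, "shoulder": 1.0}))
# {'elbow': 1.1, 'shoulder': 1.0}
```

Production retargeting is considerably more involved (it must respect link lengths, self-collision, and end-effector goals), but per-joint rescaling conveys the core idea of the mapping step.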
Currently, "Yingshen Intelligent" has launched the "Yingshen" series of industrial robots, which operate stably under varying working conditions and exhibit generalization ability.
Min Wei told Hard Krypton that, drawing on his experience in Alibaba's local-life business, "Yingshen Intelligent" is discussing cooperation needs with customers across multiple industries and has already won industrial orders worth tens of millions of yuan. It will focus first on factory scenarios and then expand into industries such as express delivery and hotels, with more than a hundred robots expected to be delivered in total by 2025.
In addition, "Yingshen Intelligent" will this year concentrate on developing the robot's brain, improving its ability to understand the external world and carry out tasks, in order to accelerate the adoption of L4-level embodied intelligence in everyday production and life.
Investor's view:
Lin Haizhuo, founding partner and chairman of Joyuan Asia, said that the Yingshen Intelligent team, drawn from Alibaba and Tsinghua University, integrates research, industry, and education. In his view, the team can both "look up at the stars", starting from underlying technology to let robots understand the physical world through a video language, and stay down to earth, steadily bringing robots into industrial scenarios. He added that Joyuan Asia firmly believes Yingshen Intelligent will open up a new track in the technology field and promote the broad, inclusive adoption of embodied intelligence technology.