A team formerly of the Beijing Academy of Artificial Intelligence starts a business; Lenovo and Zhipu invest in a humanoid-robot large-model company | Yingke Exclusive
Author | Huang Nan
Editor | Yuan Silai
Yingke has learned that Beijing BeingBeyond Technology Co., Ltd. (hereinafter "BeingBeyond") recently closed a financing round worth tens of millions of yuan. Lenovo Star led the round, with participation from Zhipu Z Fund, Yanyuan Venture Capital, and Binfu Capital; Potential Capital served as exclusive financial advisor. The funds will go toward core technology R&D and toward accelerating the iteration and industrial validation of existing models, strengthening the company's technological moat and product competitiveness.
BeingBeyond was founded in January 2025 and focuses on the R&D and application of general-purpose large models for humanoid robots. Founder Lu Zongqing is a tenured associate professor at Peking University's School of Computer Science. He previously served as director of the Multimodal Interaction Research Center at the Beijing Academy of Artificial Intelligence (BAAI) and led the first general intelligent agent project under the Original Exploration Program of the National Natural Science Foundation of China. Many core team members also come from BAAI, bringing extensive R&D and deployment experience in reinforcement learning, computer vision, robot control, and multimodality.
At present, data scale and generalization ability are the central tension constraining the performance of the embodied "brain". On one hand, for embodied robots to achieve highly human-like action and decision-making, they must be trained in depth on large volumes of diverse data, covering everything from everyday manipulation to complex environment interaction, and the required data scale is growing exponentially. Data collection, however, still faces technical and resource barriers: it is labor-intensive and hard to scale, and storage costs climb rapidly as data volume surges.
On the other hand, even with abundant data, robots still need strong generalization to flexibly handle new tasks, new objects, and new disturbances in unknown environments. When existing models face scenarios that differ significantly from their training data, performance is mediocre: learned knowledge transfers poorly to new situations, and adaptability in practice is weak.
How to improve generalization with a limited data scale has therefore become the key challenge for the embodied brain to break through its performance bottleneck and reach practical deployment.
Pre-training data used by BeingBeyond (Source: company)
Targeting the two core capabilities of humanoid robots, manipulation and locomotion, BeingBeyond divides its general large-model system into three layers: an embodied multimodal large language model, a multimodal pose model, and a motion model, on top of which it builds a self-learning embodied agent framework.
Lu Zongqing told Yingke that, unlike other models, BeingBeyond's pre-training data comes from Internet videos of human motion and hand manipulation. By parsing the action sequences in these natural scenes, the company builds a pre-training foundation for robot locomotion and manipulation. This public-video-driven route breaks the heavy dependence of traditional approaches on real-robot data and enables cross-modal transfer from "human behavior demonstration" to "robot action generation".
Specifically, BeingBeyond proposes a multimodal pose model. Internet video, including full-body human motion such as walking and dancing, and fine-grained first-person hand manipulation such as grasping objects and using tools, supplies the model with rich and diverse action samples. From this video-action data the model learns how various actions manifest across different environments, and it can perform generalized end-to-end motion and manipulation based on real-time environmental information and task requirements.
For the embodied multimodal large language model, BeingBeyond developed its own Video Tokenizer technology, which emphasizes understanding and reasoning about the spatio-temporal environment, particularly first-person video. By decomposing a continuous video stream into visual token units carrying both temporal and spatial semantics, the model can capture the sequential logic of actions, such as the continuous process of reaching out, raising the arm, and grasping an object, and can understand the physical world and human behavior from spatial features such as object positions and the relative positions of the limbs.
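BeingBeyond has not published implementation details of its Video Tokenizer. Purely as an illustration of the general idea of turning a video stream into discrete tokens, the following is a minimal VQ-style sketch (all function names, shapes, and the nearest-neighbor quantization scheme are assumptions, not the company's design): a clip is cut into spatio-temporal patches, each patch is flattened, and each is mapped to the nearest entry in a learned codebook.

```python
import numpy as np

def tokenize_video(video, codebook, patch_t=2, patch_hw=16):
    """Map a video clip to discrete token ids, VQ-style (illustrative sketch).

    video:    float array of shape (T, H, W, C)
    codebook: float array of shape (K, D), with D = patch_t * patch_hw * patch_hw * C
    Returns an int array with one token id per spatio-temporal patch.
    """
    T, H, W, C = video.shape
    ids = []
    for t in range(0, T - patch_t + 1, patch_t):
        for y in range(0, H - patch_hw + 1, patch_hw):
            for x in range(0, W - patch_hw + 1, patch_hw):
                patch = video[t:t + patch_t, y:y + patch_hw, x:x + patch_hw].reshape(-1)
                # the nearest codebook vector is the patch's discrete token
                ids.append(int(np.argmin(np.linalg.norm(codebook - patch, axis=1))))
    return np.array(ids)

rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 32, 32, 3))        # 4 frames of 32x32 RGB
codebook = rng.standard_normal((512, 2 * 16 * 16 * 3))  # 512-entry codebook
tokens = tokenize_video(clip, codebook)
print(tokens.shape)  # (8,): 2 temporal x 2x2 spatial patches
```

In a real system the codebook would be learned jointly with an encoder (as in VQ-VAE-style models) rather than sampled randomly; the sketch only shows how continuous video becomes a sequence of discrete units that a language model can consume.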
Although a simple "multimodal LLM + motion policy" stack already meets the basic conditions for commercial deployment, robots' generalization struggles to keep up with the dynamic changes of real-world environments. Giving humanoid robots the ability to learn autonomously has therefore become the key breakthrough for commercialization.
To this end, BeingBeyond proposes a Retriever-Actor-Critic framework. By combining retrieval-augmented generation (RAG) over real interaction data with reinforcement learning, the framework improves both model response accuracy and user experience, and closes the loop of "data collection, model optimization, effect feedback", giving the robot the ability to adapt dynamically to changing scenarios and offering a viable technical route to large-scale deployment.
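The article does not disclose how the Retriever-Actor-Critic loop is implemented. As a toy sketch of the closed-loop idea only (the class, its memory structure, and the reward handling are all assumptions, not BeingBeyond's design): each step retrieves the stored interactions most similar to the current state, the actor reuses the best-scoring retrieved action or explores, a critic-style reward function scores the outcome, and the result is written back into memory, so the agent's behavior improves as interaction data accumulates.

```python
import numpy as np

class RetrieverActorCritic:
    """Toy retrieve -> act -> evaluate -> store loop (illustrative only)."""

    def __init__(self, n_actions, k=3, seed=0):
        self.memory = []          # stored interactions: (state, action, reward)
        self.n_actions = n_actions
        self.k = k
        self.rng = np.random.default_rng(seed)

    def retrieve(self, state):
        """RAG-style step: fetch the k stored interactions nearest to `state`."""
        if not self.memory:
            return []
        dists = [np.linalg.norm(state - s) for s, _, _ in self.memory]
        return [self.memory[i] for i in np.argsort(dists)[: self.k]]

    def act(self, state):
        """Actor: reuse the highest-reward action among retrieved neighbors,
        otherwise explore with a random action."""
        neighbors = self.retrieve(state)
        if neighbors:
            return max(neighbors, key=lambda m: m[2])[1]
        return int(self.rng.integers(self.n_actions))

    def step(self, state, reward_fn):
        """One closed-loop iteration: act, score via the critic, store the result."""
        action = self.act(state)
        reward = reward_fn(state, action)   # critic / environment feedback
        self.memory.append((state, action, reward))
        return action, reward

# Toy environment where action 1 is always best.
agent = RetrieverActorCritic(n_actions=3)
reward_fn = lambda s, a: 1.0 if a == 1 else 0.0
env_rng = np.random.default_rng(1)
for _ in range(50):
    agent.step(env_rng.standard_normal(4), reward_fn)
```

A production system would replace the nearest-neighbor memory with a learned retriever and the tabular reuse rule with policy-gradient updates; the sketch only shows the "collect, optimize, feed back" cycle the article describes.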
Pre-training + post-training architecture (Source: company)
Lu Zongqing noted that by pre-training a general action model on Internet video and then adapting it to different robot bodies and scenarios through later-stage training, BeingBeyond's route avoids the data waste caused by hardware iteration and effectively resolves the tension between scarce real-robot data and scenario generalization. The company is currently pursuing scenario-validation partnerships with leading robot manufacturers to accelerate the deployment of embodied intelligence in more fields.
Investor views:
Gao Tianyao, Partner at Lenovo Star, said that the technical route for embodied large models has yet to converge; there is, for example, no unified architecture paradigm. The BeingBeyond team's approach solves the problem of limited training-data sources, and it connects the "big brain" and "little brain" in a modular way to form a complete technical framework. Compared with teams pursuing similar routes abroad, it has full-stack technical capabilities. Relying on self-developed models such as its multimodal large model, it is strongly positioned to tackle task and environment generalization and the cross-embodiment problem of embodied large models, and to progressively achieve "zero-shot" generalization. We look forward to seeing the team's products land in high-potential application scenarios and close the commercial loop.
Wang Pu, Partner at Zhipu Z Fund, said, "As an angel investor in BeingBeyond, I am extremely proud to witness the milestone breakthroughs Professor Lu Zongqing and his team have achieved in general humanoid robotics. From building MotionLib, the industry's first million-scale motion dataset, to developing the end-to-end Being-M0 action generation model, the team has not only validated the scaling effect of 'big data + large models' in embodied intelligence but also closed the technical loop for cross-platform action transfer. The ability to turn text instructions into fine-grained robot actions breaks through the limits of traditional methods and paves the way for robots to enter ordinary households. I firmly believe BeingBeyond will keep leading the iteration of embodied intelligence, from dexterous manipulation to whole-body motion control, and push robots from the laboratory into daily life. We will join hands with BeingBeyond and everyone to welcome a new era empowered by general-purpose robots."