
A former algorithm researcher at Tencent Robotics X has started a company and secured three rounds of financing in four months. He aims to bring humanoid robots into households within 3-5 years.

Fu Chong · 2025-11-20 07:33
During his four years at Tencent, Zhu Qingxu fed various types of training data into embodied models. He ultimately concluded that "the mainstream approach of training on teleoperation data has fundamental flaws."

Text by | Fu Chong

Edited by | Su Jianxun

Zhu Qingxu, a member of the post-95s generation, is a former researcher at Tencent's Robotics X Laboratory. In June 2025, he left Robotics X and founded Lingqi Wanwu, an embodied-intelligence algorithm company.

According to exclusive information obtained by "Intelligent Emergence", Lingqi Wanwu has completed three rounds of financing in just four months since its founding, raising nearly 100 million yuan in total. The first round was solely invested by Yuanhe Origin. The second round was led by Harmony Partners, with follow-on investments from Inno Angel, Yuansheng Venture Capital, and Jinqiu Fund; existing shareholder Yuanhe Origin made an oversubscribed follow-on investment. The third round was led by Jinqiu Fund, with follow-on investments from Plum Ventures and Zhuoyuan Asia; existing shareholder Inno Angel made an oversubscribed follow-on investment, and Harmony Partners also participated.

Recently, Lingqi Wanwu launched a set of demos by combining its own algorithms with Unitree robots.

In the unaccelerated video, the robot performs a series of household chores with near-human fluency, from removing dust mites from the bed to standing on a stool to water plants on the high shelves of a flower stand.

The inspiration for these tasks came from a Xiaohongshu post Zhu Qingxu saw, titled "A Day in the Life of a Mom Raising a Child Alone". He selected several of the tasks that are most troublesome for humans to complete, because these actions almost all require "using both hands and feet", which severely tests an embodied-intelligence algorithm's control over the robot.

After the video was released, it was reposted more than 4,000 times, with comments such as "The silicon-based nanny has taken on a tangible form."

△Video demo, Image: Provided by the interviewee

During the interview, Zhu Qingxu put forward many "anti-consensus" views.

"I believe that the robot configuration truly capable of handling household scenarios is still the bipedal humanoid form, and it should be achievable within 3-5 years," Zhu Qingxu said.

The household scenario consists of diverse, non-standard tasks and environments, which makes learning and generalization difficult for embodied intelligence. In addition, the bipedal form itself faces challenges in motion control, balance, and engineering complexity. The industry therefore generally believes that the "ultimate scenario" of bipedal humanoid robots working in households is still 5 to 10 years away.

Zhu Qingxu firmly believes that bipedal humanoid robots are better suited to household tasks. His reasoning: the human world is designed around the human body, so only the humanoid form can best reuse human data and adapt to complex household environments. For actions such as climbing, straddling, and bending over in particular, wheeled robots can hardly cover these combinations of postures.

As for why his timeline for "robots doing household chores" is significantly earlier than the industry consensus, Zhu Qingxu gave a direct reason.

"Why is progress in training humanoid robots so slow, and why are the actions in so many demo videos so slow? Because the mainstream training scheme based on teleoperation data has fundamental flaws," he said.

In his view, during teleoperation the operator holds a remote-control device and steers the robot through the task. Because the human is thinking while controlling, actions that should be instinctive become slow and jerky. Training robots on such data inevitably produces non-fluent performance.

These views stem from Zhu Qingxu's academic and professional background.

Zhu Qingxu's academic background is in robot control. In 2021, he graduated from a joint training program between ETH Zurich (the Swiss Federal Institute of Technology in Zurich) and RWTH Aachen University in Germany.

In 2021, he joined Tencent's Robotics X. Over the following four years, he and his team collected data through various means and systematically trained embodied-intelligence models. They found that models trained on teleoperation data executed tasks inefficiently.

In May this year, the American robotics company Boston Dynamics also questioned teleoperation, arguing that it uses the human "System 2" (the slow system) to collect data, which leads to inefficient behavior, a lack of dynamism, and many unnecessary actions. This inspired Zhu Qingxu's technical approach.

In Lingqi Wanwu's algorithm, Zhu Qingxu adopts an architecture of "cerebellum" + "brain". The former is responsible for motion control, and the latter is responsible for planning and generalization ability.

Currently, Lingqi Wanwu focuses on breakthroughs in the "cerebellum", a part that has received less attention in the industry. By building a complete "human action library", it can collect action data quickly and enable the robot to efficiently learn most basic actions (meta-actions).

For real-machine data collection, Lingqi Wanwu abandons the industry-standard "teleoperation" and instead adopts an "optical motion capture + UMI" scheme.

The scheme first uses optical motion capture: the operator wears a motion-capture suit and performs real actions in the capture space while multiple cameras record synchronously. This reproduces humans' smooth, instinctive behavior patterns more accurately and greatly improves data-collection efficiency in the laboratory.

Subsequently, in real environments, the operator manipulates objects directly with a UMI gripper, capturing a large amount of real hand-object interaction data. Combined with the motion-capture data from the previous step, this forms a high-quality, scalable training data base.

△The motion-capture suit worn by the operator records body posture. Image source: Provided by the interviewee

When asked about the key to the financing, Zhu Qingxu said that the technical differentiation between Lingqi Wanwu and the current mainstream solutions was the main reason it quickly won investment from mainstream institutions. These investors, who have invested widely in embodied intelligence, still chose Lingqi Wanwu because they value the synergy between its technology and their other portfolio companies.

Zhu Qingxu further predicts that once the efficiency of this technology improves, the timeline for bipedal humanoid robots to enter households will shorten to 3-5 years.

In the nearer term, bipedal humanoids may first enter unmanned stores such as retail and fast-food outlets within 1-2 years, because the tasks in such scenarios are fixed and the environment is controllable, allowing commercial value to be validated quickly.

When asked about Lingqi Wanwu's moat, Zhu Qingxu summarized: "When everyone was optimistic about teleoperation, we could identify its fundamental flaws and find new ideas; our ability to stay committed to the ultimate goal of household and service robots, and to turn an immature idea into reality step by step, is the real barrier."

This is also Zhu Qingxu's first public statement since founding the company. The following dialogue is from an exclusive interview; the content has been edited:

△Image source: Provided by the interviewee

Teleoperation has fundamental flaws

Intelligent Emergence: Why do you think "teleoperation" has fundamental flaws?

Zhu Qingxu: The core issue is that teleoperation uses the human brain's "slow system" to control the robot. The operator needs to observe, think, and then execute, and this process is inherently slow, jerky, and full of unnecessary pauses.

Training a robot on data from this "slow system" is like making it imitate a teacher whose movements are not smooth; its performance ceiling is locked in. This is the root cause of all the robot videos that have to be played at accelerated speed.

Moreover, for dexterous operations that require tactile feedback, such as unscrewing a bottle cap, teleoperation provides no real force feedback. The operator may not know whether the robot's hand has reached the right position on the cap, which further reduces the efficiency of the action.

Intelligent Emergence: Since teleoperation has the problems you describe, why did it become the widely adopted solution at this stage?

Zhu Qingxu: I think the initial idea was to let the robot directly manipulate objects and obtain real-machine data from the robot itself. Teleoperation was the first solution to achieve that goal.

Intelligent Emergence: How does your alternative, "motion capture + UMI", work specifically? What are its advantages?

Zhu Qingxu: It is a solution that balances data quality and scale.

Optical motion capture: In the laboratory, a person wears a motion-capture suit and completes various actions naturally (using the "fast system"). It records the smooth, coordinated movement trajectories of the human body with high precision and fidelity.

UMI: It can be understood as a handheld gripper device. The operator uses it to actually manipulate objects, collecting a large amount of hand-object interaction data.

By combining the two, we obtain a data set with both the high quality of motion capture and the scale of UMI. It records humans' subconscious actions, the "instincts" that robots really need to learn.

In the data pyramid, the top layer is teleoperation data: collected on real machines, but scarce. The bottom layer is video data: abundant, but hard to use efficiently for training because of problems with perspective and video quality. Our solution provides exactly the middle layer of the pyramid: higher quality than video data, and far larger in quantity than teleoperation data.

Intelligent Emergence: Once the data is collected, how do the "cerebellum" and "brain" in your algorithm divide the work?

Zhu Qingxu: We adopt a hierarchical architecture, which is more in line with how intelligence forms.

Cerebellum (meta-action library): Its goal is to master all basic human actions, such as walking, running, squatting, grasping, and pulling. We train it on motion-capture data in a simulation environment without real objects. Once this "meta-action library" is built, it is universal and can be called across all scenarios.

Brain (task planning and generalization): It perceives the environment through cameras, understands language instructions, plans tasks, and accurately calls actions from the "cerebellum" skill library to complete them.

The two are not sequential but coupled and iterative: the richer the cerebellum's skills, the more tools the brain can call; the smarter the brain, the more accurately it calls those skills.

△Video demo, Image: Provided by the interviewee

From unmanned stores to households within 3-5 years

Intelligent Emergence: You mentioned that bipedal humanoid robots will first be deployed in unmanned stores within 1-2 years. How will that be achieved, and how fast can they learn?

Zhu Qingxu: In scenarios such as an unmanned KFC or an unmanned supermarket, the tasks and actions are limited and can be enumerated exhaustively.

We can perform and capture all the actions of frying fries, making burgers, and restocking in the motion-capture laboratory. Because our data quality is high, the robot learns these atomic actions very efficiently. Taking the KFC scenario as an example, it takes only 2 to 3 days for the robot to learn all the actions of the different work stations.

After that, we only need to collect data on site to train the "brain's" generalization ability. This efficiency is unmatched by teleoperation.

Intelligent Emergence: Motion capture requires a "studio"-like environment full of cameras. If you want to teach a robot to fry fries at a KFC, you surely can't set up camera rigs in the KFC kitchen. So how is this done?

Zhu Qingxu: Indeed, there is no need to.

All the actions can be collected in the laboratory, because human actions form a "finite set". Frying fries, for example, can be decomposed into actions such as holding, placing, lifting, and shaking off the oil, all of which can be captured with the motion-capture equipment.

Then, in the actual scenario, we only need to supplement UMI data (handheld-gripper interaction with objects) and environmental data.

Intelligent Emergence: What is the biggest challenge in generalizing from closed scenarios to households?

Zhu Qingxu: The biggest challenge is generalization. Household environments vary greatly, and we need to overcome three types of generalization:

1. Object generalization: Be able to correctly manipulate objects of different shapes, materials, and sizes.

2. Position generalization: Be able to find and handle objects in any corner and at any height.

3. Scene generalization: Adapt to different household layouts, lighting conditions, and furniture styles.

This requires us to collect a large amount of diverse scene data for the "brain" model. We believe in the Scaling Law, but only on the premise that the data quality is high enough and the quantity is large enough.

△In the demo, the robot plays frisbee with a child. Image source: Provided by the interviewee

Criticism, barriers, and the future

Intelligent Emergence: Why do you think the embodied intelligence that will truly enter households and work there will take the bipedal humanoid form?

Zhu Qingxu: Bipedal robots do have real problems and difficulties: a higher center of gravity, less stability than wheeled robots, and harder control. These are the main points of contention between humanoid and non-humanoid designs. But overall, I think the advantages outweigh the disadvantages.

We hope that robots can serve humans without modifying the household living environment. Reversing from this ultimate goal, the humanoid form is the most adaptable to the human living environment.

Some terrains at home are not well suited to wheeled robots: a small step on the balcony or in the kitchen, a split-level layout, or stairs. In terms of furnishings, a floor covered with thick carpet is also hard for wheeled robots to move on.

Moreover, for tasks that require height changes, such as reaching for something from a ladder, watering plants on a flower stand, or bending down to search for something or pick up trash, wheel-legged robots struggle, and the humanoid form is more suitable. And if a non