
Can a robot start working after just one demonstration? A joint team from Peking University and BeingBeyond enables the G1 to get to work from a single simulated demonstration, with no real-world training data, using a "hierarchical cerebellum + simulated avatar" approach.

QbitAI 2025-11-14 10:36
The G1 robot can learn 10 household tasks at once, and DemoHLM cuts the training cost down to the scale of hours.

Recently, a research team from Peking University and BeingBeyond proposed the DemoHLM framework, offering a new approach to humanoid robot locomotion and manipulation. From just one human demonstration in a simulated environment, it automatically generates a vast amount of training data, enabling real humanoid robots to perform generalized manipulation across multiple tasks. This directly addresses the core pain points of traditional methods: reliance on hard-coding, the high cost of real-world data, and poor cross-scenario generalization.

Core Challenges: The "Triple Dilemma" of Humanoid Robot Locomotion and Manipulation

Locomotion and manipulation are the core capabilities humanoid robots need to integrate into human environments (moving boxes, opening doors, handing over objects). However, progress has long been limited by three major problems:

  • Low data efficiency: Traditional methods require collecting a large amount of real robot teleoperation data, which is extremely costly and difficult to scale up.
  • Poor task generalization: methods rely on task-specific hard-coded designs (such as predefined subtasks and bespoke reward functions), so every new task requires fresh engineering.
  • Difficult Sim-to-Real transfer: policies trained in simulation often fail to run stably on real robots because of physics-engine discrepancies and sensor noise.

Existing solutions are either confined to simulation or require hundreds of hours of real teleoperation data, making it difficult to meet the practical demands of complex settings such as homes and factories.

DemoHLM: Innovation in Hierarchical Architecture and Data Generation to Solve the Triple Dilemma

The core innovation of DemoHLM lies in its dual engine of "hierarchical control + single-demonstration data generation", which ensures stable whole-body motion while enabling generalized learning at extremely low data cost.

Hierarchical Control Architecture: Balancing Flexibility and Stability

DemoHLM adopts a hierarchical design of "low-level whole-body controller + high-level manipulation policy" to decouple motion control from task decision-making:

  • Low-level whole-body controller (RL-trained): converts high-level commands (such as torso velocity and upper-body joint targets) into joint torques while preserving the robot's omnidirectional mobility and balance. Built on the AMO framework, it runs at 50 Hz and stably handles high-contact scenarios (such as the force interactions that occur while grasping or pushing objects).
  • High-level manipulation policy (imitation learning): using closed-loop visual feedback (an RGB-D camera perceives the object's 6D pose), it issues task-oriented commands to the low level to make complex manipulation decisions. It supports multiple behavior cloning (BC) algorithms such as ACT and Diffusion Policy, runs at 10 Hz, and focuses on long-horizon planning (see the sketch after this list).
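The division of labor between the two rates can be pictured with a minimal control-loop sketch. This is an illustrative Python sketch, not code from the DemoHLM release; `policy`, `controller`, and the `robot` interface are hypothetical stand-ins for the 10 Hz BC policy and the 50 Hz RL whole-body controller.

```python
CONTROL_HZ = 50   # low-level whole-body controller rate
POLICY_HZ = 10    # high-level manipulation policy rate
STEPS_PER_COMMAND = CONTROL_HZ // POLICY_HZ  # 5 control ticks per command

def run_episode(robot, policy, controller, max_policy_steps=200):
    """One episode: the high-level policy issues a command, the low-level
    controller tracks it for STEPS_PER_COMMAND ticks, then the loop repeats."""
    for _ in range(max_policy_steps):
        obs = robot.get_observation()        # RGB-D object pose + proprioception
        command = policy.act(obs)            # torso velocity, upper-body joint targets
        for _ in range(STEPS_PER_COMMAND):
            state = robot.get_state()
            torques = controller.compute_torques(state, command)
            robot.apply_torques(torques)     # 50 Hz joint-torque control
        if robot.task_done():
            break
```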

In addition, the team fitted the robot with a 2-DoF active neck and an RGB-D camera (Intel RealSense D435), using a proportional controller for stable visual tracking that imitates how human operators adjust their line of sight, avoiding perception failures caused by object occlusion.
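As a rough illustration of that proportional controller, the sketch below steers a 2-DoF neck (yaw, pitch) to keep the object centered in the camera view. The gains, frame conventions, and helper names are assumptions for illustration, not values from the paper.

```python
import numpy as np

KP = np.array([2.0, 2.0])  # illustrative proportional gains for yaw/pitch

def neck_targets(object_dir_cam, neck_angles):
    """object_dir_cam: unit vector toward the object in the camera frame
    (x right, y down, z forward). Returns new (yaw, pitch) targets that
    reduce the pointing error proportionally."""
    err_yaw = np.arctan2(object_dir_cam[0], object_dir_cam[2])     # left/right error
    err_pitch = -np.arctan2(object_dir_cam[1], object_dir_cam[2])  # up/down error
    return neck_angles + KP * np.array([err_yaw, err_pitch])
```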

Single-Demonstration Data Generation: From "One Demonstration" to "Thousands of Trajectories"

The most crucial breakthrough of DemoHLM is that, with no real-world data at all, it can generate a vast amount of diverse training data from just one simulated teleoperation demonstration. The core process consists of three steps:

  • Demonstration collection: capture human motion with an Apple Vision Pro, map it to the Unitree G1 in simulation, and record one successful manipulation trajectory (including joint positions, end-effector poses, and object poses).
  • Trajectory conversion and segmentation: decompose the demonstration into three stages, "Locomotion", "Pre-manipulation", and "Manipulation", and achieve generalization through coordinate-frame conversion.

Pre-manipulation stage: an object-centric coordinate frame ensures that the end-effector aligns accurately with the target regardless of the object's initial pose.

Manipulation stage: switching to a robot-centric (proprioceptive) coordinate frame solves trajectory generation for phases in which the end-effector and object move together, as during grasping or carrying.

  • Batch synthesis: randomly initialize the robot and object poses in simulation, automatically adapt the per-stage commands, and replay them to generate hundreds to thousands of successful trajectories, forming a training dataset (a sketch of the frame-change and replay idea follows below).
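The generalization hinges on re-expressing the demonstrated trajectory in the right frame before replaying it. Below is a minimal sketch of that idea, with poses as 4x4 homogeneous transforms; all function names are hypothetical, and this is not the paper's implementation.

```python
import numpy as np

def to_object_frame(demo_ee_world, T_obj_demo):
    """Re-express each demonstrated end-effector pose in the object's frame."""
    T_obj_inv = np.linalg.inv(T_obj_demo)
    return [T_obj_inv @ T_ee for T_ee in demo_ee_world]

def replay_in_new_scene(ee_in_obj, T_obj_new):
    """Map the object-frame trajectory onto a randomized object pose,
    producing a new world-frame end-effector trajectory."""
    return [T_obj_new @ T_ee for T_ee in ee_in_obj]

def synthesize(demo_ee_world, T_obj_demo, sample_object_pose, simulate, n=1000):
    """Batch synthesis: randomize the object pose, replay the adapted
    trajectory in simulation, and keep only the successful rollouts."""
    ee_in_obj = to_object_frame(demo_ee_world, T_obj_demo)
    dataset = []
    for _ in range(n):
        T_obj_new = sample_object_pose()            # random initial object pose
        trajectory = replay_in_new_scene(ee_in_obj, T_obj_new)
        result = simulate(trajectory)               # roll out in the simulator
        if result.success:
            dataset.append(result.data)
    return dataset
```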

This process is fully automated, avoiding the "data collection hell" of traditional imitation learning; at the same time, randomizing the initial conditions naturally improves the policy's generalization.

Experimental Verification: Stable Performance from Simulation to Reality

The team conducted comprehensive verification in simulation (IsaacGym) and on a real Unitree G1 robot across 10 locomotion and manipulation tasks (such as moving boxes, opening cabinet doors, pouring water, and handing over objects). The core results are as follows:

Simulation: Positive Correlation between Data Volume and Performance, Strong Algorithm Compatibility

  • Significant data efficiency: as the amount of synthetic data grows from 100 to 5,000 trajectories, success rates rise markedly on all tasks. For example, "PushCube" improves from 52.4% to 89.3% and "OpenCabinet" from 18.9% to 67.3%, with marginal gains gradually converging, which demonstrates the efficiency of the data generation pipeline.
  • Flexible algorithm adaptation: the framework performs well with three BC algorithms: ACT, MLP, and Diffusion Policy. ACT and Diffusion Policy perform comparably (for example, both exceed 96% success on "LiftBox"), while the simple MLP is slightly weaker for lack of temporal modeling, verifying the framework's compatibility with different learning algorithms.

Real World: Stable Sim-to-Real Transfer, Multi-Task Deployment

On the modified Unitree G1 (equipped with a 3D-printed gripper, a 2-DoF neck, and a single RGB-D camera), DemoHLM achieves zero-shot transfer. Among the 10 tasks:

  • Tasks with a 100% success rate: "LiftBox" (lifting a box) and "PressCube" (pressing a cube) both succeed 5/5, with executions highly consistent with the simulation.
  • High-stability tasks: "PushCube" (pushing a cube) and "Handover" (handing over an object) each succeed 4/5; the isolated failures are caused by differences in ground friction.
  • Breakthrough on complex tasks: tasks demanding precise force control, such as "GraspCube" (grasping a cube) and "OpenCabinet" (opening cabinet doors), exceed a 60% success rate, placing them among the best of comparable simulation-trained methods.

The key reason is that the high-level policy adjusts its commands in real time via closed-loop visual feedback, offsetting the physical differences between simulation and reality (such as joint tracking errors) and keeping the manipulation behavior consistent, as the sketch below contrasts with open-loop replay.
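A minimal, hypothetical contrast between open-loop replay and the closed-loop behavior described above (all interface names are illustrative):

```python
def open_loop_step(recorded_commands, t):
    """Replays the recorded command regardless of what the robot sees,
    so sim-to-real errors accumulate over the episode."""
    return recorded_commands[t]

def closed_loop_step(policy, camera, robot):
    """Re-perceives the object every cycle and recomputes the command,
    so small tracking errors are corrected instead of accumulating."""
    object_pose = camera.estimate_object_pose()     # RGB-D 6D pose estimate
    obs = {"proprio": robot.proprioception(), "object_pose": object_pose}
    return policy.act(obs)
```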

Industry Value and Future Directions

The breakthrough of DemoHLM provides key technical support for the practical application of humanoid robots:

  • Lower deployment costs: a single demonstration plus simulated data generation cuts the training cost from "hundreds of hours of real teleoperation" to "hours of simulated demonstration", significantly lowering the barrier to industry adoption.
  • Better generalization: with no task-specific design, one framework adapts to multiple scenarios (household handling, industrial assistance, service interaction), accelerating robots' move from the laboratory to real environments.
  • Technological integration: the hierarchical architecture is compatible with upgrades such as tactile sensors and multi-camera perception, laying a foundation for more complex future scenarios (such as occluded environments and deformable-object manipulation).

The team also acknowledged current limitations: reliance on simulated data may introduce long-term Sim-to-Real bias; a single RGB-D camera struggles in heavily occluded scenes; and the system does not yet handle unmodeled objects. Future work will explore directions such as hybrid training on simulated and real data and multimodal perception fusion to further improve robustness.

Summary

With "single - simulation - demonstration - driven generalized locomotion and manipulation" as the core, DemoHLM solves the three major pain points of high training cost, poor generalization, and difficult transfer of humanoid robots through a hierarchical control architecture and an efficient data generation pipeline.

Its real-world validation on the Unitree G1 demonstrates the framework's practical value, charting an important technical path toward large-scale deployment of next-generation humanoid robots in household, industrial, and service scenarios.

Paper Link:

https://arxiv.org/pdf/2510.11258

Project Homepage:

https://beingbeyond.github.io/DemoHLM/

This article is from the WeChat official account "QbitAI", which focuses on cutting-edge technology. It is republished by 36Kr with authorization.