
For the first time, a humanoid robot has bridged the gap between visual perception and motion, and a Chinese doctoral student from UC Berkeley demonstrated it live on the Unitree G1.

量子位 (QbitAI) | 2025-06-25 16:56
There is no need to adapt to the environment in advance, enabling zero-shot deployment.

There's no need to familiarize the robot with the environment in advance. With just a command, you can make a Unitree robot sit on a chair, a table, or a box!

It can also directly unlock tasks such as "stepping over boxes" and "knocking on the door".

This is the LeVERB framework, the latest research result from teams at UC Berkeley, Carnegie Mellon University, and other institutions.

Based on simulated data training, it achieves zero-shot deployment, enabling humanoid robots to directly complete full-body movements by perceiving new environments and understanding language instructions.

Traditional humanoid robots either "can understand instructions but can't move" (lacking full-body control ability) or "can only mechanically execute movements but can't understand the environment" (relying on manually preset action libraries).

LeVERB for the first time bridges the gap between visual semantic understanding and physical movement, allowing robots to go from "thinking" to "doing" like humans, automatically perceiving the environment and directly following instructions to complete movements.

The "sitting" movement shown above is completed through "camera perceiving the environment + 'Sit on [chair/box/table]' instruction":

The team also launched a supporting benchmark: LeVERB-Bench.

This is the first "simulation-to-real" visual-language closed-loop benchmark for humanoid robot WBC (Whole-Body Control), including more than 150 tasks in 10 categories.

The team deployed the framework on the Unitree G1 robot for benchmark testing, and the results show:

In simple visual navigation tasks, the zero-shot success rate reaches 80%, the overall task success rate is 58.5%, and its performance is 7.8 times that of a naive hierarchical VLA (Vision-Language-Action) baseline.

Currently, the LeVERB-Bench dataset has been open-sourced in the LeRobot format, and the complete code of the project will also be released soon.

A two-layer system enables full-body movements from "thinking" to "doing"

Most Vision-Language-Action (VLA) models rely on manually designed low-level action "vocabularies" (such as end-effector poses and root velocities) when controlling robots.

This makes them only able to handle quasi-static tasks and unable to cope with the flexible full-body movements required for humanoid robot whole-body control (WBC).

Put simply, in previous robots either the high level directly controlled the motion details (like a brain micromanaging every step while also trying to think, which is inefficient), or the low level did not understand semantics (like limbs that only follow simple commands and cannot handle complex tasks).

Humanoid robots are high-dimensional nonlinear dynamic systems that require a combination of high-frequency control and low-frequency planning. Traditional methods lack effective integration of visual and language semantics.

Therefore, the team proposed compressing high-level visual-language inputs into an action vector, i.e., an abstract instruction that the low-level action module can recognize and execute.

In the LeVERB framework, this abstract instruction is called "latent action vocabulary".

The LeVERB framework consists of a hierarchical dual system, with the "latent action vocabulary" serving as the interface between the two layers.

The ultimate goal of this method is to keep the "latent action vocabulary" of the two layers consistent, allowing the high-level to focus on "understanding the task" and the low-level to focus on "performing the movement", leveraging their respective strengths.

The LeVERB framework
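To make this division of labor concrete, below is a minimal Python sketch of what such a two-layer loop could look like. The class names, the 32-dimensional latent size, the 23 joint targets, and the exact frequencies are illustrative assumptions, not the team's actual implementation.

```python
import numpy as np

# Minimal sketch of the two-layer system described above. All names, the
# latent size, the joint count, and the frequencies are illustrative
# assumptions rather than the authors' actual implementation.

LATENT_DIM = 32       # assumed size of one "latent verb"
HIGH_LEVEL_HZ = 10    # slow "thinking" loop (vision-language planner)
LOW_LEVEL_HZ = 50     # fast "doing" loop (whole-body controller)

class HighLevelPlanner:
    """Stands in for LeVERB-VL: image + instruction in, latent verb out."""
    def plan(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real system would run a vision-language model here.
        return np.random.randn(LATENT_DIM).astype(np.float32)

class LowLevelController:
    """Stands in for LeVERB-A: latent verb + proprioception in, joint targets out."""
    def act(self, latent: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        # A real system would run the learned WBC policy here.
        return np.zeros(23)  # placeholder joint-position targets

def control_loop(planner, controller, get_image, get_proprio,
                 instruction="Sit on the chair", steps=500):
    steps_per_plan = LOW_LEVEL_HZ // HIGH_LEVEL_HZ
    latent = None
    for t in range(steps):
        if t % steps_per_plan == 0:                     # re-plan at the slow rate
            latent = planner.plan(get_image(), instruction)
        action = controller.act(latent, get_proprio())  # act at the fast rate
        # here, `action` would be sent to the robot's joint controllers
```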

High-level LeVERB-VL (thinking): a 102.6M-parameter Transformer-based vision-language module. It takes in camera images and language instructions and outputs abstract latent action instructions ("latent verbs") for the low level, running at a much lower planning frequency (around 10Hz) than the controller.

LeVERB-VL is responsible for understanding what is seen and what is heard. For example, given the instruction "Go and sit on the blue chair", it first works out "Where is the blue chair?" and "How do I get there?", but it does not directly control the movement details; instead, it turns that plan into an "abstract instruction".

Through components such as the VLA prior module, kinematic encoder, residual latent space, kinematic decoder, and discriminator, it maps visual and language inputs to a smooth and regular latent vocabulary space, generating latent action plans for motion control.

During training, the model is optimized through three parts: trajectory reconstruction, distribution alignment, and adversarial classification. At the same time, a data-mixing strategy is adopted to increase data diversity, and the hyperparameters are carefully tuned so that visual-language information is processed efficiently and decisions are made accurately.
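As a rough illustration of how those three training signals could be combined, here is a hedged PyTorch-style sketch. The loss forms and weights are assumptions for exposition and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: one way to combine the three objectives named
# above (trajectory reconstruction, distribution alignment, adversarial
# classification). Loss forms and weights are assumptions, not the paper's
# exact formulation.

def latent_space_losses(decoded_traj, target_traj,       # kinematic decoder output vs. demo
                        mu, logvar,                       # encoder's posterior parameters
                        disc_logits_on_vl_latents,        # discriminator scores for VL-branch latents
                        w_rec=1.0, w_kl=1e-3, w_adv=1e-2):
    # 1) Trajectory reconstruction: decoded kinematics should match the demo.
    rec = F.mse_loss(decoded_traj, target_traj)

    # 2) Distribution alignment: keep the latent vocabulary smooth and regular
    #    by pulling the posterior toward a standard-normal prior (VAE-style KL).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # 3) Adversarial term: push vision-language latents to look like
    #    kinematics-encoder latents so the discriminator cannot tell them apart.
    adv = F.binary_cross_entropy_with_logits(
        disc_logits_on_vl_latents,
        torch.ones_like(disc_logits_on_vl_latents))

    return w_rec * rec + w_kl * kl + w_adv * adv
```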

Low-level LeVERB-A (doing): a 1.1M-parameter Transformer-based full-body action expert. It is a whole-body control (WBC) policy trained with reinforcement learning that receives the high-level latent action instructions, decodes the latent verbs into dynamics-level humanoid movements, and runs at a frequency of 50Hz.

The function of this part is to transform the latent instructions generated by LeVERB-VL into dynamic-level movements that the robot can execute.

During training, a vision-language-agnostic teacher policy is first trained with the Proximal Policy Optimization (PPO) algorithm. Then, the DAgger algorithm and a Huber loss are used to distill the teacher policy's actions into the student policy (i.e., LeVERB-A), which is conditioned on the latent commands.

At run time, LeVERB-A receives proprioceptive information and the latent vector, uses its Transformer architecture to output reparameterized, torque-level joint-position action commands, and runs real-time inference in C++ on the robot's on-board CPU to achieve whole-body control of the humanoid.
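Below is a hedged Python sketch of what such a teacher-student distillation loop could look like: a privileged teacher (already trained with PPO) labels the states that the student visits, and the student, conditioned on the latent command, is regressed onto those labels with a Huber loss. The environment interface and tensor shapes are assumed for illustration.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the distillation described above. The environment API
# (privileged/proprio observations, current_latent_command) and all shapes
# are illustrative assumptions, not the authors' actual training code.

def distill_step(teacher, student, env, optimizer, horizon=1000):
    obs = env.reset()
    states, latents, labels = [], [], []

    for _ in range(horizon):
        latent = env.current_latent_command()            # assumed helper
        with torch.no_grad():
            teacher_action = teacher(obs["privileged"])  # expert label
            student_action = student(obs["proprio"], latent)
        states.append(obs["proprio"]); latents.append(latent)
        labels.append(teacher_action)
        # DAgger: roll out with the *student* so it sees its own mistakes.
        obs = env.step(student_action)

    pred = student(torch.stack(states), torch.stack(latents))
    loss = F.huber_loss(pred, torch.stack(labels))       # robust regression loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```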

LeVERB-Bench

Without a way to measure progress, further work cannot proceed, so the team also proposed a supporting benchmark, LeVERB-Bench, for humanoid-robot vision-language whole-body control (WBC) tasks.

In the field of humanoid robot WBC, demonstration data for training VLA models is scarce. Existing benchmarks have many problems: they focus only on locomotion, include no vision in the state space, and suffer from a large sim-to-real gap due to unrealistic rendering, so they cannot meet the research needs.

LeVERB-Bench replays retargeted motion-capture (MoCap) motions in simulation to collect realistic trajectory data. This approach does not require reliable dynamics control during data collection; kinematic poses can provide task-level semantics, and it also supports the use of retargeted humanoid data from sources such as Internet videos.

By using ray-tracing rendering in IsaacSim, it simulates scene lighting and shadows more accurately, reducing the sim-to-real gap caused by unrealistic lighting in previous synthetic data.

Through a procedural generation pipeline, each trajectory is scaled up and randomized: the scene background, object attributes, task settings, and camera views are randomized, and some demonstrations are mirrored to ensure data diversity and semantic richness.

The data is labeled with egocentric text commands, either manually or with a VLM. A VLM is also used to write text instructions for motion-only pairs, increasing the amount of language-only data and expanding data coverage.
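A simplified Python sketch of this kind of procedural augmentation is shown below; the randomization ranges, scene options, mirroring rule, and captioning step are all assumptions used only to illustrate the idea.

```python
import random

# Illustrative sketch of procedural trajectory augmentation as described above.
# Scene options, ranges, the mirror() helper, and the captioning rule are
# hypothetical placeholders, not the benchmark's actual pipeline.

def augment(trajectory, base_scene, num_variants=100):
    variants = []
    for i in range(num_variants):
        scene = dict(base_scene)
        scene["background"] = random.choice(["warehouse", "office", "lab"])
        scene["object_color"] = random.choice(["red", "blue", "green"])
        scene["camera_yaw_deg"] = random.uniform(-30.0, 30.0)

        # Mirror a portion of the demonstrations for extra diversity.
        traj = trajectory.mirror() if i % 2 == 0 else trajectory

        # An egocentric instruction would be written by hand or by a VLM, e.g.:
        caption = f"Sit on the {scene['object_color']} chair"
        variants.append({"scene": scene, "trajectory": traj, "text": caption})
    return variants
```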

LeVERB-Bench includes various task categories, such as Navigation, Towards, Around, Locomotion, Sitting, Reaching, etc.

The tasks are divided along two dimensions, vision-language tasks and language-only tasks, covering a total of 154 vision-language task trajectories and 460 language-only task trajectories. Each trajectory generates a large amount of demonstration data after multiple rounds of randomization.

Through 154 trajectories, each randomized 100 times, 17.1 hours of realistic motion trajectory data were generated. In addition, 2.7 hours of language-only data were added, covering 500 different trajectories, further enriching the dataset.

Evaluation is carried out in 20 randomized environments. The scene textures and object attributes for each task category are fully randomized and never appear in the training data, and the third-person camera view is locally randomized, ensuring that the evaluation tasks have not been seen visually in the training set and thus testing the model's generalization ability.
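In sketch form, such an evaluation protocol might look like the following; `make_random_env` and `rollout` are placeholder callables, and the success criterion is assumed.

```python
# Sketch of an evaluation loop in the spirit described above: 20 held-out
# randomized environments per task, with textures and camera poses drawn from
# ranges never used in training. `make_random_env` and `rollout` are
# placeholder callables supplied by the caller.

def evaluate(policy, make_random_env, rollout, tasks, num_envs=20):
    results = {}
    for task in tasks:
        successes = 0
        for seed in range(num_envs):
            env = make_random_env(task, seed=seed, unseen_textures=True)
            successes += int(rollout(policy, env))  # True if the task succeeds
        results[task] = successes / num_envs        # per-category success rate
    return results
```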

Experimental results

The team deployed the LeVERB framework on the Unitree G1 robot to test its zero-shot closed-loop control ability in real scenarios, making the robot perform tasks such as "Go to the chair and sit down". This verified the transfer ability of LeVERB from simulation to reality, proving the feasibility of the framework in practical applications.

Through evaluation on the LeVERB-Bench benchmark, the LeVERB framework performed excellently. In simple visual navigation tasks, the zero-shot success rate reached 80%, and the overall task success rate was 58.5%, 7.8 times that of the naive hierarchical VLA baseline. This indicates that LeVERB can effectively handle complex visual-language tasks and generalizes well across different scenarios.

An ablation experiment was also conducted on the key components of the LeVERB framework to explore the impact of each component on performance. For example, components such as the discriminator (ND) and kinematic encoder (NE) were removed for testing.

Removing the discriminator (ND) led to a significant decline in performance, indicating its importance in aligning the latent space and enhancing the model's generalization ability. Removing the kinematic encoder (NE) also reduced the performance, proving the necessity of the kinematic encoder for supplementing motion detail information.

Half of the team members are Chinese

Half of the members of the LeVERB team are Chinese scholars from UC Berkeley, Carnegie Mellon University (CMU), etc.

Xue Haoru, the project lead, received his master's degree from Carnegie Mellon University (CMU) and is currently pursuing a PhD at UC Berkeley.

He has conducted robot research at the MPC Lab and LeCAR Lab and is currently interning at the NVIDIA GEAR Lab.