For the first time, a VLA model pre-trained purely on human videos can be deployed for dexterous manipulation with only a small amount of fine-tuning data.
Achieving human-level dexterous manipulation is one of the long-standing core challenges in robotics.
Although multi-fingered dexterous hands have hardware potential comparable to that of the human hand, high-quality robot motion data is expensive to collect. As a result, existing Vision-Language-Action (VLA) models lag far behind large language models (LLMs) and vision-language models (VLMs) in data scale and diversity, which makes it difficult for them to meet the demands of complex real-world tasks.
The recent paper "Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos", from Microsoft Research Asia (MSRA) in collaboration with Tsinghua University, proposes a pre-training framework called VITRA to address this problem.
The core contribution is a fully automated pipeline that converts large amounts of unlabeled real-life human activity videos into data fully aligned with the format of existing robot V-L-A training data.
By extracting 3D hand motion trajectories from videos, segmenting them into atomic actions, and automatically generating language instructions, the research team constructed a very large hand V-L-A dataset containing 1 million segments and 26 million frames.
After pre-training on human video data alone, the model shows strong zero-shot hand motion prediction in completely unseen real-world environments.
With only a small amount of real-robot data for fine-tuning, it achieves high success rates in dexterous manipulation on real robots and generalizes well to new objects and new environments.
The details follow.
Building the conversion pipeline from human videos to robot data
The central question of the paper is how to bridge the large gap between unstructured human videos and structured robot data, so that high-quality action labels and language instructions can be extracted for VLA pre-training.
The research builds a complete system from three core techniques, which together convert raw videos into V-L-A data.
3D motion annotation: Accurately recovering hand and camera trajectories
Recovering accurate 3D hand motion from monocular, uncalibrated video shot by a possibly moving camera is highly challenging.
The research proposes a monocular camera and hand pose tracking method built on recent 3D vision techniques:
First, the camera's motion state is determined from background optical flow, and the camera intrinsics are estimated.
Next, the camera pose is tracked using deep visual SLAM together with depth estimation models, and a hand reconstruction model extracts the per-frame 3D hand pose in camera space (including the 6D wrist pose and all joint angles).
Finally, these results are combined to obtain the 3D hand motion trajectory in world space.
This method provides high-precision action labels and also lays the foundation for the subsequent action segmentation and instruction annotation.
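The final step, combining the tracked camera pose with the camera-space hand pose, amounts to composing rigid transforms. The following is a minimal sketch of that composition, not the paper's implementation: it assumes each pose is available as a 4x4 homogeneous matrix, and the names `make_pose` and `wrist_to_world` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(rotation, dtype=float)
    T[:3, 3] = np.asarray(translation, dtype=float)
    return T

def wrist_to_world(T_world_cam, T_cam_wrist):
    """Compose per-frame poses.

    T_world_cam: (N, 4, 4) camera poses in world space (assumed to come from SLAM).
    T_cam_wrist: (N, 4, 4) wrist poses in camera space (assumed to come from
                 the hand reconstruction model).
    Returns (N, 4, 4) wrist poses in world space.
    """
    # Batched matrix product: world<-camera times camera<-wrist, frame by frame.
    return np.asarray(T_world_cam) @ np.asarray(T_cam_wrist)
```

The translation column of each returned matrix is the wrist position in world space, which is the quantity the segmentation step operates on.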
Atomic-level action segmentation: Natural segmentation based on velocity minima
Existing robot V-L-A data usually consists of simple, short-horizon atomic tasks, so the difficulty lies in accurately cutting such atomic actions out of long videos.
Inspired by the natural rhythm of human motion, the research team proposes a simple and efficient algorithm: segmenting at the minima of the hand's speed in 3D space.
When a person transitions between actions, the hand typically slows down and then speeds up again, so speed minima often mark the switch from one action to the next.
By detecting speed minima along the 3D wrist trajectory in world space, the method cuts long videos into short segments that each contain a single atomic action, with no additional manual annotation or model inference.
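The idea of cutting at speed minima can be illustrated in a few lines. This is a minimal sketch under stated assumptions rather than the authors' algorithm: the function name `segment_at_speed_minima`, the frame rate `fps`, and the `min_gap` spacing rule are all hypothetical, and the article does not say whether the paper smooths the trajectory or thresholds the minima.

```python
import numpy as np

def segment_at_speed_minima(wrist_xyz, fps=30.0, min_gap=5):
    """Split a world-space wrist trajectory at local minima of its speed.

    wrist_xyz: (N, 3) wrist positions, one per frame.
    fps:       assumed frame rate, used only to scale speed to units per second.
    min_gap:   assumed minimum number of frames between two consecutive cuts.
    Returns a list of (start_frame, end_frame) pairs covering the trajectory.
    """
    pos = np.asarray(wrist_xyz, dtype=float)
    # Finite-difference speed between consecutive frames (length N - 1).
    speed = np.linalg.norm(np.diff(pos, axis=0), axis=1) * fps

    cuts = []
    for i in range(1, len(speed) - 1):
        is_local_min = speed[i] < speed[i - 1] and speed[i] <= speed[i + 1]
        if is_local_min and (not cuts or i - cuts[-1] >= min_gap):
            cuts.append(i)

    bounds = [0] + cuts + [len(pos) - 1]
    return list(zip(bounds[:-1], bounds[1:]))
```

On real tracking output the speed signal is noisy, so some low-pass filtering before the minimum search would likely be needed; that step is omitted here for brevity.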
Instruction annotation: Precise action descriptions grounded in 3D trajectories
To generate accurate language instructions for the segmented clips, the research team combines a vision-language model (VLM) with the 3D hand trajectories.
For each clip, the system uniformly samples 8 frames and overlays the projected 3D trajectory of the palm on each image.
These trajectory-annotated images are then fed to GPT-4, which is prompted to describe the action of the specified hand as an imperative sentence based on the image content and the trajectory.
Experiments show that providing atomic-level clips and overlaying 3D hand trajectories significantly improves the accuracy of the generated action descriptions.
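The frame sampling and trajectory overlay described above rest on two small computations: picking evenly spaced frame indices and projecting 3D points into pixel coordinates. The sketch below shows both under the assumption of a simple pinhole camera with no lens distortion and palm points already expressed in the camera frame of the image being annotated; `sample_frame_indices` and `project_points` are hypothetical names, not functions from the paper.

```python
import numpy as np

def sample_frame_indices(num_frames, k=8):
    """Return k frame indices spread uniformly over a clip of num_frames frames."""
    return np.linspace(0, num_frames - 1, k).round().astype(int)

def project_points(points_cam, fx, fy, cx, cy):
    """Project camera-space 3D points to pixels with a pinhole model.

    points_cam: (N, 3) points with positive depth (z > 0).
    fx, fy:     focal lengths in pixels; cx, cy: principal point in pixels.
    Returns (N, 2) pixel coordinates (u, v).
    """
    p = np.asarray(points_cam, dtype=float)
    u = fx * p[:, 0] / p[:, 2] + cx
    v = fy * p[:, 1] / p[:, 2] + cy
    return np.stack([u, v], axis=1)
```

The resulting pixel coordinates can be drawn onto each sampled frame as a polyline with any image library before the frames are sent to the VLM.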
Strong zero-shot prediction and real-world generalization
Using this automatically constructed large-scale human hand V-L-A dataset, the research team designed and trained a VLA model specifically for dexterous manipulation.
1. Model architecture combining a VLM and a diffusion action expert
The VLA model consists of a VLM backbone (PaliGemma-2) and a diffusion action expert (Diffusion Transformer, DiT).
The VLM receives visual observations, language instructions, and camera field-of-view (FoV) information, and outputs a "cognition feature".
The diffusion action expert takes this cognition feature, the current hand state, and a masked noisy action chunk, and predicts the future hand action sequence through iterative denoising.
To handle fast human hand motion and adapt to short clips, the model uses causal attention for action denoising, so that the prediction at each action step depends only on earlier steps, which avoids the negative impact of zero-padding.
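To see why causal attention shields earlier steps from padded ones, consider a generic single-head causal self-attention. This is an illustrative sketch only: the article does not describe the internals of the paper's DiT, and the function name `causal_attention` and the plain NumPy formulation are assumptions made for clarity.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: (T, D) arrays, one row per action step.
    Step t may attend only to steps 0..t, so anything placed at later
    positions (for example zero-padding) cannot affect its output.
    Returns (output, weights) with shapes (T, D) and (T, T).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # True above the diagonal marks "future" positions to be hidden.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the allowed positions of each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because every row's weights are zero above the diagonal, replacing the tail of a short clip with padding leaves the outputs for the real steps unchanged.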
2. Zero-shot hand motion prediction: Strong performance in unseen environments
In completely unseen real-life environments, the pre-trained model shows strong zero-shot hand motion prediction.
In evaluations on grasping tasks and general action prediction tasks, the model significantly outperforms models trained on data collected in laboratory environments (such as EgoDex) and models trained on the original human-annotated data.
This indicates that pre-training on large amounts of diverse real-life video can substantially improve a model's generalization to complex environments and unknown objects.
3. Real-robot dexterous manipulation: Efficient deployment with a small amount of fine-tuning data
To deploy on real robots, the research team aligned the action space of the human hand with that of the robot's dexterous hand (such as the Xingdong XHAND1 mounted on a Realman robot).
With only a small amount (about 1,200) of real-robot teleoperation data for fine-tuning, the pre-trained model can perform a variety of dexterous manipulation tasks in the real world, including grasping, placing, pouring, and sweeping.
Experimental results show that, compared with models that were not pre-trained on human VLA data or that were pre-trained on other datasets (such as OXE and EgoDex), this method achieves a significantly higher task success rate and is notably more robust, especially on unseen objects and backgrounds.
Hardware support behind VITRA's real-world deployment
VITRA's generalization on real robots stems not only from algorithmic innovation but also from the underlying hardware: the Xingdong XHAND1, China's first fully direct-drive five-fingered dexterous hand, developed in-house by Xingdong Jiyuan.
The framework and the hardware characteristics of the Xingdong XHAND1 form a close software-hardware collaboration, which gives the system clear advantages in practical deployment.
Matching a high-precision URDF to the human hand action space
A core step in the VITRA framework is aligning the human hand action space with the action space of the robot's dexterous hand.
The Xingdong XHAND1 ships with an official, highly accurate URDF model that precisely describes its kinematic and dynamic parameters and closely mirrors the spatial layout of human hand joints.
This "digital twin"-level model allows VITRA to map human joint angles accurately onto the corresponding joints of the Xingdong XHAND1 during fine-tuning, which significantly narrows the gap between human videos and real hardware and helps the pre-trained policy deploy efficiently on the real robot.
Fully direct-drive architecture and high-frequency response: Executing complex dexterous operations
Complex dexterous tasks such as pouring and sweeping require very fast dynamic response from the robot.
The fully direct-drive motor architecture of the Xingdong XHAND1 provides a well-suited hardware foundation for the algorithm.
The direct-drive design removes the large friction, hysteresis, and nonlinear disturbances introduced by traditional gear reducers, which gives the hand highly sensitive dynamic response. This allows the Xingdong XHAND1 to execute the action commands output by the VITRA model quickly and accurately and to handle a variety of unknown objects safely.
Rich sensor array: Room for future multi-modal perception
Although the current VITRA model relies mainly on visual input, the sensor array on the Xingdong XHAND1 (such as its high-resolution tactile array) leaves ample room for future multi-modal perception.
With the hardware perception capabilities of the Xingdong XHAND1, future VLA models are expected to incorporate tactile feedback and handle more delicate and complex "finger gait" tasks.
Scaling behavior with data size
The research also examines in depth how the scale of pre-training data affects model performance.
Experiments show that as the amount of pre-training data increases, the model's error on the zero-shot hand motion prediction task steadily decreases, and its success rate on real-robot manipulation tasks continues to rise.
This clear scaling behavior suggests that further expanding the scale of human video data should continue to improve VLA model performance.
The work marks a key step forward in pre-training robot VLA models on unstructured human videos.
By providing a fully automated data conversion pipeline, the research significantly lowers the barrier to obtaining high-quality robot training data, paves the way for applying multi-fingered dexterous hands in a wider range of complex real-world scenarios, and lays a foundation for the development of truly general embodied intelligence.
Paper link: https://arxiv.org/abs/2510.21571
This article is from the WeChat official account "QbitAI", author: VITRA team. Republished by 36Kr with authorization.