
Xiaomi's first-generation robot VLA model is here: silky smooth, with an inference latency of only 80ms.

QbitAI (量子位), 2026-02-13 08:49
It can run on a 4090.

So, has there been a single gala program lately without robots in it?

At the CCTV Spring Festival Gala, the family-reunion broadcast of Chinese New Year, robots from several embodied-intelligence companies made their appearance.

Manufacturers large and small are piling into the field, capital is chasing, and media hype is everywhere... embodied robots have all but become the center of the next technology narrative after large AI models.

The embodied robot industry is indeed at a very interesting point:

On one hand, it is a spectacular visual feast. High-difficulty stunts keep trending, and after "seeing" these performances the public firmly believes in the future of embodied intelligence.

On the other hand, the industry urgently expects "real value". People are starting to ask when these robots can actually enter factories, handle the grunt work, and unlock real productivity.

This expectation actually reflects a paradigm evolution that embodied intelligence is currently undergoing.

For robots to truly become a productive force, the core value ultimately lies in autonomy. At the technology-verification stage, "manual assistance" or "single-step teleoperation" is a reasonable path and helps accumulate data and experience.

But if a robot keeps pausing mid-execution and corrects itself slowly, humans must step in again and again, breaking the automated workflow.

And if every robot needs a human standing behind it, then... no comment.

Only when one person can supervise ten, a hundred, even a thousand robots at once, and only when each embodied robot can keep deciding, correcting, and executing through long-horizon tasks, will all the attention on embodied intelligence amount to more than talk.

So it is not hard to see why Xiaomi's first embodied VLA model takes aim at the stop-and-go problem of embodied robots.

At a parameter scale of 4.7B, Xiaomi-Robotics-0 achieves an inference latency of 80ms and a real-time control frequency of 30Hz, and runs smoothly on a consumer-grade graphics card (RTX 4090).

On mainstream benchmarks in both simulation and real-world environments, including LIBERO, CALVIN, and SimplerEnv, Xiaomi-Robotics-0 sets new state-of-the-art (SOTA) results.

And, the most important thing should be repeated three times:

This model is open source, open source, open source.

The Three Technical Innovations of Xiaomi-Robotics-0, Explained

To achieve these results, Xiaomi built three core technical innovations into Xiaomi-Robotics-0, covering architecture design, pre-training strategy, and post-training mechanism.

These three parts all point to one goal: enabling robots to understand complex environments and execute actions continuously, stably, and accurately.

Dual-Brain Collaboration: DiT as a Cerebellum That Generates Whole Continuous Action Chunks

First, there are major changes at the architecture level.

Xiaomi adopts the currently mainstream MoT (Mixture-of-Transformers) architecture, but cleverly splits the work between a "brain" and a "cerebellum".

The brain is the VLM (Vision-Language Model), responsible for overall perception, listening, understanding, and decision-making; the cerebellum is a DiT (Diffusion Transformer) of only 16 layers.

The clever part of this design: the KV cache produced by the brain is passed to the cerebellum, which specializes in emitting continuous action chunks, changing the granularity of action generation.

The traditional discrete-token approach quantizes continuous actions, which can cost precision and leave slight discontinuities in the trajectory.

DiT, combined with flow matching, can generate continuous action vectors directly, making motions smoother and more dexterous.

By introducing the flow-matching training objective, Xiaomi-Robotics-0 learns the probability-flow mapping between continuous action distributions at training time. At inference time, the number of sampling steps is compressed to five, down from the dozens to hundreds that traditional diffusion models (such as DDPM) typically need. This dramatically shortens the inference chain and lays the groundwork for low-latency real-time control.
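The few-step sampling described above can be illustrated with a minimal pure-Python sketch: Euler integration of a learned probability-flow ODE from noise (t=0) to an action (t=1) in five steps. The function names and the toy velocity field are hypothetical stand-ins; in the real model a DiT predicts the velocity.

```python
def sample_action_chunk(velocity_field, noise, steps=5):
    """Euler-integrate a probability-flow ODE from t=0 (noise) to t=1
    (action) in the small number of steps flow matching allows."""
    dt = 1.0 / steps
    x, t = list(noise), 0.0
    for _ in range(steps):
        v = velocity_field(x, t)                    # model predicts velocity
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler step
        t += dt
    return x

# Toy straight-line (rectified) flow toward a known target action,
# purely for illustration of the sampling loop.
target = [0.5, -0.2, 0.8]

def toy_field(x, t):
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

action = sample_action_chunk(toy_field, [0.0, 0.0, 0.0], steps=5)
```

Because the toy flow is a straight line, five Euler steps land essentially on the target; the point is that the sampling loop is short enough for real-time control.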

Since the DiT and the underlying VLM are both Transformer architectures, the VLM's KV cache can be reused directly, cutting redundant computation.

Architecturally, the brain and cerebellum are loosely coupled through the KV cache, preserving understanding ability while keeping the compute budget in check.

This loose coupling sharply reduces inference latency, making the robot's motions not only smooth and dexterous but also millisecond-responsive: for a model with 4.7B total parameters, inference latency is 80ms, supporting a 30Hz control frequency, and it runs in real time on a consumer-grade graphics card (RTX 4090).
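The KV-cache reuse pattern can be sketched as follows, with hypothetical stub functions standing in for the two models: the expensive "brain" forward pass runs once per action chunk, while the cheap "cerebellum" denoising step reuses the cached result every iteration.

```python
def generate_chunk(vlm_encode, dit_denoise, observation, instruction,
                   noise, steps=5):
    """The 'brain' (VLM) encodes the scene once; the 'cerebellum' (DiT)
    reuses that KV cache at every denoising step."""
    kv_cache = vlm_encode(observation, instruction)  # expensive, done once
    x = noise
    for k in range(steps):
        t = k / steps
        x = dit_denoise(x, t, kv_cache)              # cheap, reuses cache
    return x

# Stub usage counting forward passes (stand-ins, not real model code):
calls = {"vlm": 0, "dit": 0}

def fake_vlm(obs, instr):
    calls["vlm"] += 1
    return "kv-cache"

def fake_dit(x, t, kv):
    calls["dit"] += 1
    return x

chunk = generate_chunk(fake_vlm, fake_dit, "frame", "fold towel", [0.0] * 7)
```

With five denoising steps, the counters show one VLM pass against five DiT passes per chunk, which is where the latency saving comes from.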

Two-Stage Pre-Training: Learning Actions While Preserving Visual Understanding

With its second innovation, Xiaomi tackles a long-standing trade-off in embodied models: gaining one ability at the cost of another.

After training on large amounts of robot action data, many models see their once-strong vision-language (VL) understanding degrade rapidly; they can execute tasks but can no longer think.

To keep the model from getting dumber, Xiaomi uses a two-stage scheme during pre-training.

In the first stage, using the Choice Policy and cross-platform robot trajectory data, the VLM learns to coarsely predict action chunks while still understanding images and instructions.

The core of this step is to align the visual feature space with the action space, establishing a mapping between "what is seen" and "how to move".

Meanwhile, mixing vision-language data into pre-training keeps the VLM from forgetting its original visual reasoning, building an intuition of "seeing this scene, feeling how to act".

Entering the second stage of fine-grained action training, Xiaomi deliberately protects the model's original multimodal general knowledge.

Specifically, in the second stage the VLM is frozen, and the DiT alone is trained for fine-grained generation via flow matching. The VLM now only provides stable multimodal understanding, while the cerebellum focuses on high-precision generation of continuous action trajectories.

This division of labor ensures the model retains strong vision-language ability even after gaining action capability: when executing tasks, the robot can both understand complex instructions and plan continuous motions.

For long-horizon tasks and human-robot interaction, this ability is an essential foundation.
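The stage split can be summarized as a tiny selector over which parameter groups receive gradients. This is an illustrative sketch of the idea, not Xiaomi's actual training code; the parameter names are made up.

```python
def trainable_params(stage, vlm_params, dit_params):
    """Which parameters get gradient updates in each pre-training stage."""
    if stage == 1:
        # Stage 1: the VLM learns coarse action prediction, with
        # vision-language data mixed in to preserve VL understanding.
        return vlm_params + dit_params
    if stage == 2:
        # Stage 2: the VLM is frozen to protect multimodal knowledge;
        # only the DiT cerebellum is refined via flow matching.
        return dit_params
    raise ValueError("stage must be 1 or 2")

stage2 = trainable_params(2, ["vlm.attn.w"], ["dit.block0.w"])
```

In a real framework this corresponds to disabling gradients on the frozen VLM weights and passing only the DiT weights to the optimizer.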

Improved Asynchronous Execution: Solving Action Inertia with a Λ-Shaped Attention Mask

The third innovation targets the persistent problem of action deviation: the Xiaomi-Robotics-0 team introduces an improved asynchronous execution scheme in the post-training stage.

Traditional asynchronous execution feeds the previous action chunk in as a prefix to keep transitions smooth, but this breeds action inertia: the model over-relies on past actions, neglects current visual input, and is slow to correct when the environment changes.

Xiaomi's innovation is to introduce Λ-shaped attention (a lambda-shaped mask mechanism) during post-training.

Think of it as fitting the robot with a scope that has a rear-view mirror:

Within an action chunk, the actions adjacent to the prefix look back at previous actions to keep transitions smooth; the actions far from the prefix are forced to focus on current visual feedback, so motions are corrected in real time as the environment changes.

This mechanism lets the model re-examine the environment while keeping actions continuous, achieving "coherent yet correctable" behavior in real tasks: an ideal balance of smoothness and accuracy.
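A boolean mask with this shape can be sketched in a few lines. This is a simplified illustration of the idea, assuming a hard cutoff between "near-prefix" and "far-from-prefix" rows; the paper's actual mask layout may differ.

```python
def lambda_attention_mask(chunk_len, prefix_len, vision_len):
    """Λ-shaped mask for one action chunk (illustrative sketch).

    Columns: [current vision tokens | previous-chunk action prefix].
    Rows near the start of the chunk may look back at the prefix
    (smooth handover); later rows see only current vision
    (real-time correction). All rows see the vision tokens."""
    mask = []
    for i in range(chunk_len):
        near_prefix = i < prefix_len  # assumed cutoff for this sketch
        row = [True] * vision_len + [near_prefix] * prefix_len
        mask.append(row)
    return mask

mask = lambda_attention_mask(chunk_len=4, prefix_len=2, vision_len=3)
```

The first rows can attend everywhere (the wide top of the Λ), while the last rows are cut off from the stale action prefix, which is what suppresses action inertia.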

This improved asynchronous mechanism enables the model to achieve smooth actions, maintain accuracy, and lead in throughput simultaneously.

Hard-Core Results in Simulation and the Real World

Backed by these three innovations, Xiaomi-Robotics-0 posts extremely hard-core evaluation results.

First, the VLA simulation benchmarks.

On the VLA simulation benchmarks most valued in embodied intelligence, Xiaomi all but dominates.

Across six simulation suites including LIBERO, CALVIN, and SimplerEnv, Xiaomi-Robotics-0 comprehensively outperforms about 30 existing leading models, including π0, π0.5, OpenVLA, RT-1, and RT-2.

(Note: for details, see the paper at https://xiaomi-robotics-0.github.io/assets/paper.pdf)

Whether on LIBERO, which tests multi-task generalization, or CALVIN, which tests long-horizon manipulation stability, Xiaomi-Robotics-0 sets new records, with success rates exceeding the recognized open-source baseline π0.5.

In particular, on the LIBERO-Object task Xiaomi-Robotics-0 reaches a 100% success rate, and it ranks near the top of the LIBERO suites with an average score of 98.7%.

Next, Xiaomi-Robotics-0's performance on VLM benchmarks such as MathVista and ScienceQA, which focus on visual understanding and mathematical reasoning.

Across nine test sets, including MMBench, MME, POPE, SeedBench, AI2D, MMMU, ScienceQA, MathVista, and ERQA, most of Xiaomi-Robotics-0's scores are higher than those of the comparison models.

The model still maintains high scores after introducing action capabilities, which proves that it does not sacrifice understanding ability to obtain control ability.

Of course, for embodied intelligence, the performance in real - world tasks in the physical world is obviously more convincing.

"Folding towels" is a task that is needed in the real world and has high requirements for embodied robots - the robot needs to handle unstructured soft objects.

Xiaomi - Robotics - 0 tested six different towels and maintained a high success rate and high throughput during 30 - minute continuous operations.

For the task of "disassembling Lego", which requires extreme fine - control and high - frequency feedback, the robot needs to first disassemble Lego components into building blocks and then put each block into the corresponding storage box according to its color.

The model also shows a very high completion rate: it achieves a 100% success rate in the MA and LA - 10 scenarios and leads in throughput by about 25%.

Based on the performance data of the three types of test sets, Xiaomi - Robotics - 0 has established a closed - loop from simulation to visual understanding to real - robot operation, and it is already a very mature integrated VLA model.