
Oh no! Robots have learned to predict the future.

QbitAI 2026-01-31 09:35
The fourth open-source release from LingBot by Ant: LingBot-VA

Remarkably, robots are now starting to learn to visualize the future.

This is yet another impressive open-source achievement from LingBot by Ant (the fourth in as many days):

The world's first causal video-action world model for general robot control: LingBot-VA.

How do they visualize the future?

Put simply, previous robots (especially those based on VLA) worked mainly through a reflex mechanism: they would move their hands immediately upon seeing something.

This is called the "observe-react" mode.

However, LingBot-VA breaks out of this pattern through autoregressive video prediction: before acting, it first simulates the next few seconds of the scene in its "mind".

To be honest, making decisions based on imagination is quite novel in the field of robot control.

But this is not the only highlight of LingBot-VA. It also includes:

Memory retention: When performing long-sequence tasks (such as making breakfast), it remembers what it has just done and has a strong sense of its own state.

Efficient generalization: It can adapt to new tasks with just dozens of demonstration samples, and it also works across different robot bodies.

Therefore, with the support of LingBot-VA, robots can easily handle high-precision tasks such as cleaning small transparent test tubes.

As we mentioned earlier, today is the fourth consecutive day of LingBot by Ant's open-source releases.

If the previous open-source projects strengthened the robot's eyes (LingBot-Depth), brain (LingBot-VLA), and world simulator (LingBot-World), then today's LingBot-VA gives this "body" a soul:

A world model in action that turns imagination into execution.

In this way, LingBot by Ant has raised the ceiling of general robots.

As netizens have said:

From prediction to execution; to be honest, this is a huge leap.

Let imagination take the lead

LingBot-VA has chosen a more advanced path in its architectural design.

In the traditional VLA (vision-language-action) paradigm, the model usually handles three complex tasks, namely visual understanding, physical-change reasoning, and low-level action control, within the same neural network. In the literature this is known as representation entanglement.

To achieve higher sample efficiency and stronger generalization, LingBot-VA untangles this mess with a brand-new approach: imagine the world first, then reverse-engineer the actions.

To implement this idea, the LingBot by Ant team adopts a two-step strategy:

Video world model: First predict the future visual state (what will happen next).

Inverse Dynamics: Based on the visual changes, infer what actions should be taken (how the hands should move to achieve this scene).

This is fundamentally different from traditional VLA: the model does not jump directly from the "present" to an "action" but passes through the "future" first.
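To make this two-step flow concrete, here is a minimal sketch in PyTorch. The class names (WorldModel, InverseDynamics), the GRU-based predictor, and all dimensions are illustrative assumptions rather than the released LingBot-VA architecture; the point is only the data flow: imagine future visual latents first, then decode the action that bridges the current state and the imagined next state.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of "imagine first, then reverse-engineer the action".
# WorldModel and InverseDynamics are illustrative stand-ins, not the released API.

class WorldModel(nn.Module):
    """Rolls the visual state forward: predicts future latents autoregressively."""
    def __init__(self, dim=512, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, obs_latents):                # (B, T, dim) past observations
        futures, h, x = [], None, obs_latents
        for _ in range(self.horizon):              # autoregressive rollout
            out, h = self.rnn(x, h)
            nxt = self.head(out[:, -1:])           # predicted next visual latent
            futures.append(nxt)
            x = nxt                                # feed the prediction back in
        return torch.cat(futures, dim=1)           # (B, horizon, dim)

class InverseDynamics(nn.Module):
    """Infers which action moves the scene from the current latent to the imagined one."""
    def __init__(self, dim=512, action_dim=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, action_dim))

    def forward(self, current, imagined_next):
        return self.net(torch.cat([current, imagined_next], dim=-1))

# Usage: imagine the future, then decode the action bridging "now" and "imagined next".
world, inv_dyn = WorldModel(), InverseDynamics()
obs = torch.randn(1, 4, 512)                       # four past frames as latents
future = world(obs)                                # imagined future latents
action = inv_dyn(obs[:, -1], future[:, 0])         # (1, 14) action vector
```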

How is this achieved? The LingBot by Ant team focuses on three architectural breakthroughs.

First is the autoregressive interleaved sequence of video and action.

In the LingBot-VA model, video tokens and action tokens are placed in the same temporal sequence.

To ensure logical rigor, the team introduces causal attention. It's like setting a strict rule for the model: it can only use past information and must never peek at the future.

Meanwhile, with the help of a KV cache, the model gains strong long-term memory: it clearly remembers what it did three steps ago and never loses track of the task.
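The toy snippet below is a hedged illustration of how the interleaved sequence, causal attention, and KV cache fit together: video and action tokens are merged into one sequence [v0, a0, v1, a1, ...] and processed step by step, caching keys and values so each new token attends only to what came before. The function names and dimensions are assumptions for illustration, not code from the LingBot-VA release.

```python
import torch
import torch.nn as nn

# Toy illustration (not LingBot-VA code) of interleaved video/action tokens
# processed with causal, cache-based attention.

def interleave(video_tokens, action_tokens):
    """Merge per-step video and action tokens into one sequence: v0, a0, v1, a1, ..."""
    B, T, D = video_tokens.shape
    pairs = torch.stack([video_tokens, action_tokens], dim=2)   # (B, T, 2, D)
    return pairs.reshape(B, 2 * T, D)

video = torch.randn(1, 3, 64)      # 3 timesteps of video tokens
action = torch.randn(1, 3, 64)     # 3 timesteps of action tokens
seq = interleave(video, action)    # sequence of length 6

proj_k, proj_v = nn.Linear(64, 64), nn.Linear(64, 64)
kv_cache = {"k": [], "v": []}      # keys/values of every token processed so far

outputs = []
for t in range(seq.shape[1]):
    tok = seq[:, t:t + 1]                           # only the newest token
    kv_cache["k"].append(proj_k(tok))               # cache, don't recompute history
    kv_cache["v"].append(proj_v(tok))
    keys = torch.cat(kv_cache["k"], dim=1)
    values = torch.cat(kv_cache["v"], dim=1)
    # Causality holds by construction: only tokens 0..t are in the cache,
    # so the model can use the past but never peeks at the future.
    attn = torch.softmax(tok @ keys.transpose(1, 2) / 64 ** 0.5, dim=-1)
    outputs.append(attn @ values)                   # (1, 1, 64) attended output
```

The cache is also what provides the long-term memory described above: every earlier video and action token stays addressable at each new step without reprocessing the whole history.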

Second is the division of labor and cooperation within the Mixture-of-Transformers (MoT) design.

This step mainly solves the representation entanglement problem mentioned earlier.

You can think of this as two parts of the same model, each doing its own job yet cooperating in harmony:

Video stream: Wide and deep, responsible for heavy-duty visual simulation.

Action stream: Light and fast, responsible for precise motion control.

These two streams share the attention mechanism and exchange information, but remain independent in their respective representation spaces.

In this way, the complexity of vision will not interfere with the precision of actions, and the simplicity of actions will not reduce the richness of vision.
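Below is a hedged sketch of this two-stream arrangement: one attention layer operates over the joint token sequence so the streams exchange information, while each stream keeps its own feed-forward weights and therefore its own representation space. The module name, layer sizes, and the use of nn.MultiheadAttention are illustrative assumptions, not the MoT implementation shipped with LingBot-VA.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a Mixture-of-Transformers-style block:
# shared attention over both streams, but separate per-stream feed-forward nets.

class TwoStreamBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Video stream: wide and deep for heavy visual simulation.
        self.video_ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Action stream: light and fast for precise motion control.
        self.action_ffn = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, video_tokens, action_tokens):
        # Shared attention: both streams see one joint sequence and exchange information.
        joint = torch.cat([video_tokens, action_tokens], dim=1)
        mixed, _ = self.shared_attn(joint, joint, joint)
        n_v = video_tokens.shape[1]
        v, a = mixed[:, :n_v], mixed[:, n_v:]
        # Separate FFNs keep the two representation spaces independent.
        return video_tokens + self.video_ffn(v), action_tokens + self.action_ffn(a)

block = TwoStreamBlock()
video_out, action_out = block(torch.randn(1, 16, 256),   # many visual tokens
                              torch.randn(1, 2, 256))     # a few action tokens
```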

Finally, there is the engineering design work.

After all, theory alone is not enough. "Practice is the sole criterion for testing truth":

Partial Denoising: When predicting actions, there is no need to render the future scene in high definition every time. The model learns to extract key information from noisy intermediate states, greatly improving computational efficiency.

Asynchronous Inference: While the robot executes the current action, the model is already computing the next step in the background. Inference and execution run in parallel, and perceived latency is almost eliminated (see the sketch after this list).

FDM Grounding: To prevent the model's imagination from deviating from reality, the system continuously corrects the imagination with real observation data to avoid open-ended hallucination drift.
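As a purely illustrative sketch of the asynchronous-inference point above (hypothetical function names and timings, not LingBot-VA's runtime), the snippet below plans the next action chunk in a background thread while the main thread executes the current one, so model latency is hidden behind execution time.

```python
import queue
import threading
import time

# Toy model of asynchronous inference: planning overlaps with execution.

action_queue = queue.Queue(maxsize=1)   # hand-off buffer between planner and executor

def plan_action(step):
    time.sleep(0.05)                    # stand-in for model inference latency
    return f"action_chunk_{step}"

def planner(n_steps):
    for step in range(n_steps):
        action_queue.put(plan_action(step))   # blocks only if the robot lags behind

def executor(n_steps):
    for _ in range(n_steps):
        action = action_queue.get()     # the next chunk is usually already waiting
        time.sleep(0.10)                # stand-in for executing on the robot
        print("executed", action)

t = threading.Thread(target=planner, args=(5,))
t.start()
executor(5)
t.join()
```

With these toy timings, the 50 ms of planning fits entirely inside the 100 ms of execution, which is the sense in which the text says the delay is almost eliminated.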

Experimental results and capability verification

After understanding the theory, let's look at the experimental results.

The LingBot by Ant team has run comprehensive real-robot and simulation benchmarks on LingBot-VA.

In the real-robot tests, LingBot-VA tackled three of the most challenging task types.

First are long-sequence tasks, such as preparing breakfast (toasting bread, pouring water, setting the table) and unpacking a package (picking up a knife, cutting the box, opening the lid).

These tasks involve many steps, and a mistake at any one of them can ruin the whole attempt. LingBot-VA's performance here can be summed up in one word: stable.

Even if it occasionally fails, the robot remembers its progress and tries again.

The second type is high-precision tasks, such as cleaning test tubes and driving screws.

This requires action precision at the millimeter level. Thanks to the MoT architecture, the action stream is no longer disturbed by visual noise, and the hands are extremely stable.

We've already seen the test-tube cleaning example. Now let's look at the screw-driving example:

The third type involves deformable objects, such as folding clothes and trousers.

The difficulty here is that the objects' state is constantly changing. LingBot-VA, however, anticipates the fabric's deformation through video simulation and operates smoothly.

In addition, LingBot-VA also performs well on the two hardcore simulation benchmarks, RoboTwin 2.0 and LIBERO.

In the dual-arm cooperation tasks of RoboTwin 2.0 in particular, whether the scenario is simple and fixed (Easy) or complex and randomized (Hard), LingBot-VA performs strongly:

RoboTwin 2.0 (Easy): The success rate is 92.93%, 4.2% higher than the second-ranked model.

RoboTwin 2.0 (Hard): The success rate is 91.55%, 4.6% higher than the second-ranked model.

Moreover, there is a very obvious trend:

The more difficult the task and the longer the sequence (the larger the Horizon), the bigger LingBot-VA's lead.

In long tasks with Horizon = 3, its advantage widens to more than 9%.

On the LIBERO benchmark, LingBot-VA achieved an average success rate of 98.5%, setting a new SOTA.

In summary, through these experiments, we can clearly see three core characteristics of LingBot-VA:

Long-term memory: In a counted plate-wiping task (wipe back and forth a set number of times), an ordinary VLA model loses track of how many passes it has made and starts wiping at random. LingBot-VA, in contrast, counts accurately and stops when done. This is the KV cache at work.

Few-shot adaptation: Facing a brand-new task, it can learn from only about 50 demonstrations with light fine-tuning, orders of magnitude more efficient than models that require tens of thousands of samples.

Generalization: Trained on one type of cup, it can still accurately recognize and manipulate a cup of a different shape, color, or position at test time.

Four consecutive days of open-source releases have had an impact

Looking back at these four consecutive days of open-source releases, we can see that LingBot by Ant has a grand plan.

These four open-source projects together form a very clear technological through-line:

Day 1: LingBot-Depth solves the problem of "seeing clearly", making perception sharper.

Day 2: LingBot-VLA solves the problem of "connection", a general interface linking language and vision to action.

Day 3: LingBot-World solves the problem of "understanding", building a world model that can predict and imagine.

Day 4: LingBot-VA solves the problem of "action", embedding the world model into the control loop so that imagination guides action.

These four pieces of the puzzle together send a strong signal:

General-purpose robots are moving fully into the video era.

Video is no longer just training material. It is becoming a medium for reasoning and a unified representation that connects perception, memory, physics, and action.

This is of great value to the entire industry.

For general-purpose robots, long-term tasks, complex scenarios, and unstructured environments, once major weaknesses, now have systematic solutions.

From the perspective of embodied intelligence, the world model is no longer an optional extra. It has become a central capability, taking robots from "being able to move" to "thinking before moving".

Moreover, LingBot by Ant's sustained open-source effort provides not just code and models but a reproducible, extensible technical paradigm.

The butterfly effect is also beginning to show in the industry.

In the past few days, Google announced that more people can try Genie 3 through Project Genie; Unitree Robotics announced the open-source release of UnifoLM-VLA-0...

Overseas media have also paid considerable attention to LingBot by Ant's open-source releases, commenting:

Ant Group has released a high-quality robotics AI simulation environment called LingBot-World. The Chinese fintech company has now completed a full open-source toolkit for developing physical AI systems, a strategic move in the global race for leadership in robotics.

Well, LingBot by Ant has put enough pressure on the industry.

All in all, the arrival of LingBot-VA marks the first time a world model has truly taken center stage in robot control.

Project address: https://technology.robbyant.com/lingbot-va

GitHub address: https://github.com/robbyant/lingbot-va

Model weights: https://huggingface.co/robbyant/lingbot-va
https://www.modelscope.cn/collections/Robbyant/LingBot-va

This article is from the WeChat official account "QbitAI" (量子位).