
Real-robot RL has gone wild: a robot self-learns for 20 minutes and scores a perfect 100, with the digital twin as the unsung hero.

新智元 2026-02-13 15:29
Behind this is a fast-rising embodied-intelligence company: Zhijian Dynamics.

[Introduction] TwinRL builds a digital twin from a single mobile-phone scan of the scene, letting robots explore boldly and fail safely in the twin first. Back on the real robot, it covers the entire desktop with a 100% success rate within 20 minutes, 30% faster than existing methods while cutting human intervention by more than half.

What happened when the robot truly "stepped out of the demonstration data"?

You spent two weeks teleoperating a robotic arm, step by step, teaching it to pick up a banana and place it on a plate. On the left side of the table, it learned well and seemed confident.

Then you moved the banana 15 centimeters to the right.

The robotic arm froze.

It's not that it "didn't learn well", but rather that it had never seen that position before.

For it, the right-hand side of the table is another universe.

This is not a joke. This is the reality for almost all VLA models in the real world in 2025.

In the past two years, Vision-Language-Action (VLA) models have swept through the robotics field.

From "see an image + hear an instruction + act" to generalized multi-task, multi-scenario execution, VLA makes robots look, for the first time, like agents that "understand the world".

In papers, success rates routinely exceed 90%, and the demo videos are beautifully shot.

But anyone who has run real-robot experiments knows there is a question everyone is aware of yet few answer directly:

Can robots learn on their own without continuous human demonstrations?

The answer: almost no.

The cruel reality:

  • Human demonstration (teleoperation) is expensive, inefficient, and limited in coverage: one person driving a joystick for a whole day covers only a small patch of the desktop.
  • Online reinforcement learning (RL) on real robots is slow, dangerous, and resource-hungry: a single bad exploratory move by a robotic arm can destroy a sensor.

But these are not the most fatal issues.

The most fatal one is this:

The exploration space of RL is firmly locked by SFT demonstration data.

Even if you give the robot more rewards, it will only circle around the "area near the demonstration data".

It's like a person who has only ever walked around their own neighborhood: tell them to "explore the world", and they will still end up back at their own doorstep.

Exploration simply doesn't happen.

This problem has been avoided for too long.

It wasn't until TwinRL that it was torn open and put on the table for the first time.

Recently, Zhijian Dynamics, together with the State Key Laboratory of Multimedia Information Processing at Peking University's School of Computer Science, Tsinghua University, and the Hong Kong University of Science and Technology, proposed TwinRL (Digital Twin-Driven Reinforcement Learning), a digital-twin-collaborative reinforcement learning framework for real-world robot manipulation that runs online reinforcement learning efficiently and directly on real robots while systematically expanding the exploration space.

According to industry insiders, Zhijian Dynamics' valuation is approaching unicorn territory. Receiving such concentrated backing from top-tier capital just six months after founding is extremely rare across the entire embodied-intelligence track.

The core insight of TwinRL: the problem with RL is not that it cannot learn, but that its exploration space is restricted.

Through systematic real - robot experiments, the TwinRL team observed a key phenomenon:

In the real world, the effective exploration space of VLA is almost completely determined by the SFT data distribution.

What does this mean?

  • RL acts more like "re-weighting" than "blazing new trails".
  • The out-of-distribution (OOD) area is almost unreachable for the SFT model.
  • Even with human-in-the-loop, the boundary only shifts slowly.

The problem lies not in the algorithm, but in the exploration space itself.
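This "restricted support" idea can be made concrete with a toy check. A minimal sketch, assuming a nearest-neighbor radius as the support criterion (the article does not specify one); the positions and threshold are illustrative:

```python
# Hypothetical sketch: deciding whether a target object position lies inside
# the support of the SFT demonstration distribution. Nearest-neighbor distance
# is a simple stand-in criterion, not TwinRL's published definition.
import math

def in_sft_support(target, demo_positions, radius=0.05):
    """Return True if `target` (x, y in meters) lies within `radius`
    of any demonstrated object position."""
    return any(math.dist(target, p) <= radius for p in demo_positions)

# Demos clustered on the left side of the table (area A).
demos = [(0.10, 0.20), (0.12, 0.25), (0.08, 0.22)]

print(in_sft_support((0.11, 0.21), demos))  # near the demos -> True
print(in_sft_support((0.26, 0.21), demos))  # 15 cm to the right -> False
```

A policy trained only on the left cluster has no reliable behavior for targets outside this support, which is exactly the "banana moved 15 centimeters" failure above.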

So, a bold idea emerged:

If parallel exploration is not possible in the real world, then move the "exploration" to a "controllable and scalable world" in advance.

This world is the Digital Twin.

TwinRL: Not a "simulator" but an exploration amplifier and guide

Unlike the traditional "simulation + real2sim" pipeline, the digital twin here is not meant to replace the real world but to amplify real-world exploration.

TwinRL builds a digital-twin/real-robot collaborative reinforcement learning framework around three steps:

1. Exploration Space Expansion

  • Scan the real scene with a mobile phone.
  • Efficiently reconstruct a high-fidelity digital twin based on 3D Gaussian Splatting.
  • Generate synthetic trajectories in the twin environment that far exceed the coverage of human demonstrations.
  • Explicitly broaden the support of the data distribution in the SFT stage.

It's not about "learning better", but about starting in a larger world from the beginning.
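The steps above can be sketched in miniature. Everything below (the workspace bounds, the `scripted_pick` planner) is an illustrative assumption, not TwinRL's actual data pipeline:

```python
# Hypothetical sketch of exploration-space expansion: in the digital twin the
# object can be placed anywhere on the desktop and a scripted trajectory
# recorded, so SFT data covers far more than the teleoperated patch.
import random

WORKSPACE = {"x": (0.05, 0.45), "y": (0.10, 0.40)}  # meters, full desktop

def scripted_pick(obj_xy):
    """Toy stand-in for a twin-environment planner: move above the object,
    descend to grasp, lift. Returns a list of (x, y, z) waypoints."""
    x, y = obj_xy
    return [(x, y, 0.20), (x, y, 0.05), (x, y, 0.20)]

def expand_sft_dataset(n, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        obj = (rng.uniform(*WORKSPACE["x"]), rng.uniform(*WORKSPACE["y"]))
        dataset.append({"object_xy": obj, "trajectory": scripted_pick(obj)})
    return dataset

data = expand_sft_dataset(30)
print(len(data))  # 30 synthetic demos spanning the whole desktop
```

The point is distributional: the synthetic demos widen the support that the later RL stage is allowed to explore.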

2. Parallel Online RL in the Digital Twin

Real robots cannot run parallel trial-and-error, but the digital twin can.

Before deployment, TwinRL:

  • Execute online RL efficiently and in parallel in the digital twin.
  • Generate high-quality RL-style exploration trajectories, bridging offline → online.

This step greatly alleviates the cold-start and instability problems of RL in the real world.
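The parallel-rollout idea can be sketched with a toy environment. A real system would use a GPU-parallel simulator and a full actor-critic update; both are replaced here by illustrative stand-ins:

```python
# Minimal sketch of parallel rollout collection in the twin. `TwinEnv` is a
# toy one-step environment, not the actual simulator.
import random

class TwinEnv:
    """Toy twin environment: reward 1 if the action lands near the goal."""
    def __init__(self, goal, seed):
        self.goal, self.rng = goal, random.Random(seed)

    def rollout(self, policy):
        action = policy(self.rng)
        return 1.0 if abs(action - self.goal) < 0.1 else 0.0

def collect_parallel(envs, policy):
    # A real implementation steps all envs in lockstep on the GPU; this
    # sequential loop is equivalent for illustration.
    return [env.rollout(policy) for env in envs]

envs = [TwinEnv(goal=0.5, seed=i) for i in range(16)]
policy = lambda rng: rng.uniform(0.4, 0.6)  # already near-optimal policy
returns = collect_parallel(envs, policy)
print(len(returns))  # 16 rollouts collected in one batch
```

Sixteen rollouts per batch in the twin cost seconds; the same budget on a real arm would take many minutes and risk hardware.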

3. Sim-to-Real Guided Human-in-the-Loop Exploration

The digital twin is not only "numerous" but also "accurate".

TwinRL will:

  • Automatically identify configurations in the twin with a high failure rate but dense information.
  • Precisely guide humans to intervene only at the "most valuable positions".
  • Significantly reduce ineffective demonstrations and repeated operations.

Humans are no longer laborers but strategic guides.
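"Where should the human intervene?" can be sketched as a ranking over twin-evaluated configurations. The scoring rule below (keep configs that sometimes succeed but mostly fail, hardest first) is an illustrative assumption, not TwinRL's published criterion:

```python
# Hypothetical sketch: surface configurations that are hard but learnable,
# i.e. high failure rate with non-zero success, as targets for human demos.
def rank_for_intervention(stats, low=0.1, high=0.5):
    """stats: {config_id: success rate over twin rollouts}.
    Keep configs that sometimes succeed but mostly fail; hardest first."""
    hard = {c: s for c, s in stats.items() if low <= s <= high}
    return sorted(hard, key=lambda c: hard[c])

twin_stats = {
    "left_near": 0.95,  # already solved -> no demo needed
    "right_far": 0.20,  # hard but learnable -> ask the human here
    "center":    0.45,
    "occluded":  0.00,  # no signal yet -> demos may be wasted
}
print(rank_for_intervention(twin_stats))  # ['right_far', 'center']
```

Filtering out both the solved and the hopeless configurations is what turns the human from a laborer into a strategic guide.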

Digital Twin - Collaborative Reinforcement Learning Framework TwinRL

Unlike previous methods that achieve a high success rate only for a single initial configuration, TwinRL does not aim for "100% at one point"; it sustains a 100% success rate across a much wider workspace, including the out-of-distribution (OOD) area.

Across four real-world manipulation tasks, TwinRL needs only about 20 minutes on average for online reinforcement learning to converge, at least 30% faster than existing real-robot RL methods, while significantly reducing the need for human intervention.

Moreover, even under object-position perturbations and environmental changes, TwinRL maintains stable performance, demonstrating stronger spatial generalization and exploration.

Paper link: https://arxiv.org/abs/2602.09023

Project homepage: https://sites.google.com/view/twinrl/twinrl

1. Abstract

Although Vision-Language-Action (VLA) models have shown good generalization in robot manipulation tasks, their real-world application is still constrained by costly human demonstration data and limited real-world interaction.

Online Reinforcement Learning (RL) provides an effective way to improve model capabilities based on environmental feedback. However, in real - robot scenarios, its exploration efficiency and scalability are still significantly limited.

Through systematic real-robot experiments, the research team found that the effective exploration space of online reinforcement learning in the real world is highly correlated with the data distribution used in the Supervised Fine-Tuning (SFT) stage.

In this context, this paper proposes TwinRL, a digital-twin/real-robot collaborative reinforcement learning framework, aiming to systematically expand and guide the exploration process of VLA models.

TwinRL first uses real-scene data captured with a mobile phone to efficiently reconstruct a high-fidelity digital twin environment, enabling two-way transfer between the real world and simulation.

During the supervised fine - tuning stage, the framework introduces an exploration space expansion strategy through the digital twin to explicitly broaden the support range of trajectory data distribution.

On this basis, TwinRL further proposes a sim-to-real guided exploration mechanism that runs online reinforcement learning efficiently and in parallel in the digital twin environment before deployment, effectively bridging offline training and real-world online learning.

In addition, the framework uses efficient sampling in the digital twin to identify key configurations with a high failure rate but dense information, which guide targeted human-in-the-loop exploration on the real robot.

Experimental results on multiple real-world robot manipulation tasks show that TwinRL achieves stable performance improvements in both the demonstration-covered area and the out-of-distribution area. While significantly reducing human intervention, it shortens the convergence time of online reinforcement learning on real robots to about 20 minutes, at least a 30% efficiency improvement over existing methods.

Figure 1: Overall framework (a)

2. Research Background

Vision-Language-Action (VLA) models have shown good generalization potential in robot manipulation tasks in recent years, directly mapping natural-language instructions to continuous control behaviors.

However, existing VLA methods still rely heavily on manually collected demonstration data (teleoperation) for real-world deployment. Such data is costly to acquire, limited in coverage, and unable to support long-term autonomous learning.

Reinforcement learning (RL) is considered an important means of breaking through the demonstration-data bottleneck, but directly applying online RL on real-robot systems faces practical constraints: low efficiency, high risk, and difficulty of parallelization.

Especially in complex physical environments, the robot's exploration space is strongly restricted by the initial supervised data distribution, making it difficult for online learning to effectively expand to uncovered areas.

3. Core Observations and Research Motivation

Figure 2: Exploration bottleneck.

Although online reinforcement learning (online RL) provides an exploration path to improve task robustness, its sample efficiency on real physical hardware still faces challenges.

Inspired by research in the general domain, we observed that in real-world VLA reinforcement learning, exploration is strictly constrained by the support of the trajectory distribution induced in the Supervised Fine-Tuning (SFT) stage.

This constraint creates a double bottleneck: (1) it limits the set of states the policy can reliably explore; (2) even when human intervention is introduced, it still significantly lowers the learning efficiency of online RL.

Experimental setup.

As shown in the figure, we conducted experiments on a high-precision block-insertion task, which demands high spatial positional accuracy. All policies are based on the Octo model. We divided the workspace into an in-distribution area A (covered by demonstration data) and an out-of-distribution area B (unseen during the SFT stage).

Bottleneck one. We analyzed how the spatial coverage of SFT demonstrations affects policy generalization and autonomous online RL. Specifically, we compared two training data distributions: A-only, using only 30 demonstrations from area A; and A+B, which adds 30 digital-twin demonstrations from area B. To measure how demonstration coverage shapes the exploration space, we initialized the policy from the A-only SFT model and ran autonomous online RL in the unseen area B.

Finding one. As shown in the figure, 10 rollouts were performed in each grid cell. In area B, the A+B policy reached a 62.5% success rate, while the A-only policy remained completely confined to area A (0% success in area B). This indicates that a standard SFT policy has extremely limited extrapolation ability in spatially uncovered areas. More importantly, when autonomous online RL in area B starts from the A-only model, an obvious exploration deadlock occurs: under the OOD initial configuration, even after 40K training steps (about two hours), the policy still could not stably obtain positive rewards. This is consistent with observations in prior work: the replay buffer becomes dominated by failed trajectories, rendering autonomous adaptation almost ineffective. The results show that the effective exploration space of online RL is highly correlated with the spatial coverage of the SFT data.
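The evaluation protocol (a grid over the workspace, 10 rollouts per cell) can be sketched as follows; `a_only_rollout` is a stand-in that mimics the reported A-only behavior, not the actual policy:

```python
# Sketch of per-cell evaluation: split the workspace into grid cells, run
# n rollouts per cell, and report the success rate for each cell.
def grid_success_rates(cells, rollout_fn, n=10):
    """cells: list of cell ids; rollout_fn(cell) -> bool success."""
    return {c: sum(rollout_fn(c) for _ in range(n)) / n for c in cells}

# Illustrative stand-in mirroring Finding one: the A-only policy
# succeeds only in area A cells.
def a_only_rollout(cell):
    return cell.startswith("A")

rates = grid_success_rates(["A1", "A2", "B1", "B2"], a_only_rollout)
print(rates)  # {'A1': 1.0, 'A2': 1.0, 'B1': 0.0, 'B2': 0.0}
```

Reporting per-cell rates rather than a single aggregate is what exposes the hard boundary between areas A and B.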

Bottleneck two. To alleviate the exploration deadlock, Human-in-the-Loop (HiL) intervention can be introduced to guide the robot through the task. The key question, however, is whether efficient online adaptation can be guaranteed in OOD scenarios when human guidance is available. To this end, we compared two settings: in-distribution post-training (online RL in area A) and out-of-distribution post-training (online RL in area B). All models were initialized from the same A-only SFT policy.

Finding two. Although both settings obtain successful corrective demonstrations under human intervention, the difference in sample efficiency is significant. As shown in the figure, in-distribution post-training adapts quickly, exceeding a 90% success rate in about 45 minutes (about 14K interaction steps); out-of-distribution post-training converges more slowly, is more unstable, and fails to reach comparable performance under the same interaction budget. These results indicate that even with the HiL mechanism, learning in the unseen area B remains difficult, mainly because the unfavorable reward landscape and the imbalanced replay-buffer distribution significantly reduce gradient efficiency.

Conclusion. These observations indicate that breaking through both bottlenecks requires expanding exploration coverage before real-world interaction and systematically guiding human intervention during the online stage. Based on this, we propose TwinRL: a reinforcement learning framework in which the digital twin and the real robot collaborate, with the digital twin serving as an exploration amplifier and guide throughout the SFT and online RL stages.

4. TwinRL Framework Overview

The entire framework consists of three tightly coupled stages: exploration space expansion, parallel online reinforcement learning in the digital twin, and sim-to-real guided real-world exploration.

Exploration space expansion strategy. First, we constructed a high-fidelity digital twin environment: real-world scenes are captured with a mobile phone and reconstructed with 3D Gaussian Splatting, achieving geometric and visual consistency between the real and simulated environments. Based on this twin environment, we introduced an exploration space expansion strategy during the Supervised Fine-Tuning (SFT) warm-up stage, generating trajectory data covering a wider