A model that lets robots learn about the world through "imagination" has arrived, jointly developed by the research group of a Physical Intelligence (PI) co-founder and the team led by Chen Jianyu at Tsinghua University.
Over the past two days, Chelsea Finn, co-founder of Physical Intelligence (PI), has been repeatedly praising a new world-model work from a Stanford research group on X.
Generating good-looking videos is easy; the hard part is building a world model that is genuinely useful for robots: one that follows actions closely and is accurate enough to avoid frequent hallucinations.
This work is Ctrl-World, a controllable generative world model jointly proposed by the research group she leads at Stanford and Chen Jianyu's team at Tsinghua University.
It is a breakthrough solution that lets robots rehearse tasks, evaluate policies, and self-iterate in an "imagined space".
The headline result: without any additional real-world robot data, the model significantly improves a policy's instruction-following ability on several downstream tasks, raising the success rate from 38.7% to 83.4%, an average gain of 44.7 percentage points.
The related paper "CTRL-WORLD: A CONTROLLABLE GENERATIVE WORLD MODEL FOR ROBOT MANIPULATION" has been published on the arXiv platform.
Note: Ctrl-World is designed for policy-in-the-loop rollouts of generalist robot policies. It generates joint multi-view predictions (including the wrist view), achieves fine-grained action control through frame-level conditioning, and maintains coherent long-horizon dynamics through pose-conditioned memory retrieval. Together, these components enable: (1) accurate policy evaluation in imagination that aligns with real-world rollouts; (2) targeted policy improvement through synthetic trajectories.
Research Background: The "Real-World Dilemma" of Robot Training and the Value of World Models in Breaking the Deadlock
Currently, although Vision-Language-Action (VLA) models perform well across many manipulation tasks and scenarios, they still face two core challenges in open-world settings, which are also the team's core motivations for developing Ctrl-World:
Challenge 1: Policy evaluation is costly. Real-world testing is expensive and inefficient.
Verifying a robot policy's performance requires repeated trial and error across different scenes and tasks.
Take the "object grasping" task as an example. Researchers need to prepare objects of different sizes, materials, and shapes, and set up environments with different lighting and table textures. The robot then needs to perform the operation hundreds or thousands of times.
Moreover, problems such as robotic-arm collisions (with failure rates of roughly 5%-8%) and object damage (with losses exceeding 1,000 yuan per testing round) can occur during testing, and evaluating a single policy often takes several days. More importantly, sampled tests cannot cover every potential scenario, so policy defects are hard to expose fully.
Challenge 2: Policy iteration is difficult. Real-world data is never sufficient.
Even π₀.₅, a mainstream model trained on the DROID dataset of 95k trajectories spanning 564 scenes, succeeds only 38.7% of the time on unfamiliar instructions such as "grasp the object in the upper-left corner" or "fold the patterned towel", or on unseen objects such as gloves and staplers.
Traditional improvement pipelines rely on human experts to annotate new data, but annotation cannot keep pace with new scenarios: a senior engineer needs 20 hours to annotate 100 high-quality towel-folding trajectories, at a cost exceeding 10,000 yuan, and even then cannot cover every irregular object and instruction variation.
To reduce this dependence on the real world, the academic community has tried training robots in imagination using world models, i.e., learned virtual simulators. But beyond the open-world problems above, existing world models face three major pain points of their own.
As the research team points out in the paper, most existing world-model methods focus on passive video prediction and cannot actively interact with a high-level generalist policy.
Specifically, three key limitations prevent them from supporting policy-in-the-loop rollouts:
- A single view leads to hallucinations
Most models simulate only a single third-person view, creating a partial-observability problem. For example, when the arm grasps an object, the model cannot see the contact between the wrist and the object, so hallucinations such as "the object teleports into the gripper without any physical contact" can occur.
- Coarse action control
Traditional models mostly condition on text or an initial image and cannot bind high-frequency, fine-grained action signals. For example, they cannot accurately distinguish the arm moving 6 cm along the Z-axis from moving 4 cm, so the virtual rehearsal deviates from the real motion.
- Poor long-horizon consistency
As the prediction horizon grows, small errors accumulate, causing "temporal drift". The paper's experiments show that after about 10 seconds of rollout, object positions in traditional models deviate so far from physical plausibility that the rehearsal loses its reference value.
To address these issues, Chen Jianyu's team at Tsinghua University and Chelsea Finn's team at Stanford jointly proposed Ctrl-World, aiming to build a virtual training space that simulates accurately, stays stable over long horizons, and aligns with the real world, allowing robots to train through "imagination".
Three Innovative Technologies Enable Ctrl-World to Break Through the Limitations of Traditional World Models
Ctrl-World resolves these pain points through three targeted designs, achieving high-fidelity, controllable, and long-horizon-coherent virtual rehearsal.
The paper emphasizes that together these three innovations turn a passive video-generation model into a simulator that can interact with VLA policies in a closed loop.
Ctrl-World is initialized from a pre-trained video diffusion model and adapted into a controllable, temporally consistent world model via:
Multi-view input and joint prediction
Frame-level action conditioning
Pose-conditioned memory retrieval
First, Multi-view Joint Prediction: Eliminating Blind Spots to Cut the Hallucination Rate
Previous models relied on single-view prediction, which leads to partial observability and hallucinations; a lone third-person view simply lacks information.
Ctrl-World instead jointly generates third-person global views plus a first-person wrist view, producing future rollouts that are accurate and consistent with reality:
The third-person view provides global environmental context (such as the overall layout of objects on the table), while the wrist view captures contact details (such as the friction between the gripper and a towel, or the collision point with a drawer);
The model stitches multi-view image tokens with a spatial Transformer (each frame contains three 192×320 images, each encoded into a 24×40 latent grid), aligning spatial relationships across views.
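The token stitching above can be sketched at the level of tensor shapes. This is a minimal illustration using the paper's stated numbers (three 192×320 views per frame, each mapped to a 24×40 latent grid); the pooling "encoder" and the embedding dimension are stand-ins, since the real model uses a pre-trained video-diffusion encoder:

```python
import numpy as np

# Shape-level sketch of multi-view token stitching. The view count and
# resolutions come from the paper; encode_view and EMBED_DIM are
# illustrative stand-ins for the pre-trained video-diffusion encoder.
FRAME_H, FRAME_W = 192, 320   # per-view image resolution
LAT_H, LAT_W = 24, 40         # per-view latent grid (8x spatial downsample)
NUM_VIEWS = 3                 # third-person views + the wrist view
EMBED_DIM = 16                # illustrative token channel dimension

def encode_view(image: np.ndarray) -> np.ndarray:
    """Stand-in encoder: average-pool 8x8 patches, then project channels."""
    patches = image.reshape(LAT_H, 8, LAT_W, 8, 3).mean(axis=(1, 3))
    proj = np.random.default_rng(0).standard_normal((3, EMBED_DIM))
    return patches @ proj     # (24, 40, EMBED_DIM)

def stitch_views(views: list) -> np.ndarray:
    """Concatenate per-view latents into one token sequence so a spatial
    Transformer can attend across views within the same frame."""
    tokens = [encode_view(v).reshape(-1, EMBED_DIM) for v in views]
    return np.concatenate(tokens)   # (3 * 24 * 40, EMBED_DIM)

frame = [np.zeros((FRAME_H, FRAME_W, 3)) for _ in range(NUM_VIEWS)]
tokens = stitch_views(frame)
print(tokens.shape)           # (2880, 16)
```

Flattening all three views into one 2,880-token sequence is what lets a single attention layer relate, say, a wrist-view contact patch to the corresponding table location in the third-person view.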
The paper's experiments verify the value of this design:
In fine manipulation tasks involving contact between the arm and objects (such as grasping small items), the wrist view accurately captures the contact state between the gripper and the object (such as pinch force and contact position), greatly reducing "grasping without physical contact" hallucinations.
Quantitatively, this design lowers the object-interaction hallucination rate: in multi-view evaluation, Ctrl-World reaches a Peak Signal-to-Noise Ratio (PSNR) of 23.56, well above the single-view baselines WPE (20.33) and IRASim (21.36), and its Structural Similarity Index (SSIM) of 0.828 also clearly exceeds the baselines (WPE 0.772, IRASim 0.774), confirming high consistency between the imagined frames and the real scene.
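For reference, PSNR follows the standard definition 10 · log10(MAX² / MSE); the snippet below uses that textbook formula, not the paper's evaluation script:

```python
import numpy as np

# PSNR as commonly used for video-prediction evaluation: the standard
# 10 * log10(MAX^2 / MSE) definition over a pair of images.
def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

# A uniform error of 0.1 on images in [0, 1] gives MSE = 0.01, i.e. 20 dB.
a = np.zeros((24, 40))
b = np.full((24, 40), 0.1)
print(round(psnr(a, b), 2))   # 20.0
```

Higher PSNR means smaller pixel-level error between predicted and real frames, which is why the 23.56 vs. ~20-21 gap above is meaningful.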
Second, Frame-level Action Conditioning: Binding the Causal Link Between Action and Vision for Centimeter-level Manipulation
To make the virtual rehearsal controllable, a strong causal relationship between action and vision must be established.
Ctrl-World's solution is frame-level action binding:
Convert the action sequence output by the robot policy (e.g., joint velocities) into end-effector poses in Cartesian space;
Use a frame-level cross-attention module to strictly align each frame's visual prediction with its corresponding pose, like a storyboard matching each scene of a play, ensuring that action A necessarily yields visual result B.
Note: The figure above shows Ctrl-World's controllability and the corresponding ablations. Different action sequences produce different rollouts with centimeter-level accuracy in Ctrl-World. Removing memory blurs the prediction (blue), while removing the frame-level pose condition reduces control accuracy (purple). The attention visualization (left) shows strong attention to the t = 0 s frame with the same pose when predicting the t = 4 s frame, indicating that memory retrieval is effective. For clarity, each action chunk is described in natural language (e.g., "Z-axis −6 cm"). Due to space limits, only the wrist view of the middle frame is visualized.
The paper provides intuitive examples:
When the arm performs different spatial displacements or pose adjustments (such as centimeter-level movements along a given axis, or opening and closing the gripper), Ctrl-World generates rollouts that strictly correspond to those actions; even subtle differences of a few centimeters are accurately distinguished and simulated.
Ablation experiments confirm this: removing the frame-level action condition drops the model's PSNR from 23.56 to 21.20 and raises LPIPS (perceptual distance; lower is better) from 0.091 to 0.109, showing that this design is the core of precise control.
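The frame-level binding can be sketched as per-frame cross-attention, where each frame's visual tokens query only that frame's pose embedding. All dimensions and the two-token pose encoding below are illustrative assumptions, not the paper's exact module:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(frame_tokens, pose_tokens, wq, wk, wv):
    """Visual tokens of one frame attend ONLY to that frame's pose tokens,
    binding each predicted frame to the pose commanded at the same step."""
    q, k, v = frame_tokens @ wq, pose_tokens @ wk, pose_tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return frame_tokens + attn @ v          # residual update

rng = np.random.default_rng(1)
d = 16                                       # illustrative channel dim
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

# 4-frame clip: 8 visual tokens per frame, 2 pose tokens per frame
# (e.g., one for Cartesian translation, one for rotation + gripper state).
video = rng.standard_normal((4, 8, d))
poses = rng.standard_normal((4, 2, d))

# Conditioning is strictly per frame: pose t never leaks into frame t+1.
out = np.stack([cross_attend(video[t], poses[t], wq, wk, wv)
                for t in range(4)])
print(out.shape)                             # (4, 8, 16)
```

The per-frame loop is the point: because frame t attends only to pose t, changing a single pose (say, Z-axis 6 cm vs. 4 cm) can only alter the corresponding frame and its successors, which is what makes the rollout controllable.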
Third, Pose-conditioned Memory Retrieval: A "Stabilizer" That Prevents Drift in 20-second Rollouts
"Temporal drift" in long rollouts is essentially the model forgetting its historical state.
Ctrl-World introduces a pose-conditioned memory retrieval mechanism that solves this in two steps:
Sparse memory sampling: sample k frames (k = 7 in the paper) from the historical trajectory at a fixed stride (e.g., every 1-2 seconds), avoiding the computational burden of an overly long context;
Pose-anchored retrieval: embed the arm's pose into the sampled frames' visual tokens; when predicting a new frame, the model automatically retrieves history frames whose poses are similar to the current one, anchoring the prediction to the historical state and preventing drift.
Note: The figure above shows Ctrl-World's consistency. Because the wrist camera's field of view changes drastically within a single trajectory, multi-view information and memory retrieval are crucial for generating consistent wrist-view predictions. Predictions highlighted in green boxes are inferred from the other camera views, while those in red boxes are retrieved from memory.
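The two steps of the memory mechanism can be sketched as follows. Here k = 7 is the paper's value, while the 15 fps frame rate, the xyz pose format, and the nearest-pose distance are illustrative assumptions:

```python
import numpy as np

K_MEMORY = 7    # number of memory frames kept (paper's value)
STRIDE = 15     # sample every ~1 s, assuming a 15 fps frame stream

def sample_memory(history_poses, history_frames):
    """Sparse sampling: keep a fixed-size memory instead of full context."""
    idx = list(range(0, len(history_frames), STRIDE))[-K_MEMORY:]
    return [history_poses[i] for i in idx], [history_frames[i] for i in idx]

def retrieve(current_pose, mem_poses, mem_frames):
    """Pose-anchored retrieval: fetch the memory frame whose pose is
    closest to the current one to anchor the new prediction."""
    dists = [np.linalg.norm(current_pose - p) for p in mem_poses]
    return mem_frames[int(np.argmin(dists))]

# 8 seconds of history at 15 fps; frame ids stand in for latent frames,
# poses are end-effector xyz positions along a straight-line motion.
history_poses = [np.array([0.01 * t, 0.0, 0.3]) for t in range(120)]
history_frames = list(range(120))
mem_p, mem_f = sample_memory(history_poses, history_frames)
print(len(mem_f))                                          # 7
print(retrieve(np.array([1.05, 0.0, 0.3]), mem_p, mem_f))  # 105
```

Retrieving the history frame with the most similar pose gives the predictor a ground-truth-like reference for "what the scene looked like from here before", which is how drift is suppressed over 20-second rollouts.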