
The "Ready Player One" scenario becomes a reality. NTU releases a new paradigm for world model interaction, solving the problem of active operation.

新智元 (Xinzhiyuan) · 2026-04-14 15:34
Hand2World: Gesture-controlled AI-generated interactive videos to achieve closed-loop interaction.

[Introduction] The MMLab team at Nanyang Technological University has launched Hand2World, enabling an AI world model to truly "reach out" and interact. Simply gesture in the air, and the model generates realistic first-person interaction videos, responding and adjusting in real time. It abandons occlusion-corrupted 2D mask conditioning, decouples hand and head motion through 3D hand structure and ray encoding, and for the first time achieves closed-loop continuous interaction. Built on fully automatic annotation of monocular videos, it paves the way for AR and robot interaction. The world model no longer just "sees" - it can "touch".

Sora can generate a realistic visual world, and Genie 3 allows you to freely explore in a 3D scene - but you can only "look" and can't reach in to grab the cup on the table.

Current world models already have "eyes" and "legs" - they can perceive the environment and move the perspective, but they always lack a pair of "hands".

They can see and move but cannot interact. This is the last hurdle for the world model to move from passive observation to active control. And the most primitive interface for humans to interact with the physical world is gestures.

The MMLab team at Nanyang Technological University proposed Hand2World[1] - given a scene photo, users only need to make gesture movements in the air, and the AI can generate realistic first-person videos of hands reaching into the scene to grab a cup, turn a book, or open a box. Moreover, this is not a one-time generation: users can adjust their gestures while watching the generated results, and the model follows up in real-time - forming a true closed-loop interaction.

Paper link: https://arxiv.org/abs/2602.09600

Project homepage: https://hand2world.github.io

Why can't existing methods solve the problem?

Imagine you train an AI and let it watch tens of thousands of videos of hands grabbing cups. Now show it a hand waving in the air - it will be at a loss. Because in the training data, the hand is always half-blocked by the cup or the book, and the AI has never seen what a "complete hand" looks like. As a result, when faced with a complete hand shape, it creates non-existent occlusions out of thin air.

This is the fatal flaw of all methods based on 2D hand masks: they see a truncated hand during training but receive a complete hand during inference, creating a direct distribution mismatch between the two. The following figure shows this clearly: in the upper row of training scenes, the mask is truncated by objects, while in the lower row of free-space gestures, the mask is complete. Existing methods (such as CosHand) therefore produce severe artifacts.

Mask distribution mismatch vs. Hand2World's occlusion-invariant conditional signal

To make matters worse, in first-person videos, hand movements and the wearer's head rotations are completely entangled in the frame - the model can't tell "whether the hand is moving or the head is moving", and the background will drift along with the hand.

Recently, there have also been efforts to advance first-person world models - for example, PlayerOne[2] made important progress by synchronously pairing first-person and third-person cameras to model self-motion.

However, this approach not only limits the scalability of the data but also restricts practical applications. Can we solve all the above problems starting from only monocular videos? This is exactly the research starting point of Hand2World.

How does Hand2World achieve it?

Method flowchart

Let the model "see" the complete hand

Hand2World completely abandons the 2D mask. It recovers the complete 3D hand mesh (MANO model) from monocular videos, projects it onto the image plane, and renders it as a composite signal of "filled contour + wireframe overlay". Regardless of whether the hand is occluded by an object, the format of this control signal remains consistent.

Key insight: The occlusion relationship is not hard-coded in the input signal but is left to the generative model to infer based on the scene context. The wireframe overlay can also provide additional joint structure information when the palm faces the camera and the fingers occlude each other - something that a pure contour can't do.
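The idea of projecting a 3D hand mesh and rendering it as a "filled silhouette + wireframe" control map can be sketched in a few lines. The code below is a toy illustration, not the paper's pipeline: it uses a plain pinhole projection, rasterizes wireframe edges with naive line sampling, and approximates the silhouette with a filled bounding region (a real implementation would rasterize the full MANO mesh). All function names here are made up for illustration.

```python
import numpy as np

def project_points(pts3d, K):
    """Pinhole projection of 3D points (camera frame) to pixel coordinates."""
    p = (K @ pts3d.T).T
    return p[:, :2] / p[:, 2:3]

def draw_segment(img, p0, p1):
    """Rasterize one wireframe edge into a binary mask (toy line drawing)."""
    n = int(max(abs(p1 - p0).max(), 1)) + 1
    for s in np.linspace(0.0, 1.0, n):
        x, y = np.round(p0 + s * (p1 - p0)).astype(int)
        if 0 <= y < img.shape[0] and 0 <= x < img.shape[1]:
            img[y, x] = 1.0

def hand_condition_signal(verts3d, edges, K, H, W):
    """Render an occlusion-invariant control map: silhouette + wireframe channels.

    The signal depends only on the 3D hand, never on what occludes it,
    so it has the same format during training and free-space inference.
    """
    pts2d = project_points(verts3d, K)
    wire = np.zeros((H, W), np.float32)
    for i, j in edges:
        draw_segment(wire, pts2d[i], pts2d[j])
    # Crude silhouette: filled bounding region of the projected vertices
    # (a real pipeline would rasterize the mesh faces instead).
    sil = np.zeros((H, W), np.float32)
    x0, y0 = np.clip(pts2d.min(0).astype(int), 0, [W - 1, H - 1])
    x1, y1 = np.clip(pts2d.max(0).astype(int), 0, [W - 1, H - 1])
    sil[y0:y1 + 1, x0:x1 + 1] = 1.0
    return np.stack([sil, wire], axis=0)  # (2, H, W) conditioning channels
```

Because the map is rendered from the recovered 3D mesh rather than segmented from pixels, an object in front of the hand cannot truncate it, which is exactly the occlusion invariance the method relies on.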

Distinguish between "hand movement" and "head movement"

In the ablation study, removing the camera modeling module caused the FVD to soar from 218 to 815 - the background started to drift along with the hand.

Hand2World explicitly encodes camera motion with a per-pixel Plücker ray embedding and injects it into the diffusion model additively through a lightweight adapter. This decouples hand articulation from head ego-motion.
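A per-pixel Plücker embedding assigns every pixel a 6-channel code: the world-space direction of its viewing ray plus the ray's moment (camera center crossed with direction), which together identify the ray independent of any point on it. A minimal numpy sketch, assuming standard pinhole intrinsics and a world-to-camera pose (the function name is illustrative, not the paper's code):

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plucker ray embedding (6 channels: direction + moment).

    K: (3,3) intrinsics; R, t: world-to-camera rotation and translation.
    Returns an (H, W, 6) array encoding the camera pose pixel by pixel.
    """
    # Camera center in world coordinates.
    o = -R.T @ t                                                       # (3,)
    # Homogeneous pixel grid.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    # Back-project to world-space ray directions and normalize.
    d = (R.T @ (np.linalg.inv(K) @ pix)).T                             # (H*W, 3)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    # Plucker moment o x d: invariant to sliding the origin along the ray.
    m = np.cross(np.broadcast_to(o, d.shape), d)
    return np.concatenate([d, m], axis=1).reshape(H, W, 6)
```

Because the embedding changes only when the camera moves, the model gets an unambiguous signal for head ego-motion that is independent of the hand-conditioning channels.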

Comparison of camera control ablation. When there is no camera condition (upper row), the background drifts severely. After adding Plücker rays (middle row), it is highly consistent with the real video (lower row).

Closed-loop interaction, infinite continuation

Hand2World distills the bidirectional diffusion teacher model into a causal autoregressive generator, maintains temporal coherence through KV cache, and supports streaming output. This makes the entire system form a closed loop - users can adjust their gestures while watching, and the model continues to respond, allowing the interaction to go on infinitely.
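The closed-loop pattern - generate one step, cache its context, read the user's next gesture, repeat - can be illustrated with a toy stand-in for the distilled model. This is purely conceptual: the "cache" below plays the role a transformer's KV cache plays, and the arithmetic inside `step` is a placeholder, not the paper's network.

```python
import numpy as np

class StreamingGenerator:
    """Toy causal generator with a sliding cache (conceptual sketch only).

    Stands in for the distilled autoregressive model: each step consumes
    the newest gesture control plus cached context and emits one frame.
    """
    def __init__(self, dim=4, window=8):
        self.dim = dim
        self.window = window
        self.cache = []  # cached context from past steps (the KV-cache role)

    def step(self, control):
        # Context summary over the cache (placeholder for causal attention).
        ctx = np.mean(self.cache, axis=0) if self.cache else np.zeros(self.dim)
        frame = 0.5 * ctx + 0.5 * control        # toy "next frame"
        self.cache.append(frame)
        self.cache = self.cache[-self.window:]   # bounded memory -> infinite streaming
        return frame

gen = StreamingGenerator()
# The user adjusts the gesture mid-stream; each frame reflects the latest control.
stream = [gen.step(np.full(4, g)) for g in [1.0, 1.0, 0.0]]
```

The sliding window is what makes "infinite continuation" cheap: per-step cost stays constant no matter how long the interaction runs, while the cache carries enough recent context to keep frames temporally coherent.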

Experimental results, leading in three datasets

It achieved the best results on three first-person interaction datasets: ARCTIC, HOT3D, and HOI4D. Taking ARCTIC as an example:

  • FVD: 908 → 218 (a decrease of 76%)
  • Camera trajectory error: 0.13 → 0.07 (a decrease of 42%)
  • DINO semantic similarity: 0.80 → 0.88
  • Depth consistency: Depth-ERR decreased from 22.51 to 16.14

After distillation, Hand2World-AR performs close to the teacher model (FVD 232) and reaches 8.9 FPS on a single A100 GPU.

The data flywheel of embodied intelligence: fully automatic monocular annotation

Where does the training data for Hand2World come from? Different from solutions like PlayerOne that rely on multi-camera synchronous collection, the team developed a fully automatic monocular annotation pipeline - no multi-camera array is needed, and no manual annotation is required. It directly extracts hand meshes, camera trajectories, and training data pairs from ordinary first-person videos. This means that any existing egocentric video can be converted into a training signal - providing a truly scalable solution for large-scale data collection in embodied intelligence.

From "seeing the world" to "touching the world"

As a preliminary attempt to introduce gesture interaction into the world model, Hand2World has built a complete system from data annotation to closed-loop generation. With the rapid improvement of video generation capabilities, this system is expected to be applied to gesture interaction in AR/MR glasses, data synthesis for robot hand-object interaction, and the construction of an interactive virtual environment from a single photo.

When the world model no longer just passively generates images but can respond to every gesture of the user and continuously evolve - the distance from "seeing the world" to "touching the world" may be closer than we think.

References:

[1] Wang et al., "Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures," arXiv:2602.09600, 2026.

[2] Tu et al., "PlayerOne: Egocentric World Simulator," Advances in Neural Information Processing Systems (NeurIPS), 2025.

This article is from the WeChat public account 新智元 (Xinzhiyuan), edited by LRST. Republished by 36Kr with permission.