200,000 4D interaction sequences + kinematic anchoring: Nanyang Technological University gets generative simulation to stop "imagining" robot movements.
To train robots at low cost, researchers often rely on simulators to mimic robot-environment interactions.
However, traditional simulators are constrained by rigid, preset physical rules, while emerging simulators built on video generation models often merely "imagine" interactions in flat 2D space.
To address this, MMLab at Nanyang Technological University has built a high-fidelity 4D spatiotemporal training ground for embodied intelligence.
Simulating robot-environment interactions lies at the core of embodied intelligence. Recent studies have shown that video generation technology can break through the "rigid" visual and physical limitations of traditional simulators. However, these works mainly operate in 2D space or rely solely on static-environment guidance, ignoring a fundamental fact: the interaction between a robot and the world is essentially a 4D spatiotemporal event that demands precise interaction modeling.
To restore this essence while ensuring precise robot control, MMLab at Nanyang Technological University has proposed a brand-new 4D generative embodied simulator - Kinema4D. It redefines generative simulation around the principle of "decoupling control and environment", enabling the model to "perceive" the robot's exact 4D operation trajectory and infer the environment's response. For the first time, it demonstrates the zero-shot generalization potential of generative simulators, opening a new high-fidelity 4D path for large-scale training of next-generation embodied intelligence.
Background and Challenges
In embodied intelligence, simulating robot trajectories is crucial for large-scale data augmentation, policy evaluation, and reinforcement learning, and the high cost and safety risks of deploying real robots make virtual simulation an indispensable alternative. Although traditional physics simulators have made significant progress, they lack visual realism and depend on preset physical rules, making them hard to extend to complex new scenes.
Recently, researchers have begun to use video generation models to synthesize robot-environment interactions, bypassing the cumbersome physical modeling by using actions as conditional prompts.
However, existing generative simulation methods still have key drawbacks:
1. Dimension Missing: Most models are limited to the 2D pixel space, lacking the 4D spatiotemporal constraints required for robot interactions.
2. Insufficient Precision: Most studies rely on high-level language instructions, implicit action understanding, or static-environment priors, forcing the generative model to "guess" the robot's likely actions. Such signals cannot provide the precise control and dynamic guidance that high-fidelity modeling demands, so performance degrades in complex situations such as deformation or occlusion.
As shown in Figure 1, existing methods struggle to satisfy dynamic guidance, operational precision, and spatiotemporal perception simultaneously. To address this, the paper proposes Kinema4D, which anchors abstract actions in 4D space through kinematics, guiding the generative model to reliably generate complex dynamic interactions while preserving precision and spatiotemporal perception.
Core Method
As shown in Figure 2, the core motivation of Kinema4D is to restore the 4D spatiotemporal essence of the interaction process while ensuring precise robot control. Based on the design philosophy of "simulation decoupling", the interaction process is decomposed into robot control and the resulting environmental changes, supported by the following two synergistic insights:
i) Precise Kinematics-Driven 4D Action Representation: Robot actions are physically determinate in 4D space and should not be "predicted" or "guessed" by the generative model. Abstract joint angles or pose sequences only become meaningful when mapped onto the physical structure. Kinema4D therefore applies explicit kinematics to a 3D-reconstructed URDF model to generate continuous, physically accurate 4D trajectories, providing a fine-grained spatiotemporal causal driving signal for the interaction.
ii) 4D Modeling of Environmental Response under Controllable Generation: Unlike deterministic robot control, complex environmental dynamics demand highly flexible generative modeling. Kinema4D projects the derived 4D robot trajectory into a spatiotemporal pointmap signal that guides the generative model, freeing it from modeling the robot's own kinematics so it can focus on synthesizing the environment's reactive dynamics.
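As a toy illustration of insight (i), the sketch below (not the paper's implementation) chains homogeneous transforms of a simple planar revolute arm to turn per-frame joint angles into 3D surface-point positions, i.e. a 4D (time + 3D) trajectory. All link lengths, sampled points, and angles are invented for the example:

```python
import numpy as np

def rot_z(theta: float) -> np.ndarray:
    """4x4 homogeneous rotation about the local z-axis (revolute joint)."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def translate(xyz) -> np.ndarray:
    """4x4 homogeneous translation (fixed link offset, as a URDF would specify)."""
    T = np.eye(4)
    T[:3, 3] = xyz
    return T

def forward_kinematics(joint_angles, link_offsets):
    """Chain joint rotations and link offsets into per-link world poses."""
    poses, T = [], np.eye(4)
    for theta, offset in zip(joint_angles, link_offsets):
        T = T @ rot_z(theta) @ translate(offset)
        poses.append(T.copy())
    return poses  # one 4x4 transform per link

# A toy 2-DoF planar arm: two unit-length links, two time steps.
offsets = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
angles_per_frame = [(0.0, 0.0), (np.pi / 2, 0.0)]

# Two sampled "mesh" points per link, in each link's local frame.
link_points = np.array([[0.0, 0.0, 0.0, 1.0], [0.5, 0.0, 0.0, 1.0]])

trajectory_4d = []  # list of (t, world-frame surface points)
for t, angles in enumerate(angles_per_frame):
    frame_points = [(T @ link_points.T).T[:, :3]
                    for T in forward_kinematics(angles, offsets)]
    trajectory_4d.append((t, np.concatenate(frame_points)))
```

Because the transforms are computed explicitly, every surface point's position at every time step is physically determinate rather than left for a generative model to guess.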
By predicting RGB and pointmap sequences in sync, Kinema4D turns simulation into a spatiotemporal reasoning task in a unified 4D space, achieving visual realism while guaranteeing geometric consistency.
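For insight (ii), the 4D trajectory must become a per-pixel conditioning signal. A minimal pinhole-projection sketch with z-buffering is shown below; the intrinsics and points are toy values, and the paper's actual pointmap construction may differ:

```python
import numpy as np

def points_to_pointmap(points_xyz, K, H, W):
    """Rasterize 3D points (camera frame) into an HxWx3 pointmap.

    Each covered pixel stores the (x, y, z) of the nearest point along
    the ray (z-buffering); uncovered pixels stay NaN. A toy stand-in
    for a pointmap conditioning signal."""
    pointmap = np.full((H, W, 3), np.nan)
    zbuf = np.full((H, W), np.inf)
    for x, y, z in points_xyz:
        if z <= 0:
            continue  # behind the camera
        u = int(round(K[0, 0] * x / z + K[0, 2]))
        v = int(round(K[1, 1] * y / z + K[1, 2]))
        if 0 <= u < W and 0 <= v < H and z < zbuf[v, u]:
            zbuf[v, u] = z
            pointmap[v, u] = (x, y, z)
    return pointmap

# Toy intrinsics; two points project to the same pixel, nearer one wins.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 1.0]])
pm = points_to_pointmap(pts, K, 64, 64)
```

The key property is that the signal lives in image space (so a video model can consume it) while still carrying full 3D coordinates per pixel.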
Dataset
A large-scale dataset is the cornerstone of training world models. Therefore, as shown in Figure 3, this paper constructs Robo4D-200k - the largest 4D robot interaction dataset to date.
The dataset integrates diverse real-world demonstration data such as DROID, Bridge, and RT-1, and adds LIBERO simulation data to synthesize a large number of both successful and failed cases. Each sequence records a complete robot-world interaction (such as "grasp and place"), giving the model the continuous spatiotemporal information that robust reasoning requires. Robo4D-200k contains 201,426 high-fidelity interaction sequences; its scale and interaction diversity make it feasible to train an embodied foundation model with spatiotemporal and physical awareness.
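For intuition only, one such interaction sequence might be organized along the following lines. The field names and shapes below are illustrative assumptions, not the actual Robo4D-200k schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionSequence:
    """One robot-world interaction clip (hypothetical schema)."""
    source: str              # e.g. "DROID", "Bridge", "RT-1", "LIBERO"
    instruction: str         # e.g. "grasp the cup and place it on the shelf"
    success: bool            # LIBERO synthesis keeps failed rollouts too
    rgb: np.ndarray          # (T, H, W, 3) uint8 video frames
    pointmaps: np.ndarray    # (T, H, W, 3) per-pixel 3D coordinates
    joint_angles: np.ndarray # (T, DoF) robot states driving the kinematics

seq = InteractionSequence(
    source="LIBERO",
    instruction="pick up the mug",
    success=False,
    rgb=np.zeros((8, 64, 64, 3), dtype=np.uint8),
    pointmaps=np.zeros((8, 64, 64, 3), dtype=np.float32),
    joint_angles=np.zeros((8, 7), dtype=np.float32),
)
# All modalities share the same time axis, so RGB, geometry, and
# robot state stay aligned frame by frame.
assert seq.rgb.shape[0] == seq.pointmaps.shape[0] == seq.joint_angles.shape[0]
```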
Experimental Analysis
The paper benchmarks the proposed method comprehensively along three dimensions: video generation quality, geometric quality, and downstream policy evaluation.
Regarding video generation quality, Kinema4D has achieved leading results, as shown in Table 1. Its visualization results are shown in Figure 2. Compared with Ctrl-World [ICLR 2026], Kinema4D can better restore robot actions and obtain environmental response results similar to the ground truth (GT).
Regarding geometric quality, Kinema4D also outperforms another recent 4D generative simulator, TesserAct [ICCV 2025], as shown in Table 2; visualizations appear in Figure 3. Kinema4D accurately reproduces the outcome of the ground-truth trajectory, including cases where the robot fails a task by a hair's breadth. In the lower-left example, even though the RGB textures of the gripper and the plant overlap in the 2D view, Kinema4D still identifies the spatial gap between them and thus correctly simulates the arm failing to grasp the plant.
The paper also examines Kinema4D's utility as a high-fidelity tool for robot policy evaluation: whether the simulator can faithfully predict the real outcome of executing a policy rollout. Evaluation is deployed in two settings: a standardized simulation platform (noise-free environment) and the real world (complex physical environment).
As shown in Figures 6 and 7, Kinema4D's simulations closely match actual execution, accurately synthesizing both successful rollouts and cases where the task fails by a hair's breadth. Even when the RGB textures of the gripper and the object overlap in the 2D view, the model still identifies the spatial gap between them.
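This "spatial gap despite 2D overlap" reasoning can be mimicked numerically: given a predicted pointmap and pixel masks for the gripper and the object, the minimum 3D distance between the two regions reveals whether a grasp actually succeeds. A toy sketch with invented values:

```python
import numpy as np

def min_3d_gap(pointmap, mask_a, mask_b):
    """Smallest 3D distance between two pixel regions of a pointmap.

    Even when two regions overlap in the 2D image, their 3D points can
    be far apart; a gap above the grasp tolerance implies a miss."""
    pts_a = pointmap[mask_a]  # (Na, 3)
    pts_b = pointmap[mask_b]  # (Nb, 3)
    diffs = pts_a[:, None, :] - pts_b[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).min()

# Toy 2x2 pointmap: the gripper pixel sits 5 cm in front of the plant
# pixel in depth, even though the pixels are adjacent in the image.
pm = np.zeros((2, 2, 3))
pm[0, 0] = [0.0, 0.0, 0.50]  # gripper point (z = 0.50 m)
pm[0, 1] = [0.0, 0.0, 0.55]  # plant point   (z = 0.55 m)
mask_gripper = np.array([[True, False], [False, False]])
mask_plant = np.array([[False, True], [False, False]])

gap = min_3d_gap(pm, mask_gripper, mask_plant)  # ~0.05 m: grasp misses
```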
Notably, for the real-world policy evaluation experiment, Kinema4D was not fine-tuned on any real-world data; the physical environment used for testing is entirely out-of-distribution (OOD) for the model. This is the first time an embodied generative world model has demonstrated generalization potential under strict OOD conditions.
Summary and Outlook
Kinema4D marks a leap in the robot simulation paradigm from traditional 2D pixel generation to 4D spatiotemporal reasoning. Through the original decoupling framework of "kinematic anchoring" and "generative evolution", it successfully integrates definite mechanical control with flexible environmental feedback.
Experiments have proven that Kinema4D can not only bridge the gap between the virtual and the real but also demonstrate strong zero-shot generalization ability. It paves a brand-new 4D path for building a high-fidelity and scalable embodied intelligence training ground.
Looking ahead, extreme physical scenarios governed by conservation laws remain challenging; how to inject explicit physical laws (such as mass, friction, and collision dynamics) deep into the generative network is a direction worth exploring.
The first author of this paper is Mutian Xu, a postdoctoral fellow at MMLab, Nanyang Technological University. His supervisor, Prof. Ziwei Liu, is the corresponding author.
Paper Title: Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
Paper Link: https://arxiv.org/abs/2603.16669
Project Homepage: https://mutianxu.github.io/Kinema4D-project-page/
Open Source Code: https://github.com/mutianxu/Kinema4D
This article is from the WeChat official account "QbitAI", author: Fei Yang. Republished by 36Kr with permission.