No need for multiple views: a single image is enough to reconstruct an interactive 3D model. Nanyang Technological University open-sources a structural reasoning framework.
Bring 3D models to life! A team from Nanyang Technological University has proposed MonoArt, which generates movable 3D models from a single image through step-by-step reasoning. The method first recovers the geometric structure, then identifies the components, and finally infers the motion types and parameters. Without any external data or extra priors, it constructs a 3D representation with motion capability, effectively improving the stability and practicality of reconstruction.
In the field of 3D generation, we are already accustomed to generating 3D object models from a single image.
However, with the explosion of Embodied AI, a new reality has emerged before researchers: most of these models are static assets that are difficult to interact with.
Want to open the door of a generated refrigerator? It is effectively welded shut. Want a robot to fold a generated chair? The model has no idea where the chair folds.
Recently, a research team from the S-Lab of Nanyang Technological University has proposed MonoArt to try to solve this problem efficiently: instead of letting the model directly "guess" how the object moves, it's better to let it "understand" the object's structure step by step first.
The core idea of MonoArt can be summarized in one sentence: model the monocular reconstruction of movable objects as a progressive structural reasoning process.
In this framework, the model doesn't output the articulation all at once. Instead, it sequentially completes geometric restoration, component perception, motion reasoning, and kinematic parameter estimation, and finally obtains a 3D representation with both shape and component hierarchy and joint information.
Paper link: https://arxiv.org/abs/2603.19231
Project link: https://lihaitian.com/MonoArt/
GitHub link: https://github.com/Quest4Science/MonoArt
Introduction
Different from static 3D reconstruction, articulated 3D reconstruction not only needs to restore the object's shape but also further model the component division, joint type, motion axis, rotation center, and motion range. The difficulty of this task lies not only in the need to predict more parameters but also in the fact that the structure and motion are coupled: it's difficult to infer how the movable component moves without knowing how it's divided; conversely, it's also difficult to truly build the structure of the movable component without understanding the motion relationship. That's why directly regressing the articulation from image features is often unstable and has limited generalization.
Existing methods can be roughly divided into three categories:
- Methods based on multi-view images or video: these rely on observing the same object in different open/closed states. They work well, but impose data requirements that real scenarios often cannot meet.
- Methods based on retrieval and assembly: these assemble movable objects from existing asset libraries, but are limited by the library's shape coverage, and the results often show geometric errors and texture mismatches.
- Methods based on additional priors: these infer articulation via vision-language models, auxiliary video generation, or predefined motion directions. They reduce the dependence on multi-view data, but the systems are more complex, lean more heavily on external priors, and usually need longer inference time.
These methods have a common problem: they don't really take structure understanding itself as the starting point of articulation inference.
They either rely on more observations to supplement information or rely on external prior knowledge to supplement clues, but they don't answer a more fundamental question: can a movable object in a single image be first disassembled into a stable geometric and component structure and then infer the motion relationship on this basis?
MonoArt is proposed to solve this problem. It no longer regards articulation as a direct regression result but models the monocular reconstruction of movable objects as a progressive structural reasoning process, putting geometry, part structure, and motion into the same continuous reasoning chain, making motion a natural result of structure understanding.
Method Design
Specifically, MonoArt consists of four key modules to achieve step-by-step reasoning from image → geometric restoration → component perception → motion reasoning → kinematic parameter estimation.
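To make the division of labor concrete, here is a minimal sketch of that four-stage flow as hypothetical Python stubs. All function names, tensor shapes, and returned values are illustrative placeholders invented for this article, not MonoArt's released API:

```python
import numpy as np

# Hypothetical sketch of MonoArt's four-stage pipeline.
# Every name and shape here is a placeholder, not the real implementation.

def reconstruct_geometry(image):
    """Stage 1: frozen 3D backbone -> canonical surface points + aligned latents."""
    n_points = 2048
    points = np.random.rand(n_points, 3)    # surface point cloud (stand-in)
    latents = np.random.rand(n_points, 64)  # aligned per-point latent features
    return points, latents

def perceive_parts(points, latents):
    """Stage 2: per-point embeddings that cluster by movable component."""
    return np.random.rand(len(points), 32)  # part-aware embeddings (stand-in)

def reason_motion(part_embed):
    """Stage 3: per-part semantic ("what") and spatial ("where") queries."""
    n_parts = 3
    content_q = np.random.rand(n_parts, 32)  # component semantics
    position_q = np.random.rand(n_parts, 3)  # spatial motion anchors
    return content_q, position_q

def estimate_kinematics(content_q, position_q):
    """Stage 4: joint type, axis, origin, motion range, parent link per part."""
    return [{"type": "revolute",
             "axis": [0.0, 0.0, 1.0],
             "origin": q.tolist(),
             "range": (-1.57, 0.0),
             "parent": 0} for q in position_q]

image = np.zeros((256, 256, 3))
points, latents = reconstruct_geometry(image)
joints = estimate_kinematics(*reason_motion(perceive_parts(points, latents)))
```

Each stage consumes only the previous stage's output, which is the point of the progressive design: motion parameters are read off an already-built structure instead of being regressed straight from pixels.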
Step 1: Get a reliable 3D shape first
The starting point of everything is to restore the 3D geometry of the object from a single image. MonoArt uses TRELLIS as a frozen 3D generation backbone to output a canonical mesh and the corresponding aligned latent features. The significance of this step is that all subsequent reasoning about "components" and "motion" is based on the 3D space rather than the 2D image - this is much more stable than directly regressing joint parameters from pixel features.
Step 2: Know which movable components the object consists of
After getting the 3D shape, the next question is: which parts of this shape can move? A cabinet's door and its body are two different components, but the mesh itself won't tell you this. The role of the Part-Aware Semantic Reasoner is to let the model "understand" the component structure.
It projects each surface point's geometric features onto three orthogonal planes (a triplane), captures global structural relationships with a Transformer, and finally generates for each point an embedding that encodes component membership.
During training, the triplet loss is used to increase the distance between the features of different components, so that the points belonging to the same component are clustered together, and the points of different components are far away from each other.
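The triplet objective itself is standard. Here is a minimal sketch with a generic triplet loss on toy per-point embeddings (not MonoArt's actual training code; the 2D vectors are made up for illustration):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull same-component point features together,
    push different-component features apart by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to same component
    d_neg = np.linalg.norm(anchor - negative)  # distance to other component
    return max(d_pos - d_neg + margin, 0.0)

# Toy embeddings: two points on a cabinet door, one on the cabinet body.
door_a = np.array([1.0, 0.0])
door_b = np.array([0.9, 0.1])
body   = np.array([0.0, 1.0])

print(triplet_loss(door_a, door_b, body))  # well-separated -> 0.0
print(triplet_loss(door_a, body, door_b))  # mixed-up pairing -> positive loss
```

Minimizing this over many sampled triplets is what makes points of the same component cluster and points of different components spread apart, as in the visualization described above.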
The following visualization intuitively shows the effect of this step: without this module, it's difficult for the point features to distinguish components at the motion level (the second column); after adding the module and triplet supervision, the features of different components can be well distinguished (the last column).
Step 3: Infer how each component moves
After knowing the component division, the next step is to infer the motion. But there is a subtle difficulty here: to describe the motion of a component, two different types of questions need to be answered simultaneously - "what it is" (semantic: is it a door or a drawer?) and "where its motion occurs" (spatial: where is the rotation center?).
If these two types of information are mixed in the same representation for end-to-end regression, the result is often unstable. MonoArt's Dual-Query Motion Decoder uses a decoupled design: content queries encode component semantics, position queries encode spatial motion anchors, and the two are gradually aligned through six layers of iterative refinement.
In each layer, the relationship between components is modeled through self - attention between queries, and then evidence is extracted from point features through cross - attention. This parallel iterative way of "figuring out what it is and where it is" makes motion reasoning more stable.
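The alternation of self-attention and cross-attention can be sketched in plain NumPy. The sketch below models only a single query stream with one unprojected attention head, so it is a schematic of the idea rather than the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no learned projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def decoder_layer(queries, point_feats):
    """One refinement layer, schematically: queries attend to each other
    (inter-component relations), then attend to point features (evidence).
    A simplified stand-in for the dual-query layer, not the real model."""
    queries = queries + attention(queries, queries, queries)
    queries = queries + attention(queries, point_feats, point_feats)
    return queries

rng = np.random.default_rng(0)
queries = rng.standard_normal((3, 32))    # one query per candidate part
points  = rng.standard_normal((512, 32))  # per-point geometric features

# Six stacked layers, mirroring the iterative refinement described above.
for _ in range(6):
    queries = decoder_layer(queries, points)
```

In the actual decoder the content and position queries are separate streams that get aligned across layers; here they are collapsed into one stream purely to show the self-attention/cross-attention rhythm.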
Step 4: Output physically usable kinematic parameters
Finally, the Kinematic Estimator converts the previous reasoning results into clear and physically interpretable outputs: the mask of each component, joint type (fixed, rotating, translating, etc.), rotation axis direction, rotation center position, and upper and lower limits of the motion range.
In addition, it predicts the parent-child relationships between components to build a complete kinematic tree, i.e., which component is connected to which.
A notable design detail is that the prediction of joint position adopts a residual form, using the position query (i.e., the component centroid) output in the previous step as the anchor point and only predicting the offset. Ablation experiments show that this is more accurate than directly regressing absolute coordinates - this also echoes the "progressive" design philosophy of the entire framework: each step builds on the previous one.
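The residual idea is simple enough to show in a few lines; the coordinates below are made up for illustration:

```python
import numpy as np

# Residual joint-origin prediction: instead of regressing the joint's
# absolute position, predict a small offset from the part centroid
# (the position query from the previous stage). Numbers are illustrative.

centroid = np.array([0.40, 0.10, 0.75])   # anchor: part centroid
offset   = np.array([-0.02, 0.00, 0.05])  # small learned residual

joint_origin = centroid + offset

# The network only needs to get a small correction right, which is an
# easier regression target than a full absolute 3D position.
```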
The step-by-step design of these four stages brings a direct benefit: the entire articulation reasoning requires no external priors - no multi-view input, no asset library, no VLM, no auxiliary video generation. So how well does it work in practice?
Experimental Results
In the PartNet-Mobility benchmark, MonoArt shows leading performance in both the 7-class and 46-class settings.
Compared with representative methods such as SINGAPO, URDFormer, Articulate-Anything, and PhysX-Anything, MonoArt achieves the best results on core metrics including geometric reconstruction quality, joint-type prediction, and key motion-parameter estimation, while also offering higher inference efficiency.
Where Articulate-Anything takes 229.9 s and PhysX-Anything takes 256.8 s, MonoArt needs only 20.5 s (18.2 s go to TRELLIS's 3D reconstruction; the articulation reasoning itself adds only about 2 s of overhead).
At the same time, in downstream tasks, objects reconstructed by MonoArt can be imported directly into IsaacSim for robotic-arm simulation training: a Franka arm can grasp objects and open doors without any additional joint annotations.
MonoArt can also be extended to generating scenes containing objects with movable components.
Limitations and Considerations
MonoArt provides a clear new path for monocular articulated 3D reconstruction: instead of relying on increasingly heavy external prior knowledge to "supplement" motion, it enables the model to truly learn why the object is composed in this way and why it can move in this way through progressive structural reasoning.
However, for small components with extremely unbalanced scales, uniform surface sampling may produce weak, indistinct features; for novel topologies or rare motion patterns, the model's motion-parameter predictions may also degrade. These problems leave room for future work.
References
[1] TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation. CVPR 2025.
[2] URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images. RSS 2024.
[3] SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects. ICLR 2025.
[4] Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model. ICLR 2025.
[5] PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image. CVPR 2026.
[6] DreamArt: Generating Interactable Articulated Objects from a Single Image. SIGGRAPH Asia 2025.
[7] Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics. ICCV 2025.
[8] PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects. ICCV 2023.
[9] ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting. CVPR 2025.
[10] PhysX-3D: Physical-Grounded 3D Asset Generation. NeurIPS 2025.
This article is from the WeChat official account "New Intelligence Yuan". Author: LRST. Republished by 36Kr with authorization.