Elephants transform into excavators in seconds: A new breakthrough in 3D deformation without additional training
[Introduction] Nanjing University and Peking University have proposed MorphAny3D, which enables 3D generation models to achieve smooth cross-category deformation without training. By innovatively fusing source and target features through an attention mechanism, it precisely controls the structure and timing, easily accomplishing complex deformations with results far superior to traditional methods.
3D deformation aims to achieve a smooth transition from a source object to a target object. Although 2D image generation models have promoted significant progress in image deformation technology, 3D deformation still faces the following bottlenecks due to the complexity of its spatial structure:
(1) Difficulty in cross-category matching: Traditional 3D deformation methods based on matching rely heavily on the dense correspondence between the source and the target. When dealing with cross-category objects (such as "an elephant transforming into an excavator"), this matching mechanism often fails, leading to structural distortion or even collapse during the deformation process, as shown in Figure 1-(a).
(2) Lack of temporal consistency: Another intuitive approach is to first generate a sequence using 2D deformation and then "lift" it to 3D through a 3D generation model. However, this method lacks inter-frame constraints, making it difficult to ensure the temporal consistency of the deformation, as shown in Figure 1-(b).
Figure 1: Qualitative comparison of different deformation schemes. α is the deformation weight for controlling the deformation progress.
Currently, the field of 3D generation is progressing rapidly. In particular, Trellis [1] has achieved high-quality and diverse image-to-3D generation by encoding 3D assets into structured latents (SLAT).
This raises the question: Can SLAT be introduced into 3D deformation to fully utilize its powerful 3D generation prior?
MorphAny3D is based on this motivation. By deeply exploring the fusion rules of SLAT in the attention mechanism, a series of efficient training-free components are constructed to achieve smooth and reasonable cross-category 3D deformation.
In response to this problem, Sun Xiaokun from the PCA-Lab of Nanjing University, advised by Associate Professor Zhang Zhenyu, proposed the training-free 3D deformation framework MorphAny3D in recent work at CVPR 2026.
Project homepage: https://xiaokunsun.github.io/MorphAny3D.github.io
Paper link: https://arxiv.org/pdf/2601.00204
Code link: https://github.com/XiaokunSun/MorphAny3D
This method successfully activates the potential of the 3D generation prior in the field of deformation by skillfully fusing the features of the original object and the target object in the attention mechanism of the large 3D generation model, achieving high-quality cross-category 3D deformation.
In addition, MorphAny3D has strong generalization ability, supporting various applications such as decoupled deformation, dual-target deformation, and 3D stylization, and can be seamlessly migrated to 3D generation models of the same architecture. The code is now open source!
MorphAny3D
Figure 2: Framework diagram of MorphAny3D.
Figure 2-(a) shows the framework of MorphAny3D. Based on the SLAT fusion rules observed by the authors in the cross/self-attention modules, the authors introduced the Morphing Cross-Attention module (MCA, Figure 2-(b)) and the Temporal-Fused Self-Attention module (TFSA, Figure 2-(c)) to improve the rationality and temporal coherence of the deformation. In addition, the authors also proposed an Orientation Correction strategy (OC, Figure 2-(d)), which is based on the statistical analysis of the orientation distribution of the Trellis generation results and aims to suppress sudden orientation jumps.
Figure 3: Quantitative comparison of different deformation schemes.
Fusion rules of SLAT in the attention mechanism
In the early stage of the research, the authors tried the most direct fusion scheme: directly interpolating the image conditions and initial noise of the source object and the target object. As Figure 1-(c) shows, this strategy performs poorly, which is also confirmed by the quantitative metrics FID [2] (lower means better rationality) and PPL [3] (lower means better smoothness) in Figure 3.
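For concreteness, this naive baseline amounts to plain linear interpolation applied to both conditioning signals. A minimal sketch follows; the array shapes and variable names are hypothetical placeholders, not taken from the paper's code:

```python
import numpy as np

def lerp(src, tgt, alpha):
    """Naive baseline: linearly blend source and target quantities with
    deformation weight alpha in [0, 1]."""
    return (1.0 - alpha) * src + alpha * tgt

# Hypothetical placeholders for the image-condition features and the
# initial noise of the source and target objects.
cond_src, cond_tgt = np.zeros((4, 8)), np.ones((4, 8))
noise_src, noise_tgt = np.full((16, 8), -1.0), np.full((16, 8), 1.0)

alpha = 0.25
cond_t = lerp(cond_src, cond_tgt, alpha)     # interpolated image condition
noise_t = lerp(noise_src, noise_tgt, alpha)  # interpolated initial noise
```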
To seek a better solution, the authors migrated the attention key-value fusion strategy verified to be effective in previous deformation work [3, 4] to SLAT. This strategy is expressed as:

O_t = Attention(Q_t, (1 − α_t)·K_src + α_t·K_tgt, (1 − α_t)·V_src + α_t·V_tgt)

where Q_t represents the query of the t-th deformation frame, (K_src, V_src) and (K_tgt, V_tgt) are the key and value from the source object and the target object respectively, and α_t is the deformation weight controlling the deformation progress. In the Cross-Attention module, the key and value come from the image conditions guiding the generation; in the Self-Attention module, the key and value come from the latent features themselves.
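The key-value fusion strategy can be sketched in a few lines. This is a minimal single-head, unbatched illustration under assumed shapes, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention, single head, no batching."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def kv_fused_attention(q_t, k_src, v_src, k_tgt, v_tgt, alpha):
    """KV fusion: interpolate keys and values *before* the attention
    call, i.e. the 2D morphing strategy transplanted to SLAT."""
    k = (1.0 - alpha) * k_src + alpha * k_tgt
    v = (1.0 - alpha) * v_src + alpha * v_tgt
    return attention(q_t, k, v)
```

At α = 0 this reduces to pure source attention, and at α = 1 to pure target attention, with the blend happening inside a single attention call in between.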
The authors compared the effects of fusing only in the Cross-Attention module (KV-Fused CA), only in the Self-Attention module (KV-Fused SA), and applying both (see Figure 1-(d, e, f) and Figure 3), and drew the following conclusions:
- KV-Fused CA can significantly enhance the structural and semantic rationality of 3D deformation by fusing 2D conditional semantics in cross-attention (achieving the lowest FID), but it is prone to producing locally distorted structures, as shown in Figure 1-(d).
- KV-Fused SA can effectively improve the smoothness and continuity of the sequence by aggregating 3D latent features in self-attention (achieving the lowest PPL).
- However, when the above strategies are applied simultaneously, KV-Fused SA will interfere with and destroy the structural rationality brought by KV-Fused CA, failing to achieve the expected effect of 1+1>2.
It can be seen that compared with simple feature interpolation, deeper key-value fusion has initially released the potential of SLAT in the field of deformation. However, to achieve truly continuous and reasonable cross-category deformation, further improvements are still needed.
For this reason, the authors made targeted modifications based on the attention key-value fusion, thus fully releasing the performance potential of SLAT.
Morphing Cross-Attention module (MCA)
As mentioned above, although KV-Fused CA improves the structural rationality of the deformation, it inevitably introduces local artifacts. The authors speculate that the root of the problem lies in the semantic confusion generated during the "patch-wise" feature fusion process of the source image and the target image. Specifically, the key and value in cross-attention come from the patch-wise DINOv2 features. However, the source and target image features aligned in space do not necessarily have the same semantics. This direct weighted summation often causes the generation model to receive guiding information with semantic conflicts, which in turn generates distorted structures. To verify this conjecture, the authors analyzed the attention maps of the head SLAT (marked by a red star) under different mechanisms, as shown in Figure 4.
Figure 4: Attention maps under different cross-attention mechanisms. The red star marks the head SLAT, and the pink star marks the corresponding head image condition. The orange box highlights the incorrect attention focus of KV-Fused CA. MCA retains the correct, semantically consistent attention, thus avoiding the local distorted structure of KV-Fused CA shown in Figure 1-(d).
Observing the second and third columns of Figure 4, it can be found that the native cross-attention accurately focuses on the corresponding image condition (marked by the pink star) when processing the head SLAT, proving its ability to implicitly establish the semantic correspondence between 2D conditions and 3D latent features. However, KV-Fused CA in the fourth column incorrectly focuses on the background area (see the orange box), so semantically mismatched features mislead the generation process and ultimately cause local distortion.
For this reason, the authors proposed the Morphing Cross-Attention module (MCA). Different from KV-Fused CA, which fuses the key and value in advance, MCA adopts the strategy of "calculating independently first, then fusing the outputs with weights":

O_t = (1 − α_t)·Attention(Q_t, K_src, V_src) + α_t·Attention(Q_t, K_tgt, V_tgt)
As shown in the last column of Figure 4, MCA maintains accurate attention to the semantically consistent area by independently processing the source features and the target features, thus avoiding the artifacts observed in KV-Fused CA. Although the change of MCA seems to be just an adjustment of the calculation and fusion order, its core value lies in: inheriting the "precise focusing" characteristic of the native attention mechanism and ensuring the semantic consistency of the conditional features, providing a simple and efficient solution for high-quality 3D cross-category deformation.
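The reordering MCA performs ("attend first, then blend the outputs") can be sketched as follows. Again this is a toy single-head illustration with assumed shapes, not the released code:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention, single head, no batching."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def morphing_cross_attention(q_t, k_src, v_src, k_tgt, v_tgt, alpha):
    """MCA: attend to the source and target conditions *independently*,
    then blend the two outputs with deformation weight alpha. Each
    branch keeps its native, semantically consistent attention map."""
    out_src = attention(q_t, k_src, v_src)
    out_tgt = attention(q_t, k_tgt, v_tgt)
    return (1.0 - alpha) * out_src + alpha * out_tgt
```

Because softmax is nonlinear, blending outputs is not equivalent to blending keys and values: here the attention map of each branch is computed against a single, unmixed condition, which is exactly the property the authors exploit.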
Temporal-Fused Self-Attention module (TFSA)
Although MCA ensures the rationality of the structure and semantics, due to the lack of explicit temporal dependence between frames, there is still room for improvement in the smoothness of the deformation sequence.
For this reason, the authors proposed the Temporal-Fused Self-Attention module (TFSA).
Different from KV-Fused SA, which fuses the key and value directly before the attention calculation, TFSA adopts a backward temporal-constraint strategy. When generating the t-th frame, TFSA fuses the attention output of the current frame with the attention output computed over the key and value of the previous frame:

O_t = (1 − β)·Attention(Q_t, K_t, V_t) + β·Attention(Q_t, K_{t−1}, V_{t−1})

where β controls how strongly the memory of the previous frame influences the current frame. Different from KV-Fused SA, TFSA fuses the features of already-generated adjacent deformation frames, which not only enhances the smoothness of the sequence but also avoids destroying semantic rationality through global feature aggregation, achieving a balance between temporal stability and spatial structural consistency.
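A minimal sketch of this backward temporal fusion, under the same single-head toy setting and our reading of the description (the exact fusion form in the released code may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention, single head, no batching."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def temporal_fused_self_attention(q_t, k_t, v_t, k_prev, v_prev, beta):
    """TFSA: blend the current frame's self-attention output with the
    output obtained by attending to the *previous* frame's keys/values,
    injecting a backward temporal constraint with weight beta."""
    out_cur = attention(q_t, k_t, v_t)
    out_prev = attention(q_t, k_prev, v_prev)
    return (1.0 - beta) * out_cur + beta * out_prev
```

With β = 0 the module degenerates to ordinary per-frame self-attention; increasing β pulls each frame toward the memory of its predecessor.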
Orientation Correction strategy (OC)
In addition, the authors observed that the orientation of the object sometimes changes suddenly during the deformation process, as shown in Figure 5-(a). Even though TFSA improves temporal consistency, it is difficult for it to handle such large-scale pose mutations. To solve this problem, the authors analyzed a large number of orientation-mutation cases in the deformation sequences generated under MCA and TFSA, and summarized two features:

(1) The orientation jumps are mainly concentrated in the middle stage of the deformation. At this time, the source and target image conditions are in an "ambiguous" transition state (as shown in Figure 5-(b)), and the generation model is easily interfered with.

(2) The strong orientation changes are highly concentrated at yaw angles of 90°, 180°, and 270°, while the pitch angle and roll angle remain basically stable, as shown in Figure 5-(c).

This indicates that the orientation jump is not random noise but stems from a systematic deviation. The authors speculate that its root lies in the pose prior learned by the Trellis generation model.
By analyzing the pose distribution of 1000 Trellis generation samples, as shown in Figure 5-(d), the authors found that although the vast majority of samples maintain the standard pose, the non-standard poses are precisely clustered at the above yaw angles, thus confirming the strong coupling between the orientation jump and the pose distribution learned by Trellis.
Figure 5: (a) Example of orientation mutation, (b) Distribution map of deformation weight when orientation mutation occurs, (c) Distribution map of orientation change when orientation mutation occurs, (d) Distribution map of orientation of Trellis generation results.
Based on these observations, the authors proposed the Orientation Correction strategy (OC). Its core process is as follows:
After generating the sparse structure of the t-th frame, four yaw-angle rotation candidates (0°, 90°, 180°, and 270°) are first created; then the chamfer distance between each candidate and the structure of the previous frame is calculated, and the candidate with the smallest distance is selected as the corrected structure.

Since the orientation jump mainly occurs in the middle stage of the deformation, this strategy effectively constrains the pose of the subsequently generated objects by leveraging the stable pose of the early stage. When the object orientation does not jump, the unrotated (0°) candidate is naturally closest to the previous frame, so the correction leaves the structure unchanged.
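The candidate-selection step can be sketched directly on point sets. This is an illustrative implementation under assumed conventions (yaw taken as rotation about the vertical axis; a brute-force chamfer distance), not the authors' code, which operates on Trellis sparse structures:

```python
import numpy as np

def yaw_rotation(deg):
    """Rotation matrix for a yaw of `deg` degrees; treating yaw as
    rotation about the z axis is an assumption of this sketch."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def chamfer(a, b):
    """Symmetric chamfer distance between point sets a (N,3) and b (M,3),
    computed by brute-force pairwise distances."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def orientation_correction(pts_t, pts_prev):
    """OC: rotate the current frame's structure by each of the four yaw
    candidates and keep the one closest to the previous frame."""
    candidates = [pts_t @ yaw_rotation(d).T for d in (0, 90, 180, 270)]
    dists = [chamfer(c, pts_prev) for c in candidates]
    return candidates[int(np.argmin(dists))]
```

For example, if a frame comes out rotated 90° in yaw relative to its predecessor, the 270° candidate undoes the jump and is selected, restoring a pose consistent with the earlier frames.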