Artificial intelligence improves its visual understanding by leaps and bounds through jigsaw puzzles, bids farewell to text-centric training, and offers a new annotation-free post-training paradigm for multimodal large models
In the post-training wave of multimodal large models (MLLMs), reinforcement-learning-driven paradigms have become a key direction for enhancing models' reasoning and general capabilities.
However, most existing methods remain text-centric, and the visual input is often used only passively as an auxiliary signal. The team argues that revisiting the potential of visual self-supervised learning in the post-training stage, and designing vision-centric post-training, is equally crucial for deepening MLLMs' fine-grained, in-depth understanding of visual information itself.
To this end, the latest paper "Visual Jigsaw Post-Training Improves MLLMs" from MMLab@Nanyang Technological University proposes a brand-new post-training task for multimodal large models - Visual Jigsaw.
It redesigns the classic self-supervised jigsaw puzzle task as the core objective in the post-training stage of multimodal large models, enabling the model to explicitly strengthen its visual perception and understanding abilities without relying on additional annotations or a visual generation module. Its effectiveness has been verified in three visual modalities: images, videos, and 3D.
Introduction to the Visual Jigsaw Method
Visual Jigsaw can be regarded as a general family of tasks for reconstructing the order of visual information. Given data in a visual modality (image, video, or 3D), the data is partitioned in a modality-specific way and randomly shuffled to obtain a set of sub-elements that serve as puzzle pieces. The model's goal is to reconstruct the visual information by predicting the pieces' correct order and outputting that arrangement in text form. The whole training process is optimized with the reinforcement learning algorithm GRPO.
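To make the setup concrete, here is a minimal, hypothetical sketch (not the authors' released code) of how a jigsaw training sample could be assembled from an ordered list of pieces; the helper name `make_jigsaw_sample` and its output convention are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the released implementation): shuffle an
# ordered list of visual pieces (image patches, video clips, 3D points) and
# record the permutation the model must recover in its text answer.
import random
from typing import List, Sequence, Tuple

def make_jigsaw_sample(pieces: Sequence, seed=None) -> Tuple[List, List[int]]:
    """Return (shuffled_pieces, ground_truth_order).

    ground_truth_order[i] is the original index of the piece shown at slot i
    after shuffling -- one natural convention for the text answer the model
    is asked to produce.
    """
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)
    shuffled = [pieces[i] for i in order]
    return shuffled, order

# Example with 4 placeholder pieces: the model sees the shuffled pieces and
# must answer with the permutation `gt`, e.g. "[2, 0, 3, 1]".
shuffled, gt = make_jigsaw_sample(["piece_0", "piece_1", "piece_2", "piece_3"], seed=0)
```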
Because each Visual Jigsaw instance has a ground-truth (GT) arrangement, the answer can be verified directly. The team designed a hierarchical reward mechanism: the reward is 1 when the prediction is completely correct; when only some positions are correct, the reward equals the fraction of correct positions multiplied by a discount coefficient, preventing the model from over-relying on partial matches; and if the output is not a valid arrangement, the reward is 0.
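A minimal sketch of that hierarchical reward, assuming the model's text output has already been parsed into a list of indices; `gamma` is a placeholder name for the discount coefficient, whose actual value is not specified here.

```python
def jigsaw_reward(pred, gt, gamma=0.5):
    """Hierarchical reward sketch: 1.0 for a fully correct permutation,
    a discounted fraction for partial correctness, 0.0 for invalid output."""
    # Not a valid arrangement: wrong length, or not a permutation of the GT indices.
    if len(pred) != len(gt) or sorted(pred) != sorted(gt):
        return 0.0
    n_correct = sum(p == g for p, g in zip(pred, gt))
    if n_correct == len(gt):
        return 1.0                          # completely correct
    return gamma * n_correct / len(gt)      # partial credit, discounted

# Example: with gt = [2, 0, 3, 1], the prediction [2, 0, 1, 3] has 2 of 4
# positions correct and receives gamma * 2 / 4.
```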
For different visual modalities, the specific designs of the Visual Jigsaw tasks are as follows:
Image Jigsaw: The image is divided into equal-sized sub-images over the 2D plane. After shuffling, the model must restore the correct spatial order (a construction sketch follows this list).
Video Jigsaw: The video is segmented into equal-length video clips in the time dimension. The model needs to reconstruct the original temporal order.
3D Jigsaw: Multiple points with depth values are sampled from an RGB-D image; their positions are marked on the image with shuffled index labels, and the model must restore their depth order from near to far.
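As a concrete illustration of the image variant, here is a minimal sketch (an assumption, not the released code) that cuts an image into an n x n grid of equal patches in row-major order; the patches could then be shuffled with the `make_jigsaw_sample` helper sketched above.

```python
from PIL import Image

def split_into_patches(img: Image.Image, grid: int = 2):
    """Return the grid x grid equal-sized patches of `img` in row-major order."""
    w, h = img.size
    pw, ph = w // grid, h // grid
    patches = []
    for row in range(grid):
        for col in range(grid):
            box = (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)
            patches.append(img.crop(box))
    return patches

# patches = split_into_patches(Image.open("example.jpg"), grid=2)
# shuffled, gt = make_jigsaw_sample(patches)   # the model must recover `gt`
```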
Experimental Results
The effectiveness of Visual Jigsaw has been verified across the image, video, and 3D modalities:
Image Jigsaw
After training with Image Jigsaw, the model shows stable improvements on three types of vision-centric benchmarks:
1) Fine-grained perception and understanding, 2) Spatial perception and understanding based on monocular images, 3) Compositional visual understanding and reasoning.
The results show that introducing Image Jigsaw post-training into multimodal large models significantly enhances their perception and fine-grained visual understanding, which is exactly what existing post-training strategies, focused mainly on reasoning, tend to lack.
This improvement comes from the demands of the jigsaw task itself: the model must attend to the details of local patches, reason about the overall spatial layout, and understand the relationships between patches, all of which directly promote fine-grained, spatial, and compositional understanding.
Video Jigsaw
After training with Video Jigsaw, the model shows stable improvements on various general video understanding benchmarks. This method generally enhances the model's perception and understanding of videos, and the improvement is particularly significant in tasks that require temporal dimension reasoning and understanding of temporal directionality (such as AoTBench).
Meanwhile, the significant improvement on CVBench also confirms the enhancement of the model's cross-video understanding and reasoning. This indicates that the video jigsaw task helps the model better capture temporal continuity, understand relationships across videos, and reason about directional consistency, ultimately improving its overall, general understanding of videos.
3D Jigsaw
After training with 3D Jigsaw, the model achieves significant improvements on a variety of 3D benchmarks. The most prominent gain appears on DA-2K, which is directly related to depth estimation and is therefore a direct reflection of the depth-ordering training objective. More importantly, consistent improvements are also observed across a wide range of other tasks, including single-view benchmarks (such as 3DSRBench and OmniSpatial), multi-view benchmarks (such as ViewSpatial and All-Angles), and egocentric (first-person) video benchmarks (such as VSI-Bench). These results show that the method not only teaches the model the specific skill of depth ordering but also effectively enhances its overall 3D spatial perception and reasoning.
Conclusion
Visual Jigsaw provides a new lightweight, verifiable, and annotation-free self-supervised post-training paradigm centered on vision, injecting new vitality into the visual perception of MLLMs. The team hopes that this work not only demonstrates the potential of the visual jigsaw task but also inspires the academic community to design more self/weakly supervised tasks focusing on visual information itself, enabling multimodal large models to better perceive and understand various types of visual information.
Paper link: https://arxiv.org/abs/2509.25190
Project homepage: https://penghao-wu.github.io/visual_jigsaw/
HF link for data and models: https://huggingface.co/collections/craigwu/visual-jigsaw-68d92d6aca580f3dc7e3cf36
Code repository link: https://github.com/penghao-wu/visual_jigsaw
This article is from the WeChat official account "QbitAI". Author: VisualJigsaw Team. Republished by 36Kr with permission.