
New SOTA for complex spatial reasoning with a 55% performance boost: SpatialDreamer, a new work from Sun Yat-sen University

New Intelligence Yuan (新智元), 2025-12-22 18:10
SpatialDreamer improves performance on spatial intelligence tasks through active imagination and reasoning.

[Introduction] Institutions including Sun Yat-sen University have introduced SpatialDreamer, which significantly improves performance on complex spatial tasks through active mental imagination and spatial reasoning. By simulating the human process of active exploration, imagination, and reasoning, it addresses the limitations of existing models in tasks such as perspective transformation and opens a new path for spatial intelligence in artificial intelligence.

Although multimodal large language models (MLLMs) have made significant progress in scene understanding, their performance remains limited in complex spatial reasoning tasks that require mental simulation.

Most existing methods rely on passive observation of spatial data. They lack the active imagination and dynamic updating of internal representations that characterize human spatial cognition.

For example, in tasks that require a change of perspective to locate an occluded object, existing models often fail because they reason from a single viewpoint.

To address this, a research team from MBZUAI and Sun Yat-sen University proposed SpatialDreamer, a reinforcement learning-based framework aiming to endow MLLMs with human-like spatial mental simulation capabilities through a closed-loop process of active exploration, visual imagination, and evidence fusion.

Paper link: https://arxiv.org/pdf/2512.07733

SpatialDreamer simulates the human spatial cognition process and constructs a closed-loop reasoning process consisting of the following three steps:

1) Exploration: The model infers the optimal egocentric action (such as "move forward 0.75 meters" or "turn left 45 degrees") based on the current scene;

2) Imagination: The model invokes a world model (such as SVC) to generate the image that would be seen from the new perspective after the action is executed;

3) Reasoning: Integrate all accumulated visual evidence to generate the final answer.

This process enables the model to shift from "passive observation" to "active goal-oriented imagination", allowing it to autonomously decide "where to look, what to look at, and how to reason" in the internal three-dimensional environment.
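The three-step loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (spatial_loop, propose_action, imagine, answer), the Evidence container, and the toy stand-ins for the policy and the world model are all assumptions made for this example, and images are represented by plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# All names here are hypothetical; the article does not describe an API.
Action = Tuple[str, float]   # e.g. ("forward", 0.75) or ("turn_left", 45.0)
View = str                   # stand-in for an image


@dataclass
class Evidence:
    views: List[View] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)


def spatial_loop(
    question: str,
    initial_view: View,
    propose_action: Callable[[str, Evidence], Optional[Action]],
    imagine: Callable[[View, Action], View],
    answer: Callable[[str, Evidence], str],
    max_steps: int = 3,
) -> Tuple[str, Evidence]:
    evidence = Evidence(views=[initial_view])
    for _ in range(max_steps):
        # 1) Exploration: pick an egocentric action, or None to stop exploring.
        action = propose_action(question, evidence)
        if action is None:
            break
        # 2) Imagination: the world model renders the view after that action.
        new_view = imagine(evidence.views[-1], action)
        evidence.actions.append(action)
        evidence.views.append(new_view)
    # 3) Reasoning: answer from all accumulated views.
    return answer(question, evidence), evidence


# Toy stand-ins so the sketch runs end to end.
def toy_policy(question: str, evidence: Evidence) -> Optional[Action]:
    plan = [("turn_left", 45.0), ("forward", 0.75)]
    step = len(evidence.actions)
    return plan[step] if step < len(plan) else None


def toy_world_model(view: View, action: Action) -> View:
    name, amount = action
    return f"{view}->{name}({amount})"


def toy_answer(question: str, evidence: Evidence) -> str:
    return f"answered from {len(evidence.views)} views"


result, trace = spatial_loop(
    "Where is the chair?", "view0", toy_policy, toy_world_model, toy_answer
)
print(result)
```

In a real system the policy and the answer function would be the same MLLM, and the world model would return an image rather than a string; the control flow is the part this sketch is meant to show.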

To address the problem of sparse rewards in long-sequence reasoning tasks, the research team proposed GeoPO, a policy optimization method that combines a tree-based sampling structure with geometric consistency constraints:

1) Tree-based sampling: Sample multiple action branches at each step to support backtracking and multi-path exploration;

2) Multi-level reward design: Combine task-level rewards with step-level rewards to provide fine-grained feedback;

3) Geometric penalty mechanism: Apply a penalty coefficient (such as 0.9) to redundant or conflicting actions (such as consecutive forward and backward movements) to encourage efficient trajectories.
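As a rough illustration of how the multi-level reward and the geometric penalty could fit together, the sketch below scales a step reward by 0.9 when an action undoes the previous one, then adds the averaged step rewards to the task reward. The conflict rule (the OPPOSITE table), the step_weight parameter, and the way the two reward levels are combined are assumptions made for this example; the article gives only the penalty coefficient.

```python
from typing import List, Tuple

Action = Tuple[str, float]

# Assumed pairs of mutually cancelling moves; not taken from the paper.
OPPOSITE = {
    "forward": "backward",
    "backward": "forward",
    "turn_left": "turn_right",
    "turn_right": "turn_left",
}


def is_conflicting(prev: Action, cur: Action) -> bool:
    """True when cur reverses the direction of prev (hypothetical rule)."""
    return OPPOSITE.get(prev[0]) == cur[0]


def trajectory_reward(
    actions: List[Action],
    step_rewards: List[float],
    task_reward: float,
    penalty: float = 0.9,      # coefficient mentioned in the article
    step_weight: float = 0.5,  # assumed weighting between the two levels
) -> float:
    if len(actions) != len(step_rewards):
        raise ValueError("one step reward is required per action")
    shaped = []
    for i, reward in enumerate(step_rewards):
        # Scale down the reward of a step that conflicts with the previous one.
        if i > 0 and is_conflicting(actions[i - 1], actions[i]):
            reward *= penalty
        shaped.append(reward)
    step_term = sum(shaped) / len(shaped) if shaped else 0.0
    return task_reward + step_weight * step_term


efficient = trajectory_reward(
    [("turn_left", 45.0), ("forward", 0.75)], [1.0, 1.0], task_reward=1.0
)
wasteful = trajectory_reward(
    [("forward", 0.75), ("backward", 0.75)], [1.0, 1.0], task_reward=1.0
)
print(efficient, wasteful)
```

With identical raw rewards, the trajectory that moves forward and then straight back scores lower than the one that turns and advances, which is the pressure towards efficient trajectories that the penalty is meant to create.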

GeoPO not only improves the model's performance but also significantly speeds up training convergence.
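The tree-based sampling used by GeoPO can be pictured as expanding several candidate actions at every step, so that sibling trajectories share a common prefix. The sketch below is a minimal version under stated assumptions: sample_tree, the branching and depth parameters, and the deterministic toy_sampler are hypothetical names for this example, and a real policy would sample actions stochastically from the model.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[str, float]
Trajectory = Tuple[Action, ...]


def sample_tree(
    sample_actions: Callable[[Trajectory, int], Sequence[Action]],
    branching: int,
    depth: int,
) -> List[Trajectory]:
    """Expand `branching` actions per node for `depth` steps; return all leaf paths."""
    frontier: List[Trajectory] = [()]
    for _ in range(depth):
        next_frontier: List[Trajectory] = []
        for prefix in frontier:
            # Each partial trajectory branches into several continuations.
            for action in sample_actions(prefix, branching):
                next_frontier.append(prefix + (action,))
        frontier = next_frontier
    return frontier


def toy_sampler(prefix: Trajectory, k: int) -> Sequence[Action]:
    # Deterministic stand-in for sampling k actions from the policy.
    candidates = [("forward", 0.75), ("turn_left", 45.0), ("turn_right", 45.0)]
    return candidates[:k]


paths = sample_tree(toy_sampler, branching=2, depth=3)
print(len(paths))
```

Because sibling paths differ only in their last action, their rewards can be compared at the step level, which is one plausible reading of how the tree structure supports fine-grained feedback.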

To further guide the model towards the "think - imagine - answer" pattern, the team built the SpatialDreamer-SFT dataset, which contains single-pass reasoning data and reflective reasoning data. The reflective data is constructed through "error injection → self-correction → reconstruction of the reasoning chain".

Experimental Results

The research team verified the effectiveness of SpatialDreamer on multiple spatial reasoning benchmarks:

1) SAT: Achieved state-of-the-art (SOTA) results on both real and synthetic images, with average accuracies of 93.9% and 92.5% respectively;

2) MindCube-Tiny: The overall accuracy was 84.9%, more than 55% higher than the baseline Qwen2.5-VL-7B;

3) VSI-Bench: Led across tasks such as object counting, relative direction, and path planning, with an average accuracy of 62.2%.
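The article does not state whether the "more than 55%" gain on MindCube-Tiny is measured in percentage points or as a relative improvement. The short calculation below shows the upper bound on the baseline accuracy that each reading would imply; it uses only the 84.9% and 55% figures quoted above and does not report the actual baseline score.

```python
accuracy = 84.9   # SpatialDreamer on MindCube-Tiny, as reported in the article
gain = 55.0       # the "more than 55%" figure, read two ways

# Reading 1: 55 percentage points, so the baseline is below accuracy - 55.
baseline_if_points = round(accuracy - gain, 1)

# Reading 2: 55% relative gain, so the baseline is below accuracy / 1.55.
baseline_if_relative = round(accuracy / (1 + gain / 100), 1)

print(baseline_if_points, baseline_if_relative)
```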

Towards General Intelligence with Spatial Imagination

The significance of SpatialDreamer lies not only in higher spatial reasoning accuracy but, more importantly, in demonstrating that MLLMs can strengthen their reasoning through "imagination", an important step towards human-like spatial intelligence.

Reference: https://arxiv.org/pdf/2512.07733 

This article is from the WeChat official account "New Intelligence Yuan", edited by LRST. Republished by 36Kr with permission.