
Manling Li, Fei-Fei Li, Jiajun Wu, and Others Join Forces: A New Paradigm for Evaluating Embodied Large Models

新智元 2026-03-04 15:51
New Intelligence Yuan: The Theory of Space, a paradigm for evaluating the spatial capabilities of embodied models, has been proposed, breaking through the limitations of traditional static Q&A.

[Introduction] Theory of Space, a new paradigm for evaluating the spatial abilities of embodied models, goes beyond traditional static image-text Q&A. It systematically examines whether a foundation model can, like humans, construct, revise, and use spatial beliefs through autonomous exploration in a partially observable, dynamic environment. The paper has been accepted to ICLR 2026.

Today's large multimodal models (such as GPT-5.2 and Gemini-3 Pro) have repeatedly broken records on visual Q&A leaderboards. However, when these capabilities are extended to real-world physical scenarios, the models may face significant challenges in spatial understanding. Why is this the case?

Imagine walking into an apartment you've never visited. You open the door and see a sofa, walk down the corridor and catch a glimpse of the bed in the bedroom, and then, further ahead, find the refrigerator in the kitchen. Now, if someone asks you, "In which direction is the sofa from the refrigerator?", you can usually answer, because you have quietly constructed a "mental map" along the way.

Most humans can do this without much thought. For current foundation models, the situation may be different. Researchers point to three gaps between existing evaluation paradigms and the demands of the real physical world:

  1. From a "God's-eye view" to "partial observability": Traditional benchmarks often provide static pictures that cover the entire scene. In real physical space, however, an agent's field of view is mostly local. The agent often needs to explore actively and piece together scattered first-person visual clues into a global "cognitive map."
  2. From "passive answering" to "active decision-making": Existing spatial evaluations usually hand the model a fixed set of observations. In a more open environment, the system may need to decide for itself where to explore and what to look for in order to gather environmental information efficiently.
  3. From "static common sense" to "dynamic revision": The physical environment can change (for example, objects may be moved). Beyond constructing a map, the agent may also need to update its old spatial memory promptly when it detects such changes.

Research teams led by Manling Li of Northwestern University, Fei-Fei Li and Jiajun Wu of Stanford University, and Ranjay Krishna of the University of Washington jointly proposed Theory of Space. They ask: when a foundation model can no longer rely on fully given information and must instead understand the environment through active exploration, how well does its spatial cognition hold up?

Paper link: https://arxiv.org/abs/2602.07055

Code: https://github.com/mll-lab-nu/Theory-of-Space

Project homepage: https://theory-of-space.github.io/

Dataset: https://huggingface.co/datasets/MLL-Lab/tos-data

Theory of Space: active exploration, belief probing, and task evaluation. The left panel shows, from a top-down view, the agent's trajectory through a multi-room environment under local observation; the middle panel shows the "move, turn, observe" cycle in the text or visual environment, in which the agent continuously updates its internal beliefs from first-person observations; the right panel evaluates how these beliefs are represented and used, through spatial tasks and cognitive map probes.

The "Theory of Mind" in spatial intelligence

In cognitive science, Theory of Mind examines whether an agent can infer the hidden mental states of others: "What is he thinking? Does he know this?" It focuses on the modeling of the invisible mental world.

Theory of Space, its counterpart in the physical world, examines whether an agent can infer the unobserved spatial structure of the environment: "What does this world look like? What's behind the door?" It focuses on modeling the invisible physical world.

What the two have in common is that the agent must infer hidden structure from limited clues and keep revising its beliefs as new information becomes available.

Researchers define Theory of Space as three tightly coupled core abilities:

  • Construct: Actively move through the fog of partial observability, collect local observations, and piece them together into a globally consistent "cognitive map" in the internal representation.
  • Revise: In a dynamic environment (for example, when objects are quietly moved), detect the conflict between "old memory" and "new evidence", overcome belief inertia, and update what is known (Belief Revision).
  • Exploit: Use the well-maintained cognitive map to handle complex downstream spatial reasoning tasks (such as spatial navigation and perspective taking).

The core of Theory of Space: in a partially observable environment, the agent carries out spatial reasoning and decision-making by constructing, dynamically revising, and using spatial beliefs.

Evaluating all three abilities: construction, revision, and utilization

Researchers designed a comprehensive evaluation system around the three core abilities of Theory of Space (Construct, Revise, Exploit) and introduced explicit cognitive map probing as the core contribution to directly diagnose the model's internal spatial beliefs.

Construct: Active exploration for map-building

Researchers provide two parallel environments built on procedurally generated multi-room indoor layouts: a text world (symbolic directions and distances) and a visual world (first-person RGB images rendered by ThreeDWorld). The agent must decide for itself how to move, rotate, and observe in order to construct spatial beliefs efficiently. Intuitively, such tasks may seem to be just about "looking more." More importantly, though, the agent needs to let uncertainty drive its actions so that each step acquires information efficiently.
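To make "uncertainty drives actions" concrete, here is a minimal sketch of one possible exploration heuristic: always visit the viewpoint that would reveal the most unseen cells. This is an illustration only, not the benchmark's agent or the paper's method; the viewpoint names, the cell ids, and the function `greedy_explore` are all hypothetical.

```python
def greedy_explore(candidates):
    """Greedily visit the viewpoint that reveals the most unseen cells.

    candidates: dict mapping a viewpoint name to the set of cell ids
    visible from it (a hypothetical, simplified visibility model).
    Returns (ordered list of visited viewpoints, set of observed cells).
    """
    known = set()
    plan = []
    remaining = dict(candidates)
    while remaining:
        # "Information gain" here is simply the count of not-yet-seen cells.
        best = max(remaining, key=lambda v: len(remaining[v] - known))
        if not remaining[best] - known:
            break  # every remaining viewpoint would be redundant
        known |= remaining.pop(best)
        plan.append(best)
    return plan, known


# Toy layout (made up for illustration).
views = {
    "door": {1, 2, 3},
    "corridor": {3, 4, 5, 6},
    "hall": {4, 5},       # fully covered by "corridor", so never worth a step
    "kitchen": {6, 7},
}
plan, seen = greedy_explore(views)
print(plan)  # ['corridor', 'door', 'kitchen']
```

The redundant "hall" viewpoint is skipped entirely, which is the behavior the article contrasts with simply "looking more."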

Revise: Update outdated beliefs in a dynamic environment

Borrowing from the classic "False Belief" paradigm in developmental psychology: after the agent completes its initial exploration, several objects are secretly moved or rotated to create a conflict between the "old belief" and the "new reality". Can the agent detect the change, discard the old memory, and establish new beliefs?
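The revision step can be sketched as a comparison between stored beliefs and fresh observations. The snippet below is a simplified illustration under assumed names (`ObjectBelief`, `revise`, and a grid coordinate system are all hypothetical); it is not how the benchmark or any evaluated model implements belief revision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectBelief:
    x: int        # grid column (hypothetical coordinate system)
    y: int        # grid row
    facing: str   # one of "north", "east", "south", "west"


def revise(beliefs, observations):
    """Overwrite stale beliefs with new observations.

    Returns (updated_beliefs, names of objects whose state changed).
    """
    updated = dict(beliefs)
    changed = []
    for name, seen in observations.items():
        old = beliefs.get(name)
        if old is not None and old != seen:
            changed.append(name)  # conflict: old memory vs. new evidence
        updated[name] = seen      # new evidence always wins in this sketch
    return updated, sorted(changed)


old_beliefs = {
    "sofa": ObjectBelief(1, 2, "north"),
    "lamp": ObjectBelief(4, 0, "east"),
}
# The sofa was moved while the agent was not looking.
new_observations = {
    "sofa": ObjectBelief(3, 2, "north"),
    "lamp": ObjectBelief(4, 0, "east"),
}
beliefs, changed = revise(old_beliefs, new_observations)
print(changed)  # ['sofa']
```

A model showing belief inertia would, in these terms, keep reporting the old sofa position even after the conflicting observation.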

Exploit: Nine types of spatial reasoning tasks

The tasks cover two levels, Route (route reasoning) and Survey (bird's-eye-view map reasoning), to evaluate comprehensively how useful the spatial beliefs are.

Overview of downstream spatial tasks

Core contribution: Explicit cognitive map probing

Previous evaluations only looked at the final right or wrong, and the internal beliefs were a black box. Researchers introduced Explicit Cognitive Map Probing: At each step, the model is required to externalize its spatial beliefs in JSON format to measure accuracy, perception quality, stability, and uncertainty modeling. Not only do we know whether the model answers correctly, but also why it answers correctly or incorrectly.
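As a rough illustration of what probing an externalized map could look like, the sketch below parses a JSON map and scores it against ground truth. The schema (an "objects" dictionary with "position" and "facing" fields), the distance tolerance, and the function `score_map` are assumptions made for this example; the article does not specify the benchmark's actual JSON format or metrics.

```python
import json
import math


def score_map(predicted_json, truth, tolerance=1.0):
    """Score a JSON cognitive map against ground truth.

    A position counts as correct if it lies within `tolerance` units of
    the true position; a facing counts as correct on exact match.
    Objects missing from the prediction count as wrong on both.
    """
    predicted = json.loads(predicted_json)["objects"]
    pos_hits = 0
    facing_hits = 0
    for name, gt in truth.items():
        guess = predicted.get(name)
        if guess is None:
            continue
        if math.dist(guess["position"], gt["position"]) <= tolerance:
            pos_hits += 1
        if guess.get("facing") == gt["facing"]:
            facing_hits += 1
    n = len(truth)
    return {"position_acc": pos_hits / n, "facing_acc": facing_hits / n}


# Toy ground truth and toy model output (both made up for illustration).
truth = {
    "sofa": {"position": [1, 2], "facing": "north"},
    "fridge": {"position": [5, 5], "facing": "east"},
}
model_output = json.dumps({
    "objects": {
        "sofa": {"position": [1, 2.5], "facing": "north"},
        "fridge": {"position": [8, 5], "facing": "west"},
    }
})
result = score_map(model_output, truth)
print(result)  # {'position_acc': 0.5, 'facing_acc': 0.5}
```

Scoring position and facing separately mirrors the article's point that a probe can show why an answer is wrong, not just that it is wrong.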

Where exactly does the large model's understanding of space get stuck?

Researchers ran large-scale, in-depth evaluations on six frontier large multimodal models, including GPT-5.2, Gemini-3 Pro, and Claude-4.5 Sonnet. Through white-box probing, they mapped out the limits of current models' spatial cognition:

Discovery 1: Active information acquisition is the Achilles' heel of embodied intelligence

When the model decides "what to look at" on its own, the performance drops significantly.

To separate "exploration ability" from "reasoning ability", researchers designed scripted, rule-based proxy agents as an exploration baseline: in the visual world, the agent performs a 360° scan at each position to ensure full coverage; in the text world, it follows a belief-driven strategy that eliminates as much ambiguity as possible. In the passive mode, the model reasons over the observation logs collected by these agents; in the active mode, it must plan its own exploration.

The results are striking: in the visual world, GPT-5.2 dropped from 57.1 in the passive mode to 46.0 in the active mode, and Gemini-3 Pro dropped from 60.5 to 57.3. In terms of efficiency, the rule-based agent needs only about 9 steps to reach the target coverage, while the foundation models often need more than 14 steps without any improvement in belief quality. The models "explore a lot" but "explore poorly", with high redundancy and low efficiency. As environmental complexity increases, this gap widens further.
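The size of these drops is easy to check. The short calculation below uses only the four scores quoted above and reports each drop in absolute points and as a percentage of the passive score; the helper name `drop` is just for this example.

```python
# Passive vs. active scores in the visual world, as quoted in the article.
scores = {
    "GPT-5.2": (57.1, 46.0),
    "Gemini-3 Pro": (60.5, 57.3),
}


def drop(passive, active):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = round(passive - active, 1)
    relative = round(100 * (passive - active) / passive, 1)
    return absolute, relative


drops = {name: drop(*pair) for name, pair in scores.items()}
for name, (absolute, relative) in drops.items():
    print(f"{name}: -{absolute} points ({relative}% of the passive score)")
# GPT-5.2: -11.1 points (19.4% of the passive score)
# Gemini-3 Pro: -3.2 points (5.3% of the passive score)
```

So GPT-5.2 loses roughly a fifth of its passive score when it has to explore on its own, while Gemini-3 Pro loses about a twentieth.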

Task accuracy vs. active exploration cost. The gray icons represent the passive mode. The exploration efficiency and task accuracy of the agent in the active exploration mode are lower than those in the passive mode.

There is a gap between active and passive exploration in both text and visual modalities.

Discovery 2: Modality gap

Text reasoning is far stronger than visual reasoning for all models without exception.

In both the passive and the active setting, the models perform consistently and significantly better in the text environment than in the visual environment. This points to a fundamental limitation of current multimodal models in spatial perception: they struggle to extract spatial information from visual observations and rely heavily on symbolic representations to reason about key relationships.

There is a huge performance gap between the visual and text modalities in both the passive mode and active exploration.

Discovery 3: Triple crisis of the cognitive map

Through cognitive map probing, researchers further identified three problems: orientation perception is the bottleneck (judgments of object orientation in the visual world are close to random); beliefs are unstable (correctly perceived information degrades over time); and beliefs drift (new, incorrect updates overwrite previously correct perceptions). In other words, the problem is often not that the model "fails to see" but that it "fails to remember" or "remembers incorrectly".
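One simple way to quantify this kind of instability is to count how often a belief that was correct at one step becomes incorrect at the next. The sketch below does exactly that over a per-step correctness log. It is an illustrative measure with a hypothetical name (`count_regressions`) and made-up data, not the stability metric defined in the paper.

```python
def count_regressions(history):
    """Count correct-to-incorrect flips between consecutive steps.

    history: list of dicts, one per step, mapping object name to True
    if the model's belief about that object was correct at that step.
    An object that disappears from the map counts as incorrect.
    """
    regressions = 0
    for prev, curr in zip(history, history[1:]):
        for name, was_correct in prev.items():
            if was_correct and not curr.get(name, False):
                regressions += 1
    return regressions


# Toy probe log (made up): the sofa is perceived correctly at first,
# then a later incorrect update overwrites it.
history = [
    {"sofa": True, "lamp": False},
    {"sofa": True, "lamp": True},
    {"sofa": False, "lamp": True},
]
print(count_regressions(history))  # 1
```

A model that only "fails to see" would show beliefs that start wrong and stay wrong, which this count ignores; it registers only information that was once right and then lost.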

Discovery 4: The cognitive map is an effective diagnostic tool

Researchers verified the effectiveness of the cognitive map as a diagnostic tool through ablation experiments:

Sufficiency verification: When the model is given the ground-truth cognitive map, accuracy on downstream tasks jumps to ~95%, indicating that the JSON map format captures essentially all the information needed to complete the tasks.

Relevance verification: Cognitive map accuracy correlates positively and significantly with downstream task performance (Pearson r = 0.42 to 0.65, p < 0.001).

Although the externalized map is a lossy compression of the model's internal beliefs, it is still a powerful diagnostic signal.
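For readers unfamiliar with the statistic used in the relevance check, the sketch below computes a Pearson correlation from scratch. The two lists are toy numbers invented for this example, not data from the paper, and the function name `pearson_r` is just illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values (NOT from the paper): map accuracy and task accuracy
# for five hypothetical evaluation runs.
map_accuracy = [0.2, 0.3, 0.4, 0.5, 0.6]
task_accuracy = [0.3, 0.5, 0.4, 0.7, 0.6]
r = pearson_r(map_accuracy, task_accuracy)
print(round(r, 2))  # 0.8
```

A value of r between 0.42 and 0.65, as reported, means that better maps tend to go with better task performance, but the map accuracy alone does not fully determine the task score.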

Discovery 5: Belief inertia