Do VLMs Always Fail at Geometry Problems? GEODPO Starts from "Seeing": Structured Representation + DPO Optimization So Models Understand Before They Reason

Using structured reinforcement learning to help VLMs "understand" geometry

Are geometry problems really just "hard to reason about"?

In recent years, Vision-Language Models (VLMs) have made significant progress in multi-modal tasks such as image-text question answering, table understanding, and mathematical word problems.

However, when the problems involve geometric figures, their performance often drops significantly.

Why?

Recently, a research team from Guangming Laboratory and Tsinghua University conducted an in-depth analysis of error cases of multiple mainstream models and observed a notable phenomenon:

The failures of current VLMs on geometry problems stem largely from errors in geometric perception, a core factor that existing research has rarely analyzed separately and systematically.

In other words, in many cases the model is not incapable of reasoning; it has already gone wrong at an earlier stage, namely recognizing the structure of the figure.

Common problems include:

Misidentification of basic geometric elements (points, lines, circles)
Missed detection of key structural relationships (collinearity, perpendicularity, tangency)
Offsets in image grounding (elements localized in the wrong place)
Identification of non-existent structures (structural hallucination)

These problems occur before reasoning but directly affect the subsequent logical chain.

GEOPERCEIVE: The First Independent Evaluation of Geometric Perception Ability

Existing geometric benchmarks usually adopt an end-to-end evaluation method:

Image + Question → Natural language answer

They judge only "whether the answer is correct".

This conflates perceptual errors with reasoning errors, making it difficult to locate the ability bottleneck.

For this reason, the research team proposed GEOPERCEIVE.

This is the first independent evaluation framework for geometric perception ability.

Previous benchmarks focused on: Whether the model "answers correctly".

GEOPERCEIVE focuses on: Whether the model "sees correctly".

Expressing Geometry with Programs: GeoDSL

The research team designed GeoDSL, a domain-specific language for geometry, to represent figures in structured form:

Geometric elements: Point / Line / Circle
Structural relationships: Collinear / Perpendicular / Tangent
Topological and dependency constraints

Geometric figures are first automatically generated by programs and then rendered into images.

The model's natural-language output is translated into the same structured representation and matched precisely against the reference structure.

This design brings two key advantages:

Controllable generation of geometric structures with different complexities
Accurate and automated structure-level scoring
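
To make the idea of a structured representation concrete, here is a minimal Python sketch. The article does not show GeoDSL's actual syntax, so the `Fact` class, the `Kind(arg, ...)` line format, and the `ORDER_FREE` set are all hypothetical stand-ins chosen for illustration. The point is only that once a figure is written as canonical facts, comparing two descriptions becomes a set operation rather than fuzzy text matching.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a GeoDSL fact; the real syntax is defined in the paper.
@dataclass(frozen=True)
class Fact:
    kind: str    # e.g. "Point", "Line", "Circle", "Collinear"
    args: tuple  # names of the elements involved

# Assumption for this sketch: argument order carries no meaning for these kinds.
ORDER_FREE = {"Line", "Collinear", "Perpendicular", "Tangent"}

def canonical(fact: Fact) -> Fact:
    """Sort arguments of order-free facts so equal structures compare equal."""
    if fact.kind in ORDER_FREE:
        return Fact(fact.kind, tuple(sorted(fact.args)))
    return fact

def parse(program: str) -> set:
    """Parse lines such as 'Collinear(B, A, C)' into a set of canonical facts."""
    facts = set()
    for line in program.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        kind, _, rest = line.partition("(")
        args = tuple(a.strip() for a in rest.rstrip(")").split(",") if a.strip())
        facts.add(canonical(Fact(kind.strip(), args)))
    return facts

reference = parse("""
Point(A)
Point(B)
Point(C)
Line(A, B)
Collinear(A, B, C)
""")
prediction = parse("Collinear(C, A, B)\nLine(B, A)")

# Every predicted fact appears in the reference, despite different argument order.
print(prediction <= reference)  # True
```

Canonicalizing before comparison is one simple way to avoid penalizing a description for listing the same points in a different order.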

Element-Level Structure Scoring

GEOPERCEIVE adopts:

Structure parsing
Hungarian matching
Element-level F1 scoring

The evaluation granularity is refined from "whether the answer is correct" to:

Whether each geometric element and each structural relationship are accurately recognized.

This enables the research team to precisely locate the model's ability bottleneck at the structural recognition level.
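
The scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the similarity function (Jaccard overlap of arguments for elements of the same kind) and the soft precision/recall definition are assumptions, and a brute-force search over assignments stands in for the Hungarian algorithm. Brute force finds the same optimal one-to-one matching on tiny inputs; a real evaluator would use a polynomial-time solver such as SciPy's `linear_sum_assignment`.

```python
from itertools import permutations

def similarity(a, b):
    """Score in [0, 1]: 0 if kinds differ, else Jaccard overlap of arguments.
    (Assumed similarity; the article does not specify one.)"""
    (a_kind, a_args), (b_kind, b_args) = a, b
    if a_kind != b_kind:
        return 0.0
    x, y = set(a_args), set(b_args)
    return len(x & y) / len(x | y) if x | y else 1.0

def best_matching_score(preds, refs):
    """Total similarity of the best one-to-one assignment.
    Brute force replaces the Hungarian algorithm; fine only for small inputs."""
    if not preds or not refs:
        return 0.0
    # similarity() is symmetric, so we can always permute the longer list.
    short, long_ = (preds, refs) if len(preds) <= len(refs) else (refs, preds)
    best = 0.0
    for perm in permutations(range(len(long_)), len(short)):
        total = sum(similarity(short[i], long_[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best

def element_f1(preds, refs):
    """Element-level F1 from the matched similarity mass."""
    matched = best_matching_score(preds, refs)
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(refs) if refs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

refs = [
    ("Point", ("A",)),
    ("Point", ("B",)),
    ("Line", ("A", "B")),
    ("Perpendicular", ("AB", "CD")),
]
preds = [
    ("Point", ("A",)),                # exact match
    ("Line", ("A", "B")),             # exact match
    ("Perpendicular", ("AB", "EF")),  # partial match: one argument wrong
]
print(round(element_f1(preds, refs), 3))  # 0.667
```

In this example the score is pulled down both by the missed point B (recall) and by the partly wrong perpendicularity (precision), which is the kind of per-element diagnosis an answer-only metric cannot give.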

GEODPO: Structured Reinforcement Learning Optimization Path

After diagnosing the shortcoming in geometric perception, a natural question is:

How can structure-level optimization signals be introduced without damaging the model's natural-language ability?

Directly supervising the model to generate structured programs through supervised fine-tuning (SFT) easily leads to distribution shift and is highly sensitive to token order.

Therefore, the research team proposed:

GEODPO: Translator-Guided Reinforcement Learning

The overall process is as follows:

Natural language output

→ Specialized translator (NL → GeoDSL)

→ Structure-level precise scoring

→ Construction of preference pairs

→ DPO optimization

The model still outputs natural language, but the optimization signal comes from the structure matching score.
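
The last two steps of the pipeline above can be sketched in a few lines of Python. The pairing rule here (best versus worst response per prompt, kept only if the score gap reaches a hypothetical `min_margin`) is an assumption for illustration; the article does not describe how the team selects pairs. The `dpo_loss` function is the standard per-pair DPO objective computed from summed sequence log-probabilities, with `beta` as the usual temperature hyperparameter.

```python
import math

def build_preference_pairs(samples, min_margin=0.1):
    """samples: iterable of (prompt, response, structure_score).
    Assumed rule: per prompt, pair the highest- and lowest-scoring responses
    and keep the pair only if the score gap is at least min_margin."""
    by_prompt = {}
    for prompt, response, score in samples:
        by_prompt.setdefault(prompt, []).append((score, response))
    pairs = []
    for prompt, scored in by_prompt.items():
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        if best[0] - worst[0] >= min_margin:
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Scores would come from the translator plus structure-level matching.
samples = [
    ("img1", "Points A, B, C lie on one line.", 0.9),
    ("img1", "A and B are on a line; C is off it.", 0.4),
    ("img2", "Circle O is tangent to line l.", 0.80),
    ("img2", "Circle O touches line l.", 0.75),
]
pairs = build_preference_pairs(samples)
print(len(pairs))                               # 1 (img2 gap is below the margin)
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))   # 0.6931, i.e. log 2
```

Note that the responses stay in natural language throughout; the structure score is used only to decide which response is preferred.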

This method has three advantages:

It does not change the model's output space

The reward function is interpretable and computable

The optimization goal is directly aligned with the structure recognition ability

Experimental Observations

The research team conducted a systematic evaluation on multiple mainstream vision-language models.

Improvement in Geometric Perception Ability

Multiple backbones achieved significant improvements

Compared with direct SFT, GEODPO performs more stably

OOD Generalization Ability

On the out-of-distribution test set:

GEODPO continues to improve consistently
SFT shows performance fluctuations on some models

This suggests that structured rewards may have better stability in distribution shift scenarios.

Downstream Geometric Reasoning Tasks

On geometric reasoning benchmarks such as MathVista, the research team observed:

When the accuracy of structure recognition improves, overall reasoning performance often improves along with it.

This phenomenon indicates that the quality of the underlying structure representation may be one of the important factors affecting geometric reasoning performance.

Summary

The research team proposed:

GEOPERCEIVE - The first independent evaluation framework for geometric perception ability

GEODPO - An optimization method based on structured rewards

By explicitly separating geometric structure recognition from end-to-end reasoning tasks, the research team can analyze more clearly how a model's abilities are distributed along the "perception-reasoning" chain.

The experimental results show:

Geometric perception ability may be one of the important factors affecting geometric reasoning performance, and structured reinforcement learning provides a stable and interpretable optimization path.

More importantly, this work provides a research paradigm:

Decompose complex abilities into independently evaluable sub-modules
Replace fuzzy language matching with structured representation
Guide the model's ability alignment with a computable reward function

Due to its highly structured nature, the geometric scenario provides an ideal entry point for studying the underlying representation ability of multi-modal models.

Similar ideas may be extended to:

Engineering drawing analysis
Scientific image understanding
CAD structure recognition
Medical structure modeling

As multi-modal models move toward more reliable structural understanding, geometry may be more than just another type of task: it may be a key window into whether a model "truly understands structure".

Paper link: https://arxiv.org/pdf/2602.22703

This article is from the WeChat official account "QbitAI", author: Guangming Laboratory & Tsinghua University. Published by 36Kr with authorization.

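The views expressed in this article are solely those of the author; the 36Kr platform only provides information storage services.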
