Do VLMs always fail at geometry problems? GEODPO starts from "seeing": structured representations plus DPO optimization help the model perceive a figure before reasoning about it.
Are geometry problems really just "hard to reason about"?
In recent years, Vision-Language Models (VLMs) have made significant progress in multi-modal tasks such as image-text question answering, table understanding, and mathematical word problems.
However, when a problem involves a geometric figure, their performance often drops sharply.
Why?
Recently, a research team from Guangming Laboratory and Tsinghua University conducted an in-depth analysis of error cases of multiple mainstream models and observed a notable phenomenon:
The failures of current VLMs on geometry problems stem largely from errors in geometric perception, a core factor that existing research rarely analyzes on its own.
In other words, in many cases the model is not unable to reason. It has already gone wrong at an earlier stage: recognizing the structure of the figure.
Common problems include:
- Wrong identification of basic geometric elements (points, lines, circles)
- Missed detection of key structural relationships (collinearity, perpendicularity, tangency)
- Offsets in image grounding (elements localized at the wrong position)
- Identification of non-existent structures (structural hallucination)
These problems occur before reasoning but directly affect the subsequent logical chain.
GEOPERCEIVE: The First Independent Evaluation of Geometric Perception Ability
Existing geometric benchmarks usually adopt an end-to-end evaluation method:
Image + Question → Natural language answer
Such benchmarks judge only whether the final answer is correct.
As a result, perception errors and reasoning errors are counted together, which makes it hard to locate the actual bottleneck.
For this reason, the research team proposed GEOPERCEIVE, the first independent evaluation framework for geometric perception ability.
Previous benchmarks focused on: Whether the model "answers correctly".
GEOPERCEIVE focuses on: Whether the model "sees correctly".
Expressing Geometry with Programs: GeoDSL
The research team designed GeoDSL, a domain-specific language for geometry, to represent figures in structured form:
- Geometric elements: Point / Line / Circle
- Structural relationships: Collinear / Perpendicular / Tangent
- Topological and dependency constraints
Geometric figures are first generated automatically by programs and then rendered into images.
The model's natural-language output is translated into the same structured representation and matched exactly against the generating program.
This design brings two key advantages:
- Controllable generation of geometric structures with different complexities
- Accurate and automated structure-level scoring
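To make the idea of a structured representation concrete, here is a minimal Python sketch. It does not reproduce GeoDSL's actual syntax, which this article does not show; the `fact` helper and the way elements are encoded are assumptions made purely for illustration. The point is that once a figure is a set of canonical facts, two descriptions can be compared exactly, regardless of the order in which they were written.

```python
# Hypothetical encoding, NOT the real GeoDSL: each statement becomes a
# (kind, frozenset(args)) tuple so that argument order does not matter.
def fact(kind, *args):
    """Return a canonical, order-insensitive form of one geometric statement."""
    return (kind, frozenset(args))

# Elements: three points and two lines.
line_ab = fact("Line", "A", "B")
line_bc = fact("Line", "B", "C")

# A figure is a set of facts: elements plus structural relationships.
figure = {
    fact("Point", "A"),
    fact("Point", "B"),
    fact("Point", "C"),
    line_ab,
    line_bc,
    fact("Perpendicular", line_ab, line_bc),
}

# The same line written in a different order maps to the same fact.
same_line = fact("Line", "B", "A") == line_ab
```

This symmetric encoding only suits order-free statements such as lines through two points or perpendicularity. An element with distinct roles (for example a circle's center versus a point on the circle) would need named fields instead.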
Element-Level Structure Scoring
GEOPERCEIVE adopts:
- Structure parsing
- Hungarian matching
- Element-level F1 scoring
The evaluation granularity is refined from "whether the answer is correct" to whether each geometric element and each structural relationship is accurately recognized.
This enables the research team to precisely locate the model's ability bottleneck at the structural recognition level.
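The scoring steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's scorer: the Jaccard similarity between elements is an assumed choice, and the optimal one-to-one assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (it reaches the same optimum, but is only practical for small figures).

```python
from itertools import permutations

def similarity(a, b):
    """Assumed similarity: 0 across kinds, otherwise Jaccard overlap of args."""
    kind_a, args_a = a
    kind_b, args_b = b
    if kind_a != kind_b:
        return 0.0
    union = args_a | args_b
    if not union:
        return 1.0
    return len(args_a & args_b) / len(union)

def best_assignment_score(pred, gold):
    """Total similarity of the best one-to-one matching (brute force)."""
    small, large = (pred, gold) if len(pred) <= len(gold) else (gold, pred)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(similarity(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best

def element_f1(pred, gold):
    """Element-level F1 computed from the matched similarity mass."""
    if not pred or not gold:
        return 0.0
    matched = best_assignment_score(pred, gold)
    precision = matched / len(pred)
    recall = matched / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [
    ("Line", frozenset({"A", "B"})),
    ("Line", frozenset({"B", "C"})),
    ("Perpendicular", frozenset({"AB", "BC"})),
]
# The prediction gets one line right, misreads the other as CD,
# and recovers the perpendicular relationship.
pred = [
    ("Line", frozenset({"B", "A"})),
    ("Perpendicular", frozenset({"AB", "BC"})),
    ("Line", frozenset({"C", "D"})),
]
score = element_f1(pred, gold)
```

Here two elements match fully and the misread line earns partial credit of 1/3, so the score works out to 7/9. With an exact-match similarity (1 or 0) the same code would reduce to ordinary set-based F1.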
GEODPO: Structured Reinforcement Learning Optimization Path
After diagnosing the shortcoming in geometric perception, a natural question is:
How can structure-level optimization signals be introduced without damaging the model's natural-language ability?
Directly supervising the model to generate structured programs (SFT) easily leads to distribution shift and is highly sensitive to token order.
Therefore, the research team proposed:
GEODPO: Translator-Guided Reinforcement Learning
The overall process is as follows:
Natural language output
→ Specialized translator (NL → GeoDSL)
→ Structure-level precise scoring
→ Construction of preference pairs
→ DPO optimization
The model still outputs natural language, but the optimization signal comes from the structure matching score.
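The pair-construction step of this pipeline might look like the sketch below. The function name, the `min_gap` threshold, and the best-versus-worst selection rule are assumptions for illustration; the article does not say how pairs are chosen. `score_fn` stands in for the translator plus structure-level scorer, and the `prompt` / `chosen` / `rejected` field names follow a common convention for preference data.

```python
def build_preference_pair(prompt, responses, score_fn, min_gap=0.1):
    """Pick the best- and worst-scoring responses as a preference pair.

    score_fn is a stand-in for "translate NL to structure, then score it".
    Returns None when there are too few responses or the scores are too
    close to give a meaningful preference signal.
    """
    if len(responses) < 2:
        return None
    scored = sorted(((score_fn(r), r) for r in responses), key=lambda x: x[0])
    low, rejected = scored[0]
    high, chosen = scored[-1]
    if high - low < min_gap:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Toy scores standing in for translator + structure matching results.
toy_scores = {
    "AB is perpendicular to BC.": 1.0,
    "AB is parallel to BC.": 0.4,
    "There is a circle through A, B and C.": 0.2,
}
pair = build_preference_pair(
    "Describe the figure.", list(toy_scores), toy_scores.get
)
```

The model's responses stay in natural language throughout; only the ranking between them comes from the structural score.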
This method has three advantages:
- It does not change the model's output space
- The reward function is interpretable and computable
- The optimization goal is directly aligned with structure recognition ability
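For reference, the final DPO step consumes such preference pairs through a loss on log-probabilities. The sketch below implements the standard DPO objective for a single pair; whether GEODPO modifies this objective is not stated in this article, and `beta=0.1` is an assumed value. The inputs are the summed token log-probabilities of each response under the policy being trained and under a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    margin = beta * [(policy - reference) log-ratio of the chosen response
                     minus the same log-ratio of the rejected response].
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen)
        - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy already favors the chosen response more than the reference does.
aligned = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy favors the rejected response instead.
misaligned = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, so a structure-based ranking is enough to steer training without the model ever emitting structured programs.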
Experimental Observations
The research team conducted a systematic evaluation on multiple mainstream vision-language models.
Improvement in Geometric Perception Ability
- Multiple backbones achieved significant improvements
- Compared with direct SFT, GEODPO performs more stably
OOD Generalization Ability
On the out-of-distribution test set:
- GEODPO continues to show consistent improvement
- SFT shows performance fluctuations on some models
This suggests that structured rewards may be more stable under distribution shift.
Downstream Geometric Reasoning Tasks
On geometric reasoning benchmarks such as MathVista, the research team observed:
When the accuracy of structure recognition improves, overall reasoning performance often improves along with it.
This suggests that the quality of the underlying structural representation may be an important factor in geometric reasoning performance.
Summary
The research team proposed:
- GEOPERCEIVE: the first independent evaluation framework for geometric perception ability
- GEODPO: an optimization method based on structured rewards
By explicitly separating geometric structure recognition from end-to-end reasoning tasks, the research team can more clearly analyze the model's ability distribution in the "perception - reasoning" chain.
The experimental results show:
Geometric perception ability may be one of the important factors affecting geometric reasoning performance, and structured reinforcement learning provides a stable and interpretable optimization path.
More importantly, this work provides a research paradigm:
- Decompose complex abilities into independently evaluable sub-modules
- Replace fuzzy language matching with structured representation
- Guide the model's ability alignment with a computable reward function
Due to its highly structured nature, the geometric scenario provides an ideal entry point for studying the underlying representation ability of multi-modal models.
Similar ideas may be extended to:
- Engineering drawing analysis
- Scientific image understanding
- CAD structure recognition
- Medical structure modeling
As multi-modal models move towards more reliable structure understanding, geometry may be more than one type of task: it may be a key window into whether a model "truly understands structure".
Paper link: https://arxiv.org/pdf/2602.22703
This article is from the WeChat official account "QbitAI", author: Guangming Laboratory & Tsinghua University. Published by 36Kr with authorization.