
Does more reasoning lead to more severe hallucinations? The "hallucination paradox" of multimodal reasoning models

新智元 | 2025-06-26 07:05
Multimodal reasoning models hallucinate more as their reasoning chains grow longer; a new metric is proposed to evaluate and balance this trade-off.

[Introduction] Do multimodal reasoning models really "understand better the more they think"? Research shows that as the reasoning chains of R1-series models lengthen, their visual perception ability tends to decline: the generated content sometimes drifts away from the image itself, producing the hallucination phenomenon of "seeing" things that are not there. Gains in reasoning ability are accompanied, to some extent, by weakened visual grounding, a tendency of "more reasoning, more hallucination". This has prompted researchers to think carefully about how to dynamically balance perception and reasoning in multimodal reasoning models: as a model keeps pursuing greater reasoning depth, is it also losing its visual anchor to the real world?

Amid the rapid development of large multimodal models, the R1 series of multimodal reasoning models has repeatedly broken through the performance ceilings of the traditional "fast thinking" paradigm on complex tasks, thanks to its explicit long-chain reasoning mechanism.

However, research has found that as the reasoning chain lengthens, the visual perception ability of such models shows a clear downward trend: they gradually rely on language priors to "fill in the blanks", the generated content deviates more and more from the image itself, and outright fabrication can even occur.

This paradox of "stronger reasoning, weaker perception" highlights the balancing challenge current multimodal reasoning models face between reasoning ability and perceptual accuracy.

To further verify this phenomenon, a research team from the University of California, Santa Cruz, the University of California, Santa Barbara, and Stanford University conducted a systematic analysis.

By introducing a reasoning-length control mechanism and an interpretable attention-visualization method, the researchers found that as the reasoning chain extends, the model's attention to image content drops significantly while its reliance on language prompts keeps growing, pointing to a language-dominated drift away from the visual input.

Paper link: https://arxiv.org/pdf/2505.21523

Project link: https://mlrm-halu.github.io

Code link: https://github.com/MLRM-Halu/MLRM-Halu

On this basis, the team proposed a new evaluation metric, RH-AUC, and built a companion diagnostic benchmark, RH-Bench, which for the first time systematically quantifies how well multimodal reasoning models balance reasoning ability against the stability of visual perception.

The tool not only makes a model's hallucination risk easier to measure, but also provides an important reference for evaluating and improving the robustness of future multimodal systems.

How enhanced reasoning amplifies visual hallucinations

In the evolution of today's large multimodal models, R1-type reasoning models have shown strong expressive ability on complex tasks thanks to the introduction of an explicit long-chain language reasoning process (reasoning chain).

However, the researchers systematically observed a widely overlooked phenomenon: as the reasoning chain grows longer, the model's visual grounding in perception tasks degrades significantly, and the risk of hallucination rises accordingly.

This trend was clearly observed in multiple empirical comparisons.

For example, in Figure (b), the researchers compared several 7B-scale multimodal models on reasoning and perception tasks: although models such as R1-OneVision-7B hold an advantage in reasoning accuracy, their accuracy on perception tasks falls to the lowest level, markedly below that of non-reasoning models of the same scale (such as Qwen2.5-VL-7B).

This shows that deepening the reasoning chain is not a free upgrade: it comes at the cost of sacrificing image perception and amplifying hallucinations.

Specifically, as the model gradually extends its language chain in image-text tasks, the image evidence that should support the answer is quietly pushed to the margins.

Take a typical visual question answering task: the lengthy output generated by a reasoning model often does not actually refer to the image content, but relies on linguistic common sense to "fill in" an answer that sounds plausible yet does not exist in the image. This pattern recurs across multiple perception benchmarks (such as MMVP and MMHAL).

As the figure shows, in a comprehensive evaluation across multiple visual perception tasks, R1-type models generally score below base models of the same scale, and the gap is especially pronounced on MMHAL and MMVP, which demand fine-grained image alignment.

This further confirms that strengthening the reasoning chain does not improve perception quality; on the contrary, it exacerbates the model's tendency to "answer without looking at the image".

In summary, the enhancement of the reasoning chain is not without cost. A "smarter" reasoning model may "see less" in perception - related tasks.

The smarter, the more likely to make mistakes?

To understand why multimodal reasoning models are more prone to hallucination, the research team systematically analyzed the models' internal attention distributions and uncovered a structural mechanism: enhanced reasoning is not a free lunch; it trades visual attention for gains in language reasoning ability.

Specifically, compared with non-reasoning models, R1-type reasoning models allocate significantly less attention to visual tokens during generation and instead devote much of it to instruction tokens and the language context (Figure a).

More importantly, this attention shift is not a fixed bias; it intensifies layer by layer as the reasoning chain extends: the deeper the layer, the more the model tends to ignore the image input and rely entirely on language signals for reasoning.
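To make this concrete, below is a minimal sketch of how such a layer-wise visual attention ratio could be measured for a Hugging Face-style vision-language model; the tensor layout and the way image-token positions are identified are assumptions made for illustration, not the paper's exact procedure.

```python
import torch

def visual_attention_ratios(layer_attentions, image_token_mask):
    """Per-layer fraction of the last generated token's attention that falls on image tokens.

    layer_attentions: tuple of per-layer tensors shaped (batch, heads, q_len, k_len),
        e.g. one decoding step taken from
        model.generate(..., output_attentions=True, return_dict_in_generate=True).
    image_token_mask: boolean tensor over key positions marking image tokens
        (how these positions are located depends on the specific model).
    """
    ratios = []
    for attn in layer_attentions:
        a = attn[0, :, -1, :].mean(dim=0)   # last query position, averaged over heads
        ratios.append((a[image_token_mask].sum() / a.sum()).item())
    return ratios

# Tracking these ratios across layers and decoding steps shows whether attention to the
# image decays as the reasoning chain grows, which is the trend the paper's analysis reports.
```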

As shown in Figure (b), in a visual grounding task the non-reasoning model (Qwen2.5-VL) maintains stable attention to the key regions of the image (such as the cheese) across many layers, while for the same question the R1 model (R1-OneVision) shows clear visual degradation in its attention heat maps and is almost completely defocused in the deep layers.

This structural shift means the model often "guesses from language" even when a question plainly depends on the image, ultimately producing hallucinated answers that are severely detached from it.

Moreover, the research found that this phenomenon is particularly obvious when the model enters the "overthinking" stage.

As the reasoning chain extends, the model's attention to visual tokens continues to weaken, while its attention to language tokens such as instructions increases significantly, resulting in the generation process relying more and more on language clues rather than image content.

The "length paradox" of the reasoning chain: the more you think, the greater the hallucination?

Is a longer reasoning chain always better? The research team compared three strategies for controlling reasoning length (Token Budget Forcing, Test-Time Scaling, and Latent State Steering) across multiple benchmarks and, for the first time, systematically revealed a key phenomenon: the relationship between reasoning-chain length and model performance is non-monotonic and follows an inverted U shape.

As shown in the figure, on reasoning-dominated tasks (the two left panels) accuracy first rises as the reasoning chain extends and then falls once the chain becomes too long, indicating that "overthinking" does not necessarily yield stronger reasoning.

On perception-dominated tasks (the two right panels), the hallucination rate keeps climbing as reasoning length increases, indicating that redundant language generation systematically interferes with visual alignment.

This trend underscores that keeping the reasoning length within a reasonable range is key to improving the model's robustness and its ability to balance perception with reasoning.
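As an illustration of the kind of control these strategies provide, here is a minimal sketch of token budget forcing for a model that wraps its reasoning in <think> ... </think> tags; the tag names, the text-only setup, and the generation interface are assumptions made for brevity, not the paper's exact implementation.

```python
def generate_with_reasoning_budget(model, tokenizer, prompt, budget=256, answer_tokens=64):
    """Cap the visible reasoning chain at `budget` tokens; if the model has not closed
    its reasoning block within the budget, force the closing tag and elicit the answer.
    A sketch under the assumptions stated above (text-only, <think>/</think> delimiters)."""
    inputs = tokenizer(prompt + "<think>", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=budget)
    text = tokenizer.decode(out[0], skip_special_tokens=True)

    if "</think>" not in text:
        # Budget exhausted: close the reasoning block and ask for a concise final answer.
        text += "</think>\nFinal answer:"
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=answer_tokens)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
    return text
```

Sweeping `budget` over a range of values is one simple way to trace out the accuracy and hallucination curves discussed above.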

Metrics such as RH-AUC also give this non-linear relationship a more interpretable, quantitative description.

RH-AUC: dynamic evaluation of the reasoning-hallucination trade-off

Facing the dilemma that enhanced reasoning amplifies hallucination in multimodal models, the research team proposed a new evaluation metric: RH-AUC (Reasoning-Hallucination Area Under Curve).

Unlike traditional metrics that evaluate accuracy or hallucination rate at a single reasoning length, RH-AUC takes a holistic view, measuring how well the model balances its "thinking ability" and "seeing ability" across different reasoning depths.

Concretely, on the newly constructed RH-Bench dataset (1,000 samples spanning perception and reasoning), the model's reasoning accuracy and hallucination risk are measured at different reasoning lengths, and the area under the curve formed by the two is then computed.

The higher the RH-AUC, the better the model preserves its visual alignment while strengthening its reasoning: it can both "think deeply" and "see clearly".
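As a rough illustration, the sketch below computes an area-under-curve score of this kind from per-length measurements; the pairing of reasoning accuracy with (1 - hallucination rate) and the normalization are assumptions about the general idea, and the paper's exact formulation may differ.

```python
import numpy as np

def rh_auc(reasoning_acc, hallucination_rate):
    """Area under the curve traced by (reasoning accuracy, 1 - hallucination rate)
    pairs measured at increasing reasoning lengths, normalized to [0, 1].
    A sketch of the general idea; the paper's exact formulation may differ."""
    x = np.asarray(reasoning_acc, dtype=float)
    y = 1.0 - np.asarray(hallucination_rate, dtype=float)      # visual robustness
    order = np.argsort(x)                                       # integrate over increasing accuracy
    x, y = x[order], y[order]
    if x[-1] == x[0]:
        return float(y.mean())
    area = float(((y[1:] + y[:-1]) / 2.0 * np.diff(x)).sum())   # trapezoidal rule
    return area / (x[-1] - x[0])

# Illustrative (made-up) measurements at five reasoning-length budgets:
acc    = [0.42, 0.50, 0.55, 0.56, 0.53]   # reasoning accuracy
halluc = [0.18, 0.22, 0.28, 0.35, 0.44]   # hallucination rate
print(f"RH-AUC: {rh_auc(acc, halluc):.3f}")
```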

The experimental results reveal three key trends:

1. Larger models are more robust: as shown in Figure (a), the 7B model exhibits a smoother RH-AUC curve across thinking depths and a higher peak score, indicating a stronger ability to integrate reasoning and perception.

2. The RL-only training paradigm beats SFT + RL: as shown in Figure (b), across training strategies, models trained with pure RL achieve a higher average RH-AUC than the mixed paradigm, especially with long reasoning chains (0.57 vs 0.50).

This suggests that RL-only training tends to adaptively generate high-quality reasoning paths, whereas SFT + RL is more prone to redundant imitation that interferes with perceptual judgment.

3. The "type" of data is more important than the scale: The experiment found that instead of blindly expanding the scale of the training set, introducing a small number of samples with domain - aware features (such as mathematical reasoning or image perception tasks) is more helpful to guide the model to achieve a balance between "looking at the picture" and "thinking".

RH-AUC not only fills a gap in the evaluation landscape but also offers a clearer direction for the training goals of future multimodal models: more reasoning is not always better, and maintaining the tension between "seeing the image" and "understanding the question" is the better paradigm.

Reference materials:

https://arxiv.org/pdf/2505.21523 
