
The attention mechanism in multimodal large models hides a "scam", and a formula is needed to correct it.

QbitAI 2026-01-27 16:13
Attention in VLMs carries a significant bias, and pruning directly on it can lead to unintended consequences.

Is Attention Really Reliable?

In recent years, Vision-Language Models (VLMs) have made significant progress on multimodal understanding tasks such as visual question answering, image understanding, and video understanding. These models typically use language-to-vision attention to measure how relevant each visual token is to the text, and prune visual tokens accordingly to reduce inference cost and improve efficiency.

However, a long-neglected question is: Can attention itself really serve as a reliable indicator of "semantic importance"?

In a recent study, the team led by Dan Zeng at Shanghai University systematically analyzed how attention behaves in mainstream VLMs and identified a crucial but easily overlooked phenomenon: attention is not determined by semantics alone, but is significantly affected by structural biases. If this biased attention is used directly for visual token pruning, it often retains unimportant visual regions while discarding the information that actually matters for the task.

Two Core Sources of Attention Bias

1. Recency Bias: Attention Prefers "Later Tokens"

Through statistical analysis over a large number of samples, the team found that language-to-vision attention increases almost monotonically with the visual token's position in the sequence; in other words, the model tends to attend more to visual tokens near the end of the sequence.

In images, this typically shows up as the model assigning higher attention to the lower part of the image, a preference that has no direct relation to the image's actual content, as the curves in the accompanying visualizations show.

More seriously, when attention is used for visual token pruning, this positional bias is further amplified, so the pruning systematically retains visual tokens that appear late in the sequence but are semantically irrelevant.
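To make the recency bias concrete, here is a minimal diagnostic sketch (not the paper's code): it averages language-to-vision attention over many samples and fits a line against token position, where a clearly positive slope reflects the trend described above. The tensor shapes and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def positional_profile(attn: np.ndarray) -> np.ndarray:
    """attn: [num_samples, num_visual_tokens] language-to-vision attention,
    already averaged over heads and text queries. Returns per-position mean."""
    return attn.mean(axis=0)

attn_samples = np.random.rand(512, 576)        # placeholder: 512 samples, 24x24 visual tokens
profile = positional_profile(attn_samples)
positions = np.arange(profile.shape[0])
slope, intercept = np.polyfit(positions, profile, deg=1)
print(f"fitted slope per position: {slope:.2e}")   # a positive slope suggests recency bias
```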

2. Padding Attention Sink: Why Do Blank Areas Receive High Attention?

Beyond the positional bias, the team observed a subtler problem: attention on padding regions is abnormally high. In many VLMs, padding is unavoidable because input image sizes vary, yet these regions contain no useful semantic information.

Nevertheless, the study found that the visual tokens corresponding to padding often receive abnormally large attention weights. The root cause is extreme activation values in the hidden states, which induce the so-called attention sink phenomenon. This directly misleads attention-based pruning strategies and causes the model to incorrectly retain blank regions.
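As a rough illustration of the padding issue, the sketch below compares the share of attention that falls on padding tokens with the share of tokens that are actually padding; a large gap between the two indicates an attention sink on padding. The mask name, shapes, and data here are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def padding_attention_share(attn: np.ndarray, padding_mask: np.ndarray) -> float:
    """attn: [num_visual_tokens] attention assigned to each visual token.
    padding_mask: boolean [num_visual_tokens], True where the token is padding.
    Returns the fraction of total attention absorbed by padding tokens."""
    return float(attn[padding_mask].sum() / attn.sum())

attn = np.random.rand(576)                     # placeholder attention for one sample
padding_mask = np.zeros(576, dtype=bool)
padding_mask[-48:] = True                      # e.g. the last two token rows are padding
share = padding_attention_share(attn, padding_mask)
frac = padding_mask.mean()
print(f"padding covers {frac:.1%} of tokens but receives {share:.1%} of attention")
```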

Core Idea: Debiasing Attention Itself

To address these problems, the team led by Dan Zeng at Shanghai University did not propose yet another pruning method or introduce an additional training stage. Instead, they started from a more fundamental question: since attention itself is biased, can the attention be corrected first?

The team's core observation is that the bias in attention is not random noise but follows a stable, modelable overall trend. The researchers therefore explicitly modeled the positional bias by fitting the overall trend of attention as a function of token position, and then corrected the raw attention against this fitted trend, weakening position-dependent factors unrelated to content and bringing attention closer to true semantic relevance.
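A minimal sketch of what such positional debiasing could look like, assuming the trend is modeled as a linear fit of mean attention against token position, estimated on a small calibration set and then divided out per sample (the paper may use a different functional form and correction rule):

```python
import numpy as np

def fit_positional_trend(calib_attn: np.ndarray) -> np.ndarray:
    """calib_attn: [num_samples, num_visual_tokens] attention from a small
    calibration set. Fits a linear trend of mean attention vs. position."""
    profile = calib_attn.mean(axis=0)
    positions = np.arange(profile.shape[0])
    slope, intercept = np.polyfit(positions, profile, deg=1)
    return np.clip(slope * positions + intercept, 1e-8, None)   # avoid division by ~0

def debias(attn: np.ndarray, trend: np.ndarray) -> np.ndarray:
    """Divide out the fitted positional trend and renormalize to a distribution."""
    scores = attn / trend
    return scores / scores.sum()

calib = np.random.rand(256, 576)                 # placeholder calibration attention
calib /= calib.sum(axis=1, keepdims=True)
trend = fit_positional_trend(calib)

attn = np.random.rand(576)                       # one sample's raw attention
attn /= attn.sum()
scores = debias(attn, trend)                     # position-corrected scores
```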

Meanwhile, for the padding regions, the team explicitly suppressed their attention contribution during the pruning stage, so the attention sink no longer interferes with token ranking. The whole procedure requires no changes to the model architecture and no retraining, and can be applied directly at inference time, as in the sketch below.
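Putting the two corrections together, the following sketch shows how they could sit in front of a generic attention-based pruning step: debias the scores, mask out padding tokens, and keep the top-k tokens. It reuses `debias`, `trend`, and `padding_mask` from the sketches above; the keep ratio and the masking strategy are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def prune_visual_tokens(attn: np.ndarray, trend: np.ndarray,
                        padding_mask: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """attn: [num_visual_tokens] raw language-to-vision attention for one sample.
    Returns the indices of the visual tokens to keep after pruning."""
    scores = debias(attn, trend)                        # strip the positional trend
    scores = np.where(padding_mask, -np.inf, scores)    # padding tokens never rank high
    k = max(1, int(keep_ratio * attn.shape[0]))
    return np.argsort(scores)[-k:]                      # indices of the k highest scores

kept = prune_visual_tokens(attn, trend, padding_mask, keep_ratio=0.25)
```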

Experimental Results

In systematic experiments, the team integrated the attention-debiasing strategy as a plug-and-play module into a range of mainstream attention-based visual token pruning methods. The evaluation covered 6 pruning baselines, multiple mainstream VLMs (7B/13B), 10 image understanding tasks, and 3 video understanding tasks.

The results show that in almost all settings, the pruned models achieved stable performance improvements after attention debiasing, with especially clear gains under more aggressive token compression.

Conclusion

The findings show that attention is not inherently equivalent to semantic importance. In Vision-Language Models, if the structural bias built into attention is ignored, attention-based pruning strategies are easily misled, which degrades overall model performance.

By debiasing attention simply and effectively, the team led by Dan Zeng at Shanghai University significantly improved the reliability and generalization of visual token pruning without adding any training cost. The work offers a new perspective on the efficient deployment of multimodal models and lays groundwork for designing more robust attention mechanisms.

Article Link: https://arxiv.org/abs/2508.17807

Article Code: https://github.com/intcomp/attention-bias

Authors: Kai Zhao¹, Wubang Yuan¹, Yuchen Lin¹, Liting Ruan¹, Xiaofeng Lu¹, Deng-Ping Fan², Ming-Ming Cheng², Dan Zeng¹. ¹ School of Communication and Information Engineering / School of Computer Engineering and Science, Shanghai University; ² School of Computer Science, Nankai University.

This article is from the WeChat official account "QbitAI", author: Intcomp Team. It is published by 36Kr with permission.