
Skipping 88% of experts while retaining 97% of performance: the right way to run Mixture-of-Experts (MoE) inference.

新智元2026-03-05 18:00
MoDES: a multimodal large model skips 88% of experts, retains 97% of performance, and doubles inference speed.

[Introduction] A new CVPR paper, MoDES, boosts the inference efficiency of multimodal large models: without any training, it intelligently skips 88% of redundant experts while retaining 97% of performance, overturning the old assumption that skipping more experts necessarily degrades performance, and doubling inference speed.

Multimodal large models are rapidly moving towards large-scale development. To handle higher-resolution images, longer video sequences, and more complex cross-modal tasks, the scale of model parameters continues to grow.

The Mixture-of-Experts (MoE) architecture has become the mainstream choice: by activating only a partial set of expert networks, it aims to reduce computational overhead while maintaining the model scale.

However, the problem remains: even with MoE, the inference cost of multimodal models is still high.

Each token still has to interact with multiple experts, and a large amount of computation is spent on experts that are not truly critical. Although MoE avoids activating all parameters, it does not truly achieve computation on demand.
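For intuition, here is a minimal, generic top-k MoE routing sketch in NumPy (not the paper's actual router): every token always pays for k expert FFN calls, whether or not those experts meaningfully change its representation.

```python
import numpy as np

def topk_route(router_logits, k=2):
    """Softmax-normalise router logits and pick the top-k experts per token."""
    probs = np.exp(router_logits - router_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    experts = np.argsort(-probs, axis=-1)[:, :k]  # expert ids, one row per token
    return experts, probs

# 4 tokens routed over 8 experts: 2 expert FFN calls per token, even for
# experts whose routing probability is barely above the rest.
rng = np.random.default_rng(0)
experts, probs = topk_route(rng.normal(size=(4, 8)), k=2)
print(experts.shape)  # (4, 2)
```

This is the redundancy MoDES targets: some of those k calls contribute little to the output and can, in principle, be skipped.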

In scenarios such as video understanding or long context processing, this redundancy is quickly magnified, becoming a bottleneck for inference.

Thus, a natural question arises: Can redundant experts be dynamically skipped during the inference phase?

Existing expert-skipping methods have achieved good results on text-only LLMs. However, when applied directly to multimodal models, they often cause significant performance degradation: the more experts are skipped, the more severe the drop, and the model may even collapse at high skipping ratios.

A research team from institutions such as the Hong Kong University of Science and Technology, Beihang University, and Peking University proposed MoDES (Multimodal Dynamic Expert Skipping). They systematically analyzed the root cause of the failure of multimodal MoE skipping and presented a training-free dynamic expert skipping framework for multimodal MoE. This work has been accepted by CVPR.

Paper link: https://arxiv.org/pdf/2511.15690 

Code link: https://github.com/ModelTC/MoDES

On Qwen3-VL-MoE-30B, MoDES retains 97.33% of the original performance while skipping 88% of the experts. Meanwhile, it brings significant inference acceleration, breaking a long-standing consensus that high-proportion expert skipping necessarily leads to unacceptable performance loss.

Figure 1 Performance comparison between MoDES and existing methods on 13 benchmarks under different skipping ratios

MoDES does not directly propose new rules. Instead, it first answers a more fundamental question: Why do skipping methods designed for text models significantly fail on multimodal MoE?

The paper presents two key observations.

The global contributions of experts at different layers to the final output are highly imbalanced. Existing skipping methods usually judge an expert's importance only by its routing probability in the current layer, ignoring a key fact: the influence of experts at different layers on the final prediction distribution varies greatly.

Experiments show that when reducing the number of routed experts, reducing experts in the shallow layers leads to more significant performance degradation, while the impact of reducing experts in the deep layers is relatively small. This means that errors in the shallow layers are gradually magnified in subsequent layers, leading to performance collapse.

In other words, the importance of experts is not only a matter of "local routing probability" but also of "the degree of influence on the final output". If a layer-independent unified rule is adopted, it is easy to skip too many experts in the critical shallow layers. The relevant phenomenon is shown in Figure 2.

Figure 2 Performance changes after reducing experts in different layer ranges

There are significant differences between the behavior of text tokens and visual tokens. The paper further analyzes this modal difference: by visualizing and statistically analyzing token representations before and after the FFN, the researchers found that text tokens undergo significantly larger updates in the FFN, visual tokens are closer to orthogonal to the expert weights, and experts have comparatively little influence on visual tokens.

This means that experts are more crucial for text reasoning, while there is higher redundancy for visual tokens. If the skipping strategy does not distinguish between modalities, it is likely to mistakenly skip experts that are crucial for text understanding, leading to performance degradation. The relevant analysis is shown in Figure 3.
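The modality analysis can be reproduced in spirit with two simple per-token statistics (a toy sketch of our own, not the paper's measurement code): the relative FFN update magnitude and the cosine between the input and the update.

```python
import numpy as np

def ffn_update_stats(h_in, h_out):
    """Per-token relative FFN update magnitude and cosine(input, update)."""
    delta = h_out - h_in
    eps = 1e-9
    rel = np.linalg.norm(delta, axis=-1) / (np.linalg.norm(h_in, axis=-1) + eps)
    cos = (h_in * delta).sum(-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(delta, axis=-1) + eps)
    return rel, cos

# Synthetic data mimicking the observation: "text-like" tokens get a large,
# input-aligned update; "vision-like" tokens get a small, near-orthogonal one.
rng = np.random.default_rng(0)
h_text = rng.normal(size=(32, 64))
h_vis = rng.normal(size=(32, 64))
text_out = h_text + 0.8 * h_text + 0.1 * rng.normal(size=h_text.shape)
vis_out = h_vis + 0.05 * rng.normal(size=h_vis.shape)

rel_t, cos_t = ffn_update_stats(h_text, text_out)
rel_v, cos_v = ffn_update_stats(h_vis, vis_out)
print(rel_t.mean(), rel_v.mean())
```

Under these assumptions, the text-like tokens show a much larger relative update than the vision-like ones, mirroring the paper's finding that experts matter more for text.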

Figure 3 Analysis of the differences between text and visual tokens in the FFN

These two observations together point to a core conclusion: The importance of experts in multimodal MoE needs to be both output-aware and modality-aware.

Output-aware + Modality-aware Dynamic Skipping Framework

Based on the above insights, MoDES constructs an output-aware and modality-aware dynamic expert skipping mechanism. The overall process is shown in Figure 4.

Figure 4 Framework diagram of MoDES

First, MoDES introduces a layer-wise global importance factor on top of the original routing probability, describing the overall impact of the experts at the l-th layer on the final output distribution.

This factor is obtained through offline calibration, that is, by comparing the differences in the model output distribution before and after removing the experts at that layer, thereby quantifying the global contribution of the experts at that layer. The new expert importance score is jointly determined by the local routing probability and the global factor. In this way, experts in the shallow layers will be more conservatively retained, while experts in the deep layers can be skipped more aggressively, achieving true output-aware skipping.
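A plausible sketch of the offline calibration step follows (the function names and the choice of KL divergence are our assumptions; the paper only states that output distributions with and without a layer's experts are compared):

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL divergence between two discrete output distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def calibrate_layer_factors(full_dist, ablated_dists):
    """Global importance factor per layer: how strongly the final output
    distribution shifts when that layer's experts are removed."""
    shift = np.array([kl(full_dist, d) for d in ablated_dists])
    return shift / shift.max()  # normalise so the most critical layer gets 1.0

# Toy calibration over 3 layers: ablating the "shallow" layer distorts the
# output distribution most, so it receives the largest factor.
full = np.array([0.70, 0.20, 0.10])
ablated = [np.array([0.20, 0.50, 0.30]),    # shallow layer removed: big shift
           np.array([0.60, 0.25, 0.15]),    # middle layer removed
           np.array([0.68, 0.21, 0.11])]    # deep layer removed: tiny shift
g = calibrate_layer_factors(full, ablated)
print(g)
```

The resulting factors decrease from shallow to deep layers, which is exactly what makes shallow experts more conservatively retained downstream.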

Second, MoDES introduces a dual-modal threshold mechanism, setting separate skipping thresholds for text tokens and visual tokens. Distinguishing between modalities makes the expert-skipping decision more refined and avoids accidentally skipping experts that are critical to one modality.
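Putting the two ingredients together, the per-expert keep/skip decision might look like this (the threshold values and names here are illustrative assumptions, not the paper's):

```python
def keep_expert(route_prob, layer_factor, is_visual,
                tau_text=0.05, tau_vision=0.15):
    """Output-aware, modality-aware skip rule: score an expert by its local
    routing probability times the layer's global importance factor, then
    compare against a modality-specific threshold (looser for visual tokens,
    where experts are more redundant)."""
    score = route_prob * layer_factor
    return score >= (tau_vision if is_visual else tau_text)

# A shallow-layer expert (factor 1.0) with modest routing probability is kept
# for a text token, while the same score is skipped for a visual token.
print(keep_expert(0.10, 1.0, is_visual=False),   # 0.10 >= 0.05 -> keep
      keep_expert(0.10, 1.0, is_visual=True))    # 0.10 <  0.15 -> skip
```

The same expert can thus be essential for one modality and redundant for the other, which is precisely the behavior a single global threshold cannot express.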

Finally, to efficiently find the optimal threshold combination, MoDES designs a frontier search algorithm. Exploiting the monotonic relationship between performance and the skipping ratio, it avoids exhaustively evaluating every pair of candidate thresholds (quadratic in the number of candidates) and instead walks the performance frontier (roughly linear), shortening the search time by approximately 45 times while producing consistent results.
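The frontier search can be illustrated with a staircase walk over a grid of (text, vision) threshold pairs. This is our own minimal reconstruction under the stated monotonicity assumption, not the paper's implementation: if measured performance is non-increasing in both thresholds, the boundary between acceptable and unacceptable configurations can be traced in O(m + n) steps instead of evaluating all m x n pairs.

```python
def frontier(perf, target):
    """perf[i][j]: performance with the i-th text threshold and j-th vision
    threshold, assumed non-increasing in both i and j (more skipping ->
    lower performance). Returns, per row, the largest j still meeting the
    target (-1 if none), walking the staircase in O(m + n)."""
    j = len(perf[0]) - 1
    out = []
    for row in perf:
        while j >= 0 and row[j] < target:
            j -= 1
        out.append(j)
    return out

# 3x3 grid of measured accuracies; keep everything scoring at least 0.90.
perf = [[0.99, 0.97, 0.90],
        [0.98, 0.95, 0.88],
        [0.93, 0.89, 0.80]]
print(frontier(perf, 0.90))  # [2, 1, 0]
```

Among the frontier configurations, the pair that skips the most experts while clearing the performance target would then be selected; the reported ~45x reduction in search time plausibly comes from avoiding the full grid evaluation in this way.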

Figure 5 Comparison of calibration and search time

Experimental Results

In the main experiments, MoDES was systematically evaluated on multiple mainstream multimodal MoE models.

On Kimi-VL-A3B-Instruct, when skipping 83% of the experts, the average performance of most existing expert-skipping methods drops by more than 11%, while MoDES still retains 96.25% of the original performance (see Figure 6). This result shows that high-proportion skipping does not necessarily lead to performance collapse: as long as expert importance is accurately modeled, redundant experts can be effectively identified.

On the larger-scale Qwen3-VL-MoE-30B-A3B-Instruct, the advantages of MoDES are even more obvious. When skipping 88% of the experts, MC-MoE retains only 86.66% of the performance and DiEP 85.30%, while MoDES still retains 97.33% of the original performance (see Figure 7). Across 13 image and video understanding benchmarks, MoDES achieved the best or nearly the best performance.

Figure 6 Performance comparison of Kimi-VL under different skipping ratios

Figure 7 Performance comparison across backbones

This result shows that high-proportion skipping is not infeasible. The key lies in whether the global contribution of experts to the final output and the behavioral differences of tokens in different modalities can be correctly modeled.

Inference Efficiency and Quantization Compatibility

In actual inference tests, MoDES achieved significant acceleration on the H200 GPU: approximately 2x in the prefill stage and still approximately 1.2x in the decoding stage (see Figure 8). Since MoDES is a training-free method and introduces no additional computational overhead during inference, the acceleration is stable.

In addition, MoDES has good compatibility with mixed-precision quantization. It maintains high performance even under low-bit quantization, indicating that skipping and quantization complement each other at the structural and numerical levels respectively, jointly reducing the computational cost of multimodal MoE.

Figure 8 Comparison of inference speed. (Top) Qwen3-VL; (Bottom) Kimi-VL.

Summary

The core contribution of MoDES is to propose a truly output-aware and modality-aware multimodal expert skipping mechanism.

By explicitly modeling the global contribution of experts at different layers to the final output distribution, and the update characteristics of tokens of different modalities in the expert network, MoDES demonstrates an important point: even when more than 80% of the experts are skipped, as long as the skipping is "smart" enough, model performance can be stably maintained.

In the context of the continuous expansion of multimodal model scale, this skipping idea based on output-impact modeling provides a more robust and practical path for optimizing the inference efficiency of large models.

Reference: https://arxiv.org/pdf/2511.15690  

This article is from the WeChat public account "New Intelligence Yuan", edited by LRST. Republished by 36Kr with permission.