Running MoE on Mobile Phones: Meta Proposes MobileMoE, Up to 3.8 Times Faster on iPhone 16 Pro

Uses less memory and runs faster

In recent years, Mixture of Experts (MoE) models have been widely used in large cloud-based models. On mobile phones, however, Large Language Models (LLMs) still mainly adopt dense architectures. Mobile devices have long faced tighter constraints on memory, computing power, and latency, and edge-side MoE with fewer than one billion active parameters has lacked systematic study. Now that the DRAM capacity of mobile devices is growing, MoE has a chance to be deployed on smartphones as well.

MobileMoE, proposed by the Meta team, achieves efficient MoE inference on commercial smartphones for the first time. On 14 basic evaluations, with similar memory usage, MobileMoE-S/M uses only 1/2 to 1/4 of the inference computation of the dense baseline while reaching comparable or even higher average accuracy. In on-device tests, MobileMoE-S shows its largest speedup on the GPU/MLX backend of the iPhone 16 Pro, up to 3.8 times in the input (prefill) stage.

Paper link: https://arxiv.org/abs/2605.27358

The research team also proposed a set of edge-side MoE scaling rules to determine the model structure more suitable for mobile phone deployment. MobileMoE has established a new Pareto frontier for edge-side large language models, achieving better results in the trade-off between accuracy and inference computation overhead.

Figure | MobileMoE has established a new Pareto frontier for edge-side large language models.
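
To make the term concrete: a model sits on the Pareto frontier if no other model is at least as cheap in compute and at least as accurate, while being strictly better on one of the two. The minimal sketch below computes such a frontier. The model names and numbers are invented purely for illustration and are not results from the paper.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    compute: float   # inference compute per token, arbitrary units (lower is better)
    accuracy: float  # average benchmark accuracy (higher is better)


def pareto_frontier(points: list[ModelPoint]) -> list[ModelPoint]:
    """Return the points that no other point dominates."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by compute ascending; on ties, put the most accurate point first.
    for p in sorted(points, key=lambda p: (p.compute, -p.accuracy)):
        # Every earlier point is at least as cheap, so p survives only if it
        # is strictly more accurate than everything seen so far.
        if p.accuracy > best_accuracy:
            frontier.append(p)
            best_accuracy = p.accuracy
    return frontier


# Hypothetical data, not taken from the paper.
points = [
    ModelPoint("dense-a", 1.0, 50.0),
    ModelPoint("dense-b", 2.0, 55.0),
    ModelPoint("moe-a", 0.5, 52.0),
    ModelPoint("moe-b", 1.0, 57.0),
]
frontier = pareto_frontier(points)
print([p.name for p in frontier])
```

In this toy data both dense points are dominated: each has an MoE counterpart that is no more expensive and more accurate.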

How is MobileMoE designed?

MobileMoE is an MoE language model designed for edge-side deployment. Its overall structure is still a decoder-only Transformer, but the dense feed-forward layer is replaced with an MoE layer. For each token, the router selects the few experts with the highest scores to take part in the computation, and a shared expert always takes part as well. Training is divided into four stages: pretraining, mid-training, supervised fine-tuning, and quantization-aware training.

Pretraining: The research team conducted pretraining using approximately 6T tokens of openly licensed data with a context length of 2048. The data is mainly from the Web, covering fields such as mathematics, code, knowledge, and science.

Mid-training: The research team extended the context length to 8192 and further increased the proportion of high-quality data in fields such as knowledge, code, mathematics, and science, with a total scale of approximately 500B tokens.

Supervised fine-tuning (SFT): The research team fine-tuned MobileMoE-Base on more than 80 million samples of openly licensed instruction fine-tuning data.

Quantization-aware training: The research team quantized the linear layers and embeddings to INT4, dynamically quantized the activations to INT8, and kept the router at FP32 precision.
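
As a rough illustration of that precision recipe, the sketch below "fake-quantizes" a linear layer in NumPy: weights are rounded to a 4-bit grid and activations to an 8-bit grid whose scale is computed per token at run time. The symmetric scheme, the per-output-channel weight scales, and the function names are all assumptions made for illustration; the article does not specify the quantization granularity. Only the forward pass is shown, whereas actual quantization-aware training also needs a gradient rule for the rounding step.

```python
import numpy as np


def fake_quantize(x: np.ndarray, bits: int, axis: int) -> np.ndarray:
    """Symmetric quantize-then-dequantize with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    qmin = -(2 ** (bits - 1))           # -8 for INT4, -128 for INT8
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)   # avoid division by zero
    q = np.clip(np.round(x / scale), qmin, qmax)         # values on the integer grid
    return (q * scale).astype(x.dtype)                   # mapped back to float


def quantized_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """y = x @ w.T with INT4-grid weights and dynamically quantized INT8-grid inputs."""
    w_q = fake_quantize(w, bits=4, axis=1)    # assumed: one scale per output channel
    x_q = fake_quantize(x, bits=8, axis=-1)   # assumed: one scale per token, set at run time
    return x_q @ w_q.T


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 16)).astype(np.float32)   # 3 tokens, hidden size 16 (toy)
w = rng.standard_normal((8, 16)).astype(np.float32)   # 8 output features (toy)

y_ref = x @ w.T
y_q = quantized_linear(x, w)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative error introduced by quantization: {rel_err:.3f}")
```

The router is left out on purpose: in the recipe above it stays in FP32, so its matrix multiply would bypass `fake_quantize` entirely.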

Figure | The four-stage training of MobileMoE.

Experimental results

Ablation experiment results

The research team first compared three architectural variables: the number of experts E, the expert granularity g, and whether to add a shared expert.

Figure | Scaling of the number of experts E.

Under a fixed memory budget, once the budget exceeds approximately 0.25GB, the loss of MoE begins to fall below that of the corresponding dense model. Increasing the number of experts E reduces the loss further, but by the time E reaches 8 the marginal benefit has weakened significantly. The experiment on expert granularity g shows that finer-grained experts are generally better, and that g = 8 strikes a good balance between quality and training overhead: when g increases from 8 to 16, the loss improves by less than 0.01, while training time increases by approximately 50%. Under the same computation budget, adding a shared expert further reduces the model loss.

Based on the ablation results, the research team adopted the configuration of E = 8, g = 8, and a shared expert, that is, 60 fine-grained routed experts, Top-4 routing, and 1 shared expert, and applied this structure to all three versions, MobileMoE-S/M/L.
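
The layer described above (60 routed experts, Top-4 routing, 1 always-on shared expert) can be sketched in a few lines of NumPy. This is a toy sketch under stated assumptions, not the paper's implementation: the hidden sizes, the ReLU feed-forward experts, the softmax over the selected scores, and the random weights are all illustrative choices that the article does not confirm.

```python
import numpy as np

NUM_ROUTED, TOP_K = 60, 4      # from the article: 60 routed experts, Top-4 routing
D_MODEL, D_EXPERT = 32, 16     # toy sizes, chosen only for illustration


def init_expert(rng, d_model, d_hidden):
    """A two-layer feed-forward expert (weights only, no biases)."""
    return (rng.standard_normal((d_model, d_hidden)) * 0.1,
            rng.standard_normal((d_hidden, d_model)) * 0.1)


def run_expert(x, expert):
    w_in, w_out = expert
    return np.maximum(x @ w_in, 0.0) @ w_out    # ReLU FFN (assumed activation)


def moe_layer(x, router_w, routed, shared):
    """x: (tokens, d_model). Returns (output of the same shape, chosen expert ids)."""
    logits = x @ router_w                                  # (tokens, NUM_ROUTED)
    top_idx = np.argsort(-logits, axis=-1)[:, :TOP_K]      # highest-scoring experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected scores only (an assumed normalization).
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)

    out = run_expert(x, shared)             # the shared expert sees every token
    for t in range(x.shape[0]):             # per-token loop: clear rather than fast
        for k in range(TOP_K):
            expert = routed[top_idx[t, k]]
            out[t] += gates[t, k] * run_expert(x[t:t + 1], expert)[0]
    return out, top_idx


rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, NUM_ROUTED))
routed = [init_expert(rng, D_MODEL, D_EXPERT) for _ in range(NUM_ROUTED)]
shared = init_expert(rng, D_MODEL, D_EXPERT)

tokens = rng.standard_normal((5, D_MODEL))
y, chosen = moe_layer(tokens, router_w, routed, shared)
print(y.shape, chosen.shape)
```

Each token runs only 4 of the 60 routed experts plus the shared one, which is why the computation per token is far smaller than the total parameter count would suggest.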

Figure | Compute-optimal scaling of the MoE model.

Figure | Training efficiency of the MoE architecture.

14 basic evaluations: Establishing a new edge-side Pareto frontier

The research team re-evaluated MobileMoE together with models such as Gemma 3, SmolLM2, Qwen3.5, OLMo 2, and OLMoE-1B-7B under unified settings on 14 basic evaluations spanning five categories: commonsense reasoning, knowledge, science, reading, and reasoning.

Figure | Pretraining trajectory of MobileMoE.

The comparison of Base models shows that MobileMoE-M has a higher average score than Qwen3.5 2B, and MobileMoE-L has a higher average score than OLMoE-1B-7B despite being a smaller model. The research team also notes that the Base version of MobileMoE-L already scores higher on average than the Instruct version of OLMoE-1B-7B. In terms of training scale, MobileMoE uses approximately 6T pretraining tokens, fewer than the 9T of Llama 3.2 1B and the 11T of SmolLM2 1.7B. In the overall comparison of instruction fine-tuned models, the average accuracy of MobileMoE-M is close to that of OLMoE-1B-7B, with roughly 60% fewer active parameters and roughly 60% fewer total parameters.

Figure | Comparison of MobileMoE-Base models.

Advanced evaluations: More obvious advantages in code and math tasks

In the advanced evaluations after instruction fine-tuning, MobileMoE stands out most in code and math tasks. Taking MobileMoE-L as an example, its average scores in both code and math evaluations are higher than those of Qwen3.5 2B and OLMoE-1B-7B. However, the research team also notes that Qwen3.5 2B remains stronger in instruction following and in knowledge and reasoning.

Figure | Comparison of Instruct models in advanced benchmark tests.

Quantization and edge-side deployment: Remaining competitive after INT4, significant speedup on mobile phones

After quantization, the overall average scores of MobileMoE-S/M/L drop by roughly 2 to 3 points compared with their respective BF16 versions. Even so, the INT4 version of MobileMoE-L still outperforms the BF16 version of OLMoE-1B-7B Instruct.

The research team also deployed MobileMoE on the Samsung Galaxy S25 and iPhone 16 Pro for testing. With comparable INT4 weight memory, MobileMoE-S is 1.8 to 3.8 times faster than MobileLLM-Pro in the input (prefill) stage and 2.2 to 3.4 times faster in the per-token generation (decode) stage.

In terms of memory usage, on the Samsung Galaxy S25 with an 8K context and real prompts, the peak RSS of MobileMoE-S is 1.49GB, lower than the 1.91GB of MobileLLM-Pro.

Figure | Edge-side runtime latency.

Limitations and future directions

Currently, the instruction fine-tuned MobileMoE still lags behind Qwen3.5 2B in higher-order instruction following and in knowledge and reasoning. The research team believes this gap may be related to Qwen3.5 2B having undergone more comprehensive post-training. To narrow it, future training work needs to strengthen distillation, post-training for reasoning, and multimodal expansion.

In addition, the research team pointed out that the memory usage of MoE on mobile phones varies with the input content. Compared with fixed template inputs, real inputs usually result in higher memory usage. If only template-based inputs are used for testing, the memory pressure in actual deployment scenarios may be underestimated. In the future, to more accurately evaluate the real memory performance of edge-side MoE, more real-world test data is still needed.
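
The article does not explain why input content changes memory usage. One plausible reading, stated here as an assumption, is that expert weights only become resident once some token is routed to them, so inputs that activate more distinct experts pull more weights into memory. The toy simulation below counts how many of the 60 experts are selected at least once for a repetitive "template" input versus a more varied input. The random router, the random token vectors, and the function name are hypothetical stand-ins, not measurements of MobileMoE.

```python
import numpy as np

NUM_EXPERTS, TOP_K = 60, 4    # from the article: 60 routed experts, Top-4 routing
D_MODEL = 32                  # toy hidden size, chosen only for illustration


def experts_touched(tokens: np.ndarray, router_w: np.ndarray) -> int:
    """Count the distinct experts selected at least once for this input."""
    logits = tokens @ router_w
    top_idx = np.argsort(-logits, axis=-1)[:, :TOP_K]
    return int(np.unique(top_idx).size)


rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS))   # random stand-in router

# Stand-in for a real prompt: 512 distinct token representations.
diverse = rng.standard_normal((512, D_MODEL))
# Stand-in for a template prompt: the same 4 representations repeated 128 times.
template = np.tile(diverse[:4], (128, 1))

n_template = experts_touched(template, router_w)
n_diverse = experts_touched(diverse, router_w)
print(f"template input touches {n_template} of {NUM_EXPERTS} experts")
print(f"diverse input touches  {n_diverse} of {NUM_EXPERTS} experts")
```

Both inputs have the same length, yet the template can reach at most 4 x 4 = 16 experts, while the varied input can reach any of the 60. Under the stated assumption, a benchmark built only on templates would therefore see fewer expert weights loaded than real usage does.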

Meanwhile, the research team has completed systematic on-device tests on the CPU and GPU backends, but the NPU route still needs to be explored. In the future, dynamic routing, expert pruning, mixed-precision quantization, and mobile NPU deployment are all directions for further improving edge-side efficiency.

For more technical details, please refer to the original paper.

This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao), author: Xia Qiansi, published by 36Kr with authorization.
