
As difficult as stuffing an elephant into a refrigerator: are edge-side large models a gimmick or the future?

Geekbang Technology InfoQ | 2025-10-14 16:29

As the development of large model technology enters deeper waters, the experience, cost, and privacy of AI applications are becoming increasingly crucial issues. If large models can be deployed directly on the terminal side, this will undoubtedly be highly attractive for industrial applications. So how can the huge model size and computational complexity be overcome when deploying large models on the terminal side?

Recently, the live broadcast program "Geek Gathering," produced by InfoQ in collaboration with AICon, invited Dr. Zhu Shiai, head of the xNN engine at Ant Group and researcher at the Alipay Multimodal Application Laboratory, to serve as host. Together with Dr. Xu Mengwei, associate professor and doctoral supervisor at Beijing University of Posts and Telecommunications, and Zhang Wu, an ecosystem technology expert for terminal-side CANN at Huawei, they discussed the current state of terminal-side large models, the tangible progress that has been made, and future opportunities, ahead of the upcoming QCon Global Software Development Conference 2025 Shanghai Station.

Some of the key viewpoints are as follows:

  • Whether in the era of traditional AI or of large models, the advantages of real-time performance, privacy, and cost have laid the foundation for the development of terminal-side AI. Although the scale and computing power requirements of large models make terminal-side deployment more challenging than before, terminal-side AI should keep breaking through technological bottlenecks as generative AI develops.
  • Large models will gradually sink to become system-level services of the operating system. As their role in terminals and intelligent hardware becomes more important, the operating system also needs to adapt to this change.
  • The advantage of terminal-side AI lies in privacy protection and rapid response, while cloud-side AI excels in leveraging big data and powerful computing power. Combining the characteristics of both, terminal-cloud collaboration is undoubtedly an ideal solution.
  • If one really wants to start a business and seek a higher ceiling, relying solely on large models themselves will be quite difficult. Developing large models is of course very important, but to independently support the development of a company, it is necessary to combine practical scenarios, such as application development, intelligent agents, drones, or other deeply vertical fields.

The following content is based on the live broadcast transcript and has been abridged by InfoQ.

How to deploy terminal-side large models?

Zhu Shiai: What is the current development status of terminal-side large models, and what tangible progress has been made?

Xu Mengwei: Terminal-side large models refer to directly deploying the operation (mainly inference) of large models on terminal devices. In contrast, the current mainstream closed-source SOTA models usually run on large GPU clusters or data centers in the cloud, and complete inference through remote requests. This process is usually called serving. The scope of the "terminal side" is very broad, ranging from IoT devices with weak computing power and only low-performance CPUs, to smartphones with medium computing power, and then to robots and PCs. As long as the devices are not placed in remote data centers or edge clouds, they can be called terminal-side AI.

As early as two or three years ago, when our team started researching terminal-side large models, many people thought that a "model placed on the terminal" was not considered "large." In fact, there is no unified standard for large models. Personally, I believe that as long as it is an autoregressive, decoder-only Transformer model (whether with standard attention or lighter linear attention) with a parameter scale exceeding 100 million, it can be called a large model. In other words, as long as it is a foundation model that can handle multiple tasks and can adapt to different downstream tasks through simple fine-tuning, it can be called a large model.

Why deploy large models on the terminal side? Although it is very convenient to call cloud APIs, terminal-side deployment has several advantages, similar to traditional terminal-side AI. Firstly, it is about privacy. In the era of large models, the model may utilize almost all the data generated on the terminal, including recordings, texts, inputs, screen clicks, etc. Therefore, the privacy issue is more prominent than ever. Secondly, terminal-side inference can get rid of network dependence and improve availability, and can even run offline. At the same time, it avoids the network round-trip delay (RTT) and the delay problems caused by batch scheduling in cloud serving. If optimized properly on the terminal, the overall delay can be significantly reduced. Finally, from the perspective of enterprises, distributing the computing to user terminals can reduce the cost of maintaining super-large GPU clusters, which is also a strong business motivation.

In the academic community, terminal-side large models are not a single field but a multi-disciplinary technology under a specific scenario. Researchers can be roughly divided into three categories: one is the algorithm direction, focusing on large model lightweighting, pruning, distillation, and new architecture design; another is the software direction, involving system software, high-performance computing, mobile or edge computing, and embedded systems; the third is the hardware architecture direction, focusing on circuit and accelerator design. It is difficult to compete with NVIDIA in the cloud field, but there are more opportunities in terminal-side research.

Zhang Wu: I will talk about the challenges of migrating cloud-side large models to the terminal side from several dimensions. The first is memory. Cloud memory can be expanded almost without limit, while the memory of terminals such as mobile phones is mostly 8-12 GB. The BF16 inference accuracy used on the cloud side therefore cannot be carried over to the terminal side, and the model must be adapted to the limited memory through aggressive quantization and compression. The second is accuracy alignment. Cloud-side models usually do not require quantization, while on the terminal side an FP32 model must be compressed to 4 bits or even lower. Support for quantization algorithms varies greatly among manufacturers, which makes accuracy alignment challenging. In addition, there is the cost of development and adaptation. On the cloud side, a PyTorch project only needs partial inference optimization to go online quickly, while on the terminal side one has to start almost from scratch: manufacturers need to develop their own high-performance operators to build inference capabilities, so the development cost is much higher than on the cloud side.
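
To make the memory constraint concrete, here is a minimal back-of-the-envelope sketch in Python, using illustrative model sizes rather than figures from the discussion, of how bit width drives the weight footprint and why a BF16 checkpoint that fits easily in cloud memory must be quantized to 4 bits or lower before it fits a phone's budget:

```python
# Rough weight-memory estimate for an on-device LLM at different bit widths.
# Model sizes below are illustrative assumptions, not figures from the interview.

def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for weights only (ignores KV cache and activations)."""
    return num_params * bits_per_weight / 8 / 1e9

for label, params in [("1B model", 1e9), ("3B model", 3e9)]:
    for bits in (16, 8, 4):
        print(f"{label} @ {bits}-bit: ~{weight_memory_gb(params, bits):.2f} GB")

# A 3B model needs ~6 GB at BF16 -- most of an 8-12 GB phone -- but ~1.5 GB at
# 4-bit, which is why low-bit quantization is the first step of terminal-side deployment.
```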

In response to these problems, Huawei's CANN toolchain provides developers with a solution for quickly deploying AI models on the terminal side. The first is quantization and memory optimization: the CANN toolchain provides NPU-friendly low-bit quantization algorithms, which significantly reduce the model's memory footprint and enable large models to run on terminals such as mobile phones. The second is the ability to customize operators. Because quantization algorithms differ across vendors, mobile phone manufacturers cannot adapt to each one individually, so the CANN toolchain supports developing custom operators in Ascend C. For example, when we cooperated with Alipay, they wanted the NPU quantization accuracy to be consistent with the CPU implementation. The Alipay team implemented an NPU version of its own QuantMatmul algorithm in Ascend C, ultimately ensuring accuracy consistency. At the same time, Ascend C operators are unified across terminal and cloud, supporting one-time development and multi-device deployment, which greatly improves the efficiency of subsequent cross-platform porting. Finally, there is support for model generalization: the CANN toolchain already provides adaptation solutions for mainstream open-source models in the industry (such as Tongyi Qianwen/Qwen, LLaMA, ChatGLM, etc.) and continues to optimize large-model scenarios such as multi-modal and MoE.
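
As a rough illustration of what such a custom quantized-matmul operator has to reproduce bit-for-bit across CPU and NPU, the following NumPy sketch (my own simplification, not the actual Alipay QuantMatmul nor any CANN/Ascend C API) shows per-group low-bit dequantization followed by a matmul; "accuracy alignment" means the NPU kernel must match this reference numerics:

```python
import numpy as np

# Hypothetical reference (CPU-style) quantized matmul for illustration only.

def quant_matmul(x, w_int, scales, group_size=64):
    """x: [m, k] float32; w_int: [k, n] int8 holding 4-bit values in [-8, 7];
    scales: [k // group_size, n] float32 per-group scales."""
    k, n = w_int.shape
    w = np.empty((k, n), dtype=np.float32)
    for g in range(k // group_size):
        rows = slice(g * group_size, (g + 1) * group_size)
        w[rows] = w_int[rows].astype(np.float32) * scales[g]  # dequantize one group
    return x @ w

# Tiny usage example with assumed shapes.
x = np.random.randn(2, 128).astype(np.float32)
w_int = np.random.randint(-8, 8, size=(128, 64), dtype=np.int8)
scales = np.full((128 // 64, 64), 0.01, dtype=np.float32)
y = quant_matmul(x, w_int, scales)  # an NPU kernel is "aligned" if it matches y
```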

Zhu Shiai: As an Internet giant, Alipay focuses on three major advantages in practical applications: real-time performance, privacy, and cost.

Firstly, in the era of large models, interactive scenarios such as voice assistants, streaming recognition, and real-time translation have increasingly strict latency requirements. Terminal-side inference avoids network transmission and cloud computing overhead, significantly improving response speed. Secondly, the development of Alipay's terminal-side AI originated from the traffic peak brought by the Spring Festival "Five Blessings" campaign. At that time, the cloud alone could hardly bear the load, which gave rise to the xNN engine and laid the foundation for terminal-side AI's status within Alipay. Now, in the era of large models, the computing pressure is even greater, and terminal-side AI can effectively absorb peak traffic and computing cost. Thirdly, personalized recommendation and real-time decision-making based on user behavior often involve sensitive data. Implementing these algorithms on the terminal side reduces data risk while improving the user experience.

Therefore, whether in the era of traditional AI or of large models, the advantages of real-time performance, privacy, and cost have laid the foundation for the development of terminal-side AI. Although the scale and computing power requirements of large models make terminal-side deployment more challenging than before, terminal-side AI should keep breaking through technological bottlenecks as generative AI develops.

Zhu Shiai: From "usable" to "good to use": What is the biggest challenge in the process of deploying terminal-side large models? What challenges are currently "choking" the development?

Zhang Wu: "Putting a large model into a mobile phone" is as difficult as "putting an elephant into a refrigerator." We can talk about the technical challenges of integrating large models into the terminal from the following dimensions:

  • Fit in: Mainstream flagship mobile phones still start at 8-12 GB of memory, so the primary problem when putting a large model onto the terminal is the memory footprint. Given the limited incremental application memory (about 500 MB) and the necessary large model parameter scale (0.5-1B), we must provide memory-slimming technologies that compress the model into a controllable range as much as possible. Currently, the CANN toolchain provides solutions such as low-bit quantization and Embedding in Flash (a rough sketch of the on-demand loading idea follows this list), which keep the model's actual memory footprint below 50% of its parameter memory.
  • Run fast: The core value of terminal-side AI lies in privacy protection and low latency. In the large model scenario, to give developers a fast on-device response experience, the NPU-affinity quantization algorithm provided by CANN offers mixed-bit quantization that fully exploits the NPU's computing power. The CANN toolchain also provides technologies such as Flash Attention and Prompt Cache to further reduce terminal-side inference time. Models around 1B in the open-source ecosystem can currently achieve a fast response experience of 1000 tokens per second.
  • Function generalization: Unlike the unified Python ecosystem on the cloud side, the software ecosystem of terminal-side AI is not yet unified. When migrating large models from the cloud to the terminal, developers have to complete the terminal-side adaptation process from scratch, and the debugging workload is considerable. The CANN toolchain has generalized its functions for the mainstream open-source models in the industry, currently supporting a series of open-source large models such as Qwen, Llama, and ChatGLM, helping developers quickly complete the journey from 0 to 1. We are also actively adapting and optimizing new large model technologies such as multi-modal, full-modal, and MoE. The Ascend C custom operator programming capability we provide allows developers to adjust operator computation and optimization strategies according to business needs. In the future, we will gradually open up the large model integration solution, providing corresponding model examples, instruction manuals, and adaptation code demos to help developers quickly integrate large model capabilities on the mobile phone side.
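
To make the "Embedding in Flash" idea from the first item above more concrete, here is a minimal sketch (my own illustration in Python, not the CANN implementation; file name, vocabulary size, and dimensions are assumptions) of keeping the token-embedding table on flash via a memory-mapped file and pulling in only the rows of the tokens actually being processed, so the full table never has to sit in RAM:

```python
import numpy as np

VOCAB, DIM = 150_000, 1_024  # assumed sizes; ~300 MB at fp16

# One-time export: write the fp16 embedding table to flash storage.
np.memmap("embeddings.fp16.bin", dtype=np.float16, mode="w+", shape=(VOCAB, DIM))[:] = 0

# At inference time, map the file instead of loading it: only the pages
# touched by the requested token ids are actually read into memory.
table = np.memmap("embeddings.fp16.bin", dtype=np.float16, mode="r", shape=(VOCAB, DIM))

def embed(token_ids):
    """Gather just the needed rows; the OS page cache keeps hot rows around."""
    return np.asarray(table[np.asarray(token_ids)], dtype=np.float32)

print(embed([101, 2045, 7592]).shape)  # (3, 1024), without holding the whole table in RAM
```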

Xu Mengwei: The integration of large models and the operating system is an interesting and promising topic. Although most technical personnel recognize this trend, its progress is still slow. Currently, there is a lack of mature results, and most are just visions for the future. We believe that large models will gradually sink to become system-level services of the operating system. As their role in terminals and intelligent hardware becomes more important, the operating system also needs to adapt to this change.

In the future, applications may gradually evolve into agents that call large models at the bottom layer. Large models may then account for 90% or more of the device's power consumption and memory footprint, which means the operating system needs to redefine resource management. For example, should the KV cache of large models be managed in the same way as app memory? A unified mechanism may lead to huge recompute overhead, and the system's existing Low Memory Killer strategy is not suited to this scenario.
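
To see why the KV cache is a resource the OS may need to manage explicitly, here is a quick estimate in Python (with assumed dimensions roughly in the 1B-parameter class, not measured values) of how its size grows with context length; evicting it frees memory much like killing an app, but restoring it means re-running prefill over the entire context:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value.
# Dimensions below are assumptions for a ~1B-parameter model, not measurements.

def kv_cache_mb(seq_len, layers=24, kv_heads=8, head_dim=128, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e6

for seq_len in (1_024, 8_192, 32_768):
    print(f"context {seq_len:>6}: ~{kv_cache_mb(seq_len):.0f} MB of KV cache")

# At a 32k context this is ~3 GB at fp16 -- comparable to an app's whole memory budget --
# so a plain Low Memory Killer policy would trade it for an expensive full recompute.
```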

Resource scheduling is also a challenge. Currently, the NPU architecture is relatively simple, lacking a flexible scheduling mechanism similar to that of the CPU. If multiple applications use the NPU simultaneously in the future, how to achieve isolation, preemption, and scheduling will be new problems. The academic community has just started researching these topics, while the industry currently focuses more on how to improve the inference performance first. However, with the increasing requirements for low power consumption and high efficiency on the terminal side, the highly general characteristics of large models will significantly enhance the value of the NPU.

In the era of traditional CNNs, the NPU was not fully utilized due to fragmentation and other issues, and many tasks could meet the performance requirements on the GPU or CPU. However, in the era of large models, the importance of the NPU will be significantly enhanced. In addition, the roles and cooperation models of mobile phone manufacturers and application manufacturers will also affect the development of the ecosystem, thereby influencing the research direction and result transformation of technologies.

Zhu Shiai: From an application perspective, the challenges terminal-side models face in actual business stem from the gap between their capabilities and expectations. Terminal-side capabilities must be scoped to tasks within a "controllable range," mainly inference tasks with reference input, such as summarization, translation, automatic speech recognition, and tool calls such as function calling and MCP. To solve these problems going forward, a "terminal-cloud combination" solution will still be needed, and this will also become an important technical direction.

From the perspective of the APP side, we also face some special problems that terminal manufacturers do not have, one of which is the deployment and distribution of models. Firstly, the installation package of the APP usually cannot be too large. However, even after low-bit quantization of the model, the scale may still reach several hundred megabytes. If such a large model is to be distributed to the mobile phone side, even without considering the occupation of other APPs, the memory pressure of the Alipay APP itself cannot be ignored. Whether it is possible to achieve instant access, loading, and initialization of the model while ensuring the user experience is an important challenge.

In response, Alipay is trying various solutions. Firstly, consistent with the approach of many terminal manufacturers, we prioritize lower-bit quantization to minimize the model size as much as possible. Secondly, we focus the model parameter scale in the range of roughly 0.5 billion to 1.5 billion. At the same time, in the terminal framework and engineering architecture, each business originally called the inference engine independently; this is now gradually evolving into a unified large-model runtime management framework, similar to a "terminal-side AI container."
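
The "terminal-side AI container" idea can be sketched as a process-wide runtime that owns a single loaded model and serializes requests from different businesses, instead of each feature loading its own engine copy. The Python sketch below is a hypothetical simplification of that design (class and file names are assumptions), not Alipay's actual framework:

```python
import threading

class OnDeviceModelRuntime:
    """Hypothetical shared runtime: one loaded model, many business callers."""
    _instance, _lock = None, threading.Lock()

    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None  # loaded lazily on first use to keep app startup fast

    @classmethod
    def shared(cls, model_path="qwen-0.5b-int4.bin"):  # path is an assumption
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(model_path)
            return cls._instance

    def generate(self, business_id: str, prompt: str) -> str:
        with self._lock:  # serialize access to the single on-device model
            if self.model is None:
                self.model = f"loaded({self.model_path})"  # placeholder for real init
            return f"[{business_id}] reply to: {prompt}"

# Different features share one instance instead of each holding its own model copy.
print(OnDeviceModelRuntime.shared().generate("translate", "hello"))
print(OnDeviceModelRuntime.shared().generate("summarize", "long article ..."))
```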

Zhu Shiai: In response to these challenges, what are the current breakthrough ideas and technical solutions in the industry? Please share some of the technical solutions and practical experiences you are using or think have the most potential.

Xu Mengwei: The memory capacity of mobile phones and other terminals is very limited and is unlikely to grow significantly in the short term, which will constrain the scale of terminal-side large models. However, I believe we can borrow the "pyramid" storage hierarchy of computers and break through the limitation with more refined storage management. Computers form a hierarchy from cache to main memory to external storage; this works because of temporal and spatial locality, which is what makes caching effective.

Large models have similar locality characteristics: some parameters are frequently activated, while others are rarely used. This may come from the MoE structure used in training, where the attention parameters are almost always activated while most experts are called relatively infrequently. Even for models trained without MoE, the activations of a standard Transformer also show a hot-and-cold distribution, with some parameters frequently active and others rarely so. Research also shows that this sparsity can be further enhanced through post-processing.

This sparsity can be combined with the storage hierarchy: keep the frequently activated parameters in memory and load the infrequently used parameters on demand. If the number of cold parameters is small enough, or the I/O is fast enough, loading can overlap with computation and adds almost no time overhead.
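
As a rough sketch of this hot/cold split (my own illustration, with assumed expert names, sizes, and a stand-in flash loader, not the speakers' actual systems), hot parameters stay pinned in RAM while cold expert weights are fetched from flash on demand and kept in a small LRU cache, ideally prefetched so the load overlaps with ongoing computation:

```python
from collections import OrderedDict
import numpy as np

def load_from_flash(name):
    """Stand-in for reading one expert's weights from flash storage."""
    return np.zeros((1024, 1024), dtype=np.float16)

class ExpertCache:
    """LRU cache over cold experts resident in RAM."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)        # hit: mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used expert
            self.cache[name] = load_from_flash(name)  # miss: fetch on demand
        return self.cache[name]

hot_params = {"attention": np.zeros((1024, 1024), dtype=np.float16)}  # always in RAM
experts = ExpertCache(capacity=4)
w = experts.get("expert_17")  # only the experts actually routed to are loaded
```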

This idea is not only applicable to the terminal side. On the cloud side or larger devices, the exchange between main memory and video memory is also a commonly used technology. Utilizing sparsity on the terminal is also very important, which is a key way to expand the scale of models that can run on the terminal side. The MoE structure is naturally suitable for terminal chips due to its lower computing power requirements.

Our team has been trying to combine model sparsity with the storage hierarchy to deploy models on the NPU more efficiently. Although it sounds straightforward, optimizing deployment on different NPUs is in fact still a hard problem. Last year, we published an SDOS paper introducing our work on achieving end-to-end NPU inference on a commercial mobile phone, using a Qualcomm chip at the time. This year, we are also researching the Huawei platform. The new SoC Qualcomm released in 2023 claims to be designed for generative AI, but we found there is still a large gap between its capabilities and what Transformer models need, such as the lack of support for dynamic shapes and group-level quantization. Traditional CNNs run well, but for large models there are still gaps in the hardware interface layer.

Therefore, we optimize from the perspective of algorithm-system co-design. Although we cannot solve the problems completely, we can alleviate some bottlenecks and improve efficiency. However, even after various optimizations, the decode stage is still often limited by memory bandwidth, and the ideal acceleration has not been achieved. We can combine speculative decoding and other means, but the system complexity increases accordingly. In general, to fully extract