NVIDIA and DeepSeek are collectively following suit. Ignored 18 months ago, the idea now dominates AI inference.
In 2024, the team of Jin Xin and Liu Xuanzhe from Peking University, together with the "Hao AI Lab" at the University of California, San Diego, proposed DistServe and with it the concept of decoupled (disaggregated) inference. In just over a year it evolved from a laboratory concept into an industry standard, adopted by mainstream large-model inference frameworks such as NVIDIA Dynamo and vLLM, a sign that AI is moving toward a new era of "modular intelligence."
If the "Moore's Law" suggests that computing power doubles every 18 months, the current rate of decline in large - model inference costs far exceeds the prediction of Moore's Law regarding the iteration speed of computing power.
This is not only due to the improvement in chip performance but, more importantly, to the self - evolution of the inference system. What accelerates this evolution is the concept of "decoupled inference" first proposed and implemented in the DistServe system.
The system was launched in March 2024 by institutions such as Peking University and UCSD, and put forward a simple yet bold idea:
Split the inference process of a large model into two stages, "prefill" and "decode," and let each be scheduled and scaled in its own independent pool of compute resources.
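To make the two-pool idea concrete, here is a minimal conceptual sketch in Python. The class names (PrefillWorker, DecodeWorker), the toy "KV cache," and the stop condition are illustrative assumptions, not DistServe's actual implementation.

```python
# Minimal conceptual sketch of prefill/decode disaggregation (not the DistServe code).
# "PrefillWorker", "DecodeWorker", and the toy KV-cache handoff are illustrative names.

from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: list[int]
    kv_cache: list[int] = field(default_factory=list)  # stands in for real KV tensors
    output_tokens: list[int] = field(default_factory=list)


class PrefillWorker:
    """Runs only the prompt-processing (prefill) phase and produces a KV cache."""

    def run(self, req: Request) -> Request:
        # A real engine would run one forward pass over the whole prompt here.
        req.kv_cache = [tok * 2 for tok in req.prompt_tokens]  # placeholder "KV cache"
        return req


class DecodeWorker:
    """Consumes a transferred KV cache and generates tokens one step at a time."""

    def step(self, req: Request) -> bool:
        # A real engine would run one autoregressive step reusing req.kv_cache.
        next_token = (sum(req.kv_cache) + len(req.output_tokens)) % 1000
        req.output_tokens.append(next_token)
        return len(req.output_tokens) < 8  # keep generating until a toy stop condition


if __name__ == "__main__":
    # The two pools can be sized and scheduled independently:
    prefill_pool = [PrefillWorker() for _ in range(2)]   # sized for TTFT
    decode_pool = [DecodeWorker() for _ in range(4)]     # sized for TPOT

    req = Request(request_id=0, prompt_tokens=[3, 1, 4, 1, 5])
    req = prefill_pool[0].run(req)          # phase 1: prefill on its own pool
    decoder = decode_pool[0]                # phase 2: KV cache "migrates" to a decode worker
    while decoder.step(req):
        pass
    print(req.output_tokens)
```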
Today, this decoupled inference architecture has been adopted by mainstream large-model inference frameworks and systems such as NVIDIA Dynamo, llm-d, vLLM, and MoonCake, and has begun to demonstrate its power in large-scale, real-world inference scenarios.
The "Hao AI Lab" is led by Assistant Professor Hao Zhang from the University of California, San Diego, who is also the recipient of the Google Machine Learning and Systems Young Faculty Award in 2025.
Assistant Professor Hao Zhang from the University of California, San Diego
In 2025, the "Hao AI Lab" team also received a DGX B200 system donated by NVIDIA to strengthen the AI research infrastructure.
The "Hao AI Lab" team received the DGX B200 system donated by NVIDIA
As the original designers of decoupled inference, Hao Zhang's team has detailed how the "prefill-decode disaggregation" architecture evolved from a research concept into a production system, and how decoupled inference will continue to evolve as large-model inference keeps scaling up.
From Co-located Deployment to Decoupled Inference
Before the emergence of DistServe, most inference frameworks adopted a "co-located deployment" approach:
That is, both the "prefill" and "decode" stages were executed simultaneously on the same GPU.
In each inference iteration, the scheduler packs as many user requests as possible into a batch, runs one round of computation, and then generates one output token for each request.
This technique, known as "continuous batching," was first proposed by Orca and later popularized by vLLM.
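The following toy loop illustrates iteration-level scheduling in that spirit; it is not the Orca or vLLM scheduler, and max_batch_size and max_new_tokens are made-up parameters. Requests join the running batch whenever slots free up, and every running request emits exactly one token per iteration.

```python
# A toy iteration-level ("continuous batching") scheduler, in the spirit of Orca/vLLM
# but not their actual code; names and the parameters are illustrative.

from collections import deque


def continuous_batching(waiting: deque, max_batch_size: int = 4, max_new_tokens: int = 5):
    running: list[dict] = []
    finished: list[dict] = []

    while waiting or running:
        # Admit new requests into the running batch whenever slots free up,
        # instead of waiting for the whole batch to finish (static batching).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One iteration: every running request produces exactly one new token.
        for req in running:
            req["generated"].append(f"tok{len(req['generated'])}")

        # Retire requests that hit their stop condition; their slots are reused next iteration.
        still_running = []
        for req in running:
            (finished if len(req["generated"]) >= max_new_tokens else still_running).append(req)
        running = still_running

    return finished


if __name__ == "__main__":
    reqs = deque({"id": i, "generated": []} for i in range(6))
    for r in continuous_batching(reqs):
        print(r["id"], r["generated"])
```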
This approach became the standard practice in the industry, but it has two fundamental limitations.
One is interference.
Since "prefill" and "decode" share the same GPU, their latencies will inevitably interfere with each other.
Even with mitigation measures such as "chunked prefill," a large prefill request can still inflate TPOT (time per output token) by 2 to 3 times, especially during sudden spikes in load.
As shown in the figure above (top), when prefill (orange) and decode (blue) are co-located, they interfere with each other, causing the decode stage to stall; in the figure above (bottom), when prefill and decode are separated onto different machines, they run smoothly without interference.
The other is coupled scaling.
In real-world production environments, enterprise-level applications usually treat TTFT (time to first token) and TPOT as the key user-experience latency indicators.
When prefill and decode are deployed on the same set of GPUs, the resource allocator must meet the latency requirements of both worst - case scenarios simultaneously.
This means that the system needs to over-reserve resources, resulting in low utilization of computing resources and poor overall efficiency.
As the deployment scale continues to expand and the latency requirements become more stringent, the costs associated with these two issues also increase significantly.
It is these real - world pain points that have driven the emergence of DistServe.
DistServe completely eliminates the interference between prefill and decode by splitting them into independent compute pools and, for the first time, enables independent scaling, so that each stage can meet its TTFT or TPOT target on its own while the system as a whole remains highly efficient.
When DistServe was first launched, the team of Hao Zhang believed that it would be a disruptive idea.
Unexpectedly, it was not widely adopted at first.
For most of 2024, the open-source community hesitated to adopt the idea, because it required a significant amount of engineering work to deeply refactor existing inference systems.
However, in 2025 the situation suddenly reversed: almost all mainstream large-model inference stacks now treat decoupling as the default.
Firstly, as more and more enterprises use large models as their core business components, "latency control" has become a crucial factor determining business growth and even survival.
DistServe precisely addresses this pain point: it makes the latencies of prefill and decode easy to observe and control, and it can be continuously optimized in real - world production environments.
Secondly, as the model size expands dramatically and the access traffic surges, the inference system must be scaled to hundreds or even thousands of GPUs to support these large and highly variable loads.
At this scale, the advantages of the "decoupled architecture" are fully evident: it can allocate resources independently for different stages and flexibly cooperate with various parallel strategies to achieve extremely high resource utilization.
Thirdly, "decoupling" means that the composability of the system architecture is greatly enhanced.
Decoupled Inference Today
Today, the once-radical architectural concept has become one of the main design principles for large-model inference.
Almost all production-level frameworks for large-model inference, from the orchestration layer and inference engine to the storage system and even emerging hardware architectures, have adopted the idea of decoupled inference in some form.
At the orchestration layer, the most representative one is NVIDIA Dynamo.
Schematic diagram of the NVIDIA Dynamo architecture
NVIDIA Dynamo is one of the most advanced and mature open-source, data-center-level distributed inference frameworks, designed specifically for P/D disaggregation.
In addition, llm-d, Ray Serve, and others are built on the decoupled inference architecture.
At the storage layer, LMCache, developed by a team from the University of Chicago, optimizes the P/D decoupling process by accelerating the movement of KV caches from prefill instances to decode instances.
Schematic diagram of the LMCache architecture
MoonCake, developed by the Kimi team at Moonshot AI, builds an LLM inference platform for P/D disaggregation around the core idea of a "centralized KV cache."
It abstracts the under-utilized storage media in the system into a centralized KV cache pool, enabling prefill instances to seamlessly transfer caches to decode instances across the cluster.
Schematic diagram of the MoonCake architecture
Today, LMCache and MoonCake have become the standard storage backends for large - scale LLM inference systems.
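As a rough mental model of such a centralized KV cache pool, the sketch below has prefill instances publish per-layer KV blocks into a shared store that decode instances later pull from. The KVCachePool class and its put/get methods are hypothetical; neither LMCache nor MoonCake exposes this API.

```python
# Illustrative sketch of a centralized KV-cache pool that prefill instances publish to
# and decode instances pull from. This is NOT the MoonCake or LMCache API; class and
# method names here are hypothetical.

from typing import Dict, List


class KVCachePool:
    """A shared pool keyed by (request_id, layer); real systems shard this across
    under-utilized DRAM/SSD and move data over fast interconnects."""

    def __init__(self) -> None:
        self._store: Dict[tuple, List[float]] = {}

    def put(self, request_id: str, layer: int, kv_block: List[float]) -> None:
        self._store[(request_id, layer)] = kv_block

    def get(self, request_id: str, layer: int) -> List[float]:
        return self._store[(request_id, layer)]


def prefill_instance(pool: KVCachePool, request_id: str, num_layers: int) -> None:
    # The prefill instance computes per-layer KV blocks and publishes them.
    for layer in range(num_layers):
        pool.put(request_id, layer, [0.1 * layer] * 4)  # placeholder KV block


def decode_instance(pool: KVCachePool, request_id: str, num_layers: int) -> int:
    # The decode instance fetches the KV blocks it needs and starts generating.
    blocks = [pool.get(request_id, layer) for layer in range(num_layers)]
    return sum(len(b) for b in blocks)  # stand-in for "tokens decoded with this cache"


if __name__ == "__main__":
    pool = KVCachePool()
    prefill_instance(pool, "req-42", num_layers=3)
    print(decode_instance(pool, "req-42", num_layers=3))
```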
At the core engine layer, almost all open-source LLM inference engines, such as SGLang and vLLM, natively support decoupled inference.
The Future of Decoupled Inference
The inference concept of "prefill-decode decoupling" gradually matured over 2025.
However, this is just the beginning.
In the long run, decoupling is not just an "architectural technique" but a deeper system philosophy:
Break the "computational monolith" structure in neural network inference and allow the system to freely recombine between computing, storage, and communication.
Both the academic and industrial communities are exploring various new directions to push the decoupled architecture towards the stage of "Generalized Disaggregated Inference."
Decoupling at the Computational Level
1. Attention-FFN Disaggregation
Previous P/D decoupling mainly separated the phases of context input and autoregressive output, while the internal structure of the model was still treated as an inseparable whole.
Now, researchers are beginning to attempt to refine the decoupling granularity at the model level.
In 2025, MIT CSAIL and DeepSeek Research proposed the "Attention-FFN Disaggregation" framework, which places the Transformer's attention module and its feed-forward network (FFN) on different compute nodes. Meanwhile, MegaScale-Infer, an Attention-FFN (A-F) disaggregation system launched by the team of Liu Xuanzhe and Jin Xin in 2025, has also been widely deployed in industry.
This architecture allows different nodes to leverage the advantages of heterogeneous hardware, for example placing the memory-bandwidth-bound attention (which must stream the KV cache at every decode step) on nodes with large, fast memory, and the compute-bound FFN or MoE layers on compute-dense accelerators.
This means that future inference systems may no longer have each node running a complete copy of the model; instead, each node runs a functional sub-module of the model.
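A purely illustrative sketch of this split is shown below: one "node" owns the attention sub-module of a layer, another owns the FFN, and activations cross a (simulated) network boundary between them. The class names and the send_over_network placeholder are assumptions; this is not the MegaScale-Infer code.

```python
# Conceptual sketch of Attention-FFN disaggregation: the attention sub-module and the
# FFN sub-module of one Transformer layer live on different "nodes" and exchange
# activations over a (here simulated) network link. Purely illustrative.

import numpy as np


class AttentionNode:
    """Memory-bandwidth-bound part: attention over the cached context."""

    def __init__(self, d_model: int) -> None:
        self.w_qkv = np.random.randn(d_model, 3 * d_model) * 0.02

    def forward(self, x: np.ndarray) -> np.ndarray:
        q, k, v = np.split(x @ self.w_qkv, 3, axis=-1)
        scores = (q @ k.T) / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v


class FFNNode:
    """Compute-bound part: the feed-forward network (often MoE in practice)."""

    def __init__(self, d_model: int, d_hidden: int) -> None:
        self.w1 = np.random.randn(d_model, d_hidden) * 0.02
        self.w2 = np.random.randn(d_hidden, d_model) * 0.02

    def forward(self, x: np.ndarray) -> np.ndarray:
        return np.maximum(x @ self.w1, 0.0) @ self.w2


def send_over_network(tensor: np.ndarray) -> np.ndarray:
    # Placeholder for the activation transfer between the two nodes.
    return tensor.copy()


if __name__ == "__main__":
    attn, ffn = AttentionNode(d_model=16), FFNNode(d_model=16, d_hidden=64)
    tokens = np.random.randn(8, 16)                      # 8 tokens, hidden size 16
    attn_out = attn.forward(tokens)                      # runs on the attention node
    ffn_out = ffn.forward(send_over_network(attn_out))   # runs on the FFN node
    print(ffn_out.shape)
```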
2. Pipeline Disaggregation
Another natural extension of the decoupled architecture is cross-layer pipeline decomposition. Several research teams have proposed frameworks along these lines, such as:
- The "DisPipe" system from Stanford DAWN;
- "HydraPipe" from Meta AI;
- "PipeShard" from Alibaba DAI - Lab.
These systems all attempt to let the inference process flow between different nodes in a "stage-stream" manner, thereby achieving globally pipelined inference.
This approach enables different stages of computation to use different types of accelerators, making it more suitable for future multi - chip heterogeneous systems.
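The sketch below illustrates the stage-stream idea under simple assumptions: layers are grouped into named stages, each tagged with a hypothetical device, and micro-batches flow through them. Real systems overlap the stages asynchronously; this toy version runs them sequentially and is not DisPipe, HydraPipe, or PipeShard code.

```python
# Illustrative sketch of pipeline disaggregation: model layers are grouped into stages,
# each stage could live on a different (possibly heterogeneous) accelerator, and
# micro-batches stream through them. Stage/accelerator names are hypothetical.

from typing import Callable, List


class PipelineStage:
    def __init__(self, name: str, device: str, fn: Callable[[float], float]) -> None:
        self.name, self.device, self.fn = name, device, fn

    def run(self, x: float) -> float:
        # In a real system this dispatches to the stage's own accelerator.
        return self.fn(x)


def run_pipeline(stages: List[PipelineStage], micro_batches: List[float]) -> List[float]:
    # Stage-stream execution: each micro-batch flows stage by stage; with real async
    # transports, stage i works on micro-batch n while stage i+1 handles micro-batch n-1.
    outputs = []
    for mb in micro_batches:
        for stage in stages:
            mb = stage.run(mb)
        outputs.append(mb)
    return outputs


if __name__ == "__main__":
    stages = [
        PipelineStage("embedding + layers 0-15", "GPU-A", lambda x: x + 1.0),
        PipelineStage("layers 16-31", "GPU-B", lambda x: x * 2.0),
        PipelineStage("lm_head", "low-cost accelerator", lambda x: x - 0.5),
    ]
    print(run_pipeline(stages, micro_batches=[0.0, 1.0, 2.0]))
```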
Cross-Modal and Multi-Model Decoupling
1. Modal Decomposition
With the emergence of multi-modal large models, inference systems face more complex resource orchestration problems; putting all modalities into a single inference process significantly reduces resource utilization.
Therefore, the future trend is to decouple multi-modal inference into multiple modal sub-inference streams and perform asynchronous fusion through the scheduler at the orchestration layer.
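A hedged sketch of that orchestration pattern, using Python's asyncio: each modality runs as its own sub-inference stream and the scheduler fuses whatever arrives. The encoder functions and their sleep-based "latencies" are invented for illustration.

```python
# Hedged sketch of modal decomposition: each modality runs as its own sub-inference
# stream (potentially on its own pool of accelerators) and an orchestration-layer
# scheduler fuses the results asynchronously. Encoder names and latencies are made up.

import asyncio


async def vision_stream(image_id: str) -> str:
    await asyncio.sleep(0.03)          # stands in for a vision-encoder pool
    return f"vision_embedding({image_id})"


async def audio_stream(clip_id: str) -> str:
    await asyncio.sleep(0.05)          # stands in for an audio-encoder pool
    return f"audio_embedding({clip_id})"


async def text_stream(prompt: str) -> str:
    await asyncio.sleep(0.01)          # stands in for tokenization / text prefill
    return f"text_embedding({prompt!r})"


async def orchestrate(prompt: str, image_id: str, clip_id: str) -> str:
    # The scheduler launches the modal sub-streams concurrently and fuses whatever
    # arrives, instead of serializing everything inside one inference process.
    vision, audio, text = await asyncio.gather(
        vision_stream(image_id), audio_stream(clip_id), text_stream(prompt)
    )
    return f"fused({text}, {vision}, {audio})"


if __name__ == "__main__":
    print(asyncio.run(orchestrate("describe this scene", "img-1", "clip-1")))
```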
2. Multi-Model Collaboration
Running multiple LLMs or dedicated sub - models simultaneously in an inference system has become common, and these architectures are naturally suitable for decoupled design.
Decoupling of Memory and Cache Systems
Current decoupled systems still rely on a "centralized KV cache pool" or a "shared SSD cluster." A future research direction is multi-layer decoupling and autonomous scheduling of the cache system itself.
1. Hierarchical Cache Architecture
Researchers from MIT and ETH Zürich proposed the HiKV (Hierarchical KV Cache) framework, which divides the KV cache into three levels:
- L1: Local GPU cache;
- L2: Node-shared cache;
- L3: Distributed persistent cache.
The system automatically migrates KV segments based on context popularity, making the memory management of decoupled inference more flexible.
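The toy cache below mimics that three-tier description: segments start in the bottom tier and are promoted toward the GPU as their access count grows. The capacities, the hit-count thresholds, and the inclusive-copy behavior are illustrative choices, not HiKV's actual policy.

```python
# Toy three-tier KV-cache hierarchy in the spirit of the HiKV description above
# (L1 GPU, L2 node-shared, L3 distributed/persistent). Capacities, the popularity
# counter, and the promotion rule are all illustrative, not the paper's policy.

class TieredKVCache:
    def __init__(self, l1_capacity: int = 2, l2_capacity: int = 4) -> None:
        self.l1, self.l2, self.l3 = {}, {}, {}       # segment_id -> KV payload
        self.l1_capacity, self.l2_capacity = l1_capacity, l2_capacity
        self.hits = {}                                # segment_id -> access count

    def insert(self, segment_id: str, payload: bytes) -> None:
        self.l3[segment_id] = payload                 # cold data starts in the bottom tier

    def get(self, segment_id: str) -> bytes:
        self.hits[segment_id] = self.hits.get(segment_id, 0) + 1
        for tier in (self.l1, self.l2, self.l3):      # search from fastest to slowest
            if segment_id in tier:
                payload = tier[segment_id]
                self._maybe_promote(segment_id, payload)
                return payload
        raise KeyError(segment_id)

    def _maybe_promote(self, segment_id: str, payload: bytes) -> None:
        # Popular segments migrate toward the GPU; evicted ones fall back one tier.
        # Tiers are inclusive here: lower tiers may keep stale copies for simplicity.
        if self.hits[segment_id] >= 3 and segment_id not in self.l1:
            if len(self.l1) >= self.l1_capacity:
                victim, data = self.l1.popitem()
                self.l2[victim] = data
            self.l1[segment_id] = payload
        elif self.hits[segment_id] == 2 and segment_id not in self.l2 and segment_id not in self.l1:
            if len(self.l2) >= self.l2_capacity:
                victim, data = self.l2.popitem()
                self.l3[victim] = data
            self.l2[segment_id] = payload


if __name__ == "__main__":
    cache = TieredKVCache()
    cache.insert("ctx-A", b"kv-bytes")
    for _ in range(4):
        cache.get("ctx-A")
    print("ctx-A now in L1:", "ctx-A" in cache.l1)
```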
2. Storage-Computing Collaboration
Some hardware manufacturers have begun to explore chips that natively support the decoupled architecture, which means that future "decoupled inference" will not only be a software architecture issue but will evolve into an integrated hardware-software system.
Towards Modular Intelligence
Some research teams, such as Google Brain Zürich and FAIR, have put forward a bolder idea:
Since inference can be decoupled, can training and continuous learning also be decoupled?
They divide the learning process of the model into multiple independent sub - tasks, each running on different hardware, and achieve cross - task communication through a shared gradient cache and a semantic router.
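The following highly schematic sketch mirrors that prose only: a semantic router assigns examples to sub-task learners, gradients accumulate in a shared cache, and each learner later consumes them. Every class name and the keyword-based routing rule are hypothetical; no published system is being reproduced here.

```python
# Highly schematic sketch of the "decoupled learning" idea described above: independent
# sub-task learners, a shared gradient cache, and a semantic router. All names and the
# routing rule are hypothetical; this mirrors the prose, not any real system.

from collections import defaultdict


class SemanticRouter:
    """Routes each training example to a sub-task by a crude keyword match."""

    def __init__(self, topics: dict[str, list[str]]) -> None:
        self.topics = topics

    def route(self, text: str) -> str:
        for task, keywords in self.topics.items():
            if any(kw in text.lower() for kw in keywords):
                return task
        return "general"


class SubTaskLearner:
    """Owns one sub-task; in the prose each of these would run on different hardware."""

    def __init__(self, name: str) -> None:
        self.name, self.weight = name, 0.0

    def compute_gradient(self, example: str) -> float:
        return float(len(example)) * 0.01             # placeholder gradient

    def apply(self, grad: float) -> None:
        self.weight -= 0.1 * grad                     # toy SGD step


if __name__ == "__main__":
    router = SemanticRouter({"code": ["python", "bug"], "math": ["integral", "proof"]})
    learners = {t: SubTaskLearner(t) for t in ["code", "math", "general"]}
    gradient_cache: dict[str, list[float]] = defaultdict(list)   # the shared cache

    for example in ["fix this python bug", "compute the integral", "hello world"]:
        task = router.route(example)
        gradient_cache[task].append(learners[task].compute_gradient(example))

    # Cross-task communication: each learner consumes gradients from the shared cache.
    for task, grads in gradient_cache.items():
        for g in grads:
            learners[task].apply(g)
    print({t: round(l.weight, 4) for t, l in learners.items()})
```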
This concept, known as "decoupled learning," is regarded as a potential key path to solving the problems of "catastrophic forgetting" and "continuous adaptation" in large models:
Currently, the internal project