While everyone waits for V4, DeepSeek's new paper reveals a new framework that accelerates agent inference with idle network cards and breaks the PD-disaggregation bottleneck.
DeepSeek is really smart. While the whole world watches its GitHub repository, waiting for V4 -
It quietly published a paper on arXiv with Peking University and Tsinghua University, introducing a brand-new inference framework for agents: DualPath.
What's more, it ties into the computing-power topic that surfaced a few days ago.
The core of DualPath is solving the I/O bottleneck in long-context agent inference: by speeding up the loading of KV-Cache from external storage, it keeps compute resources from being dragged down by storage reads.
It replaces the traditional single-path Storage-to-Prefill loading mode with a second path: Storage-to-Decode.
By using the idle bandwidth of the decode engine's Storage Network Interface Card (SNIC) to read the cache and forwarding it to the prefill engine over the high-speed compute network (RDMA), DualPath achieves global pooling and dynamic load balancing of the cluster's storage bandwidth.
In tests on a production-grade model at 660B scale, DualPath delivered striking results:
Offline inference throughput rose by up to 1.87x, and online serving throughput rose by 1.96x on average.
Under high load, Time To First Token (TTFT) improved significantly, while time per output token (TPOT) was barely affected.
Let's take a closer look.
Dual-Path Loading
Broadly speaking, DualPath is an inference framework designed specifically for agent systems. Its core insight is -
KV-Cache loading doesn't have to revolve around prefill.
The conventional assumption is that whoever does the computing moves the data. DualPath instead loads the cache into the decode engine first, then forwards it to the prefill engine over the high-performance RDMA network.
By dynamically choosing between the two paths, DualPath redistributes network load and relieves bandwidth pressure on the prefill side.
So why go to all this trouble to take a "detour"?
The reason: today's agent applications involve many dialogue rounds and long contexts, so the KV-Cache hit rate typically exceeds 95%.
This means a huge volume of "old memories" must be moved in every round of dialogue, and the inference bottleneck has shifted from computation to data movement.
In the existing Prefill-Decode disaggregated (PD-disaggregated) architecture, all loading tasks crowd onto the prefill engine's (PE) storage NIC, saturating its bandwidth instantly;
Meanwhile, the decode engine's (DE) storage NIC sits idle: a serious resource mismatch.
Furthermore, GPU compute is growing much faster than network bandwidth and HBM capacity, which further aggravates the I/O limitation.
As NVIDIA chief scientist Bill Dally and Google's Jeff Dean have repeatedly emphasized: computation is free, data movement is expensive.
To address these problems, DualPath builds an innovative dual-path model:
- Path A (traditional): Storage → PE. The cache is read directly into the prefill engine.
- Path B (new): Storage → DE → PE. The cache is first read into the decode engine's buffer pool, then forwarded to the prefill engine over RDMA.
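As a rough illustration of how a scheduler might pick between the two paths, here is a minimal Python sketch. The metrics (`snic_util`, `queued_tokens`) and the tie-breaking rule are hypothetical placeholders, not the paper's actual policy:

```python
# Hypothetical dual-path selection sketch; field names and the
# comparison rule are illustrative, not taken from the DualPath paper.
from dataclasses import dataclass

@dataclass
class NodeStats:
    snic_util: float    # storage-NIC utilization, 0.0 to 1.0
    queued_tokens: int  # tokens waiting to be processed on this node

def choose_path(pe: NodeStats, de: NodeStats) -> str:
    """Route a KV-Cache load over the less pressured storage NIC.

    Path A: Storage -> PE (traditional)
    Path B: Storage -> DE -> PE over RDMA (newly added)
    """
    # Prefer the side whose storage NIC has more headroom; break ties
    # by queued work so one side's compute is not overloaded.
    score_a = (pe.snic_util, pe.queued_tokens)
    score_b = (de.snic_util, de.queued_tokens)
    return "A" if score_a <= score_b else "B"

# A saturated prefill-side NIC pushes the load onto Path B.
print(choose_path(NodeStats(0.9, 5000), NodeStats(0.2, 1000)))  # -> B
```

The point of the sketch is only the shape of the decision: when the prefill-side SNIC saturates, the otherwise idle decode-side SNIC absorbs the load.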
In terms of architecture:
- Inference engines: each engine manages a GPU and is strictly split into prefill (PE) and decode (DE) roles.
- Traffic manager: handles H2D/D2H copies, inter-engine transfers, and SNIC storage reads and writes.
- Central scheduler: the "brain" that decides in real time which path each request takes, maximizing global bandwidth utilization.
Core Technical Solution: the Storage-to-Decode Path
As noted above, the core of the DualPath inference system is breaking the traditional single-path "Storage-to-Prefill" mode by introducing a "Storage-to-Decode" path.
This design lets the KV-Cache load into the decode engine (DE) first, then transfer losslessly to the prefill engine (PE) over the high-bandwidth compute network (RDMA).
By dynamically distributing load across the two paths, the system unlocks the decode-side storage NIC (SNIC) bandwidth that previously sat idle in the cluster, building a globally schedulable pool of storage I/O resources.
Specifically, to support layer-wise streaming, DualPath allocates small DRAM buffers (PE/DE Buffer) on both PE and DE and defines a detailed data flow for each stage:
- PE read path: the KV-Cache of hit tokens is read from storage into the PE buffer. Before each layer's computation, that layer's cache is copied to PE HBM, overlapping with computation. After computation finishes, the full KV-Cache is sent back to the DE buffer to form a complete context.
- DE read path: the KV-Cache goes directly into the DE buffer. During prefill on the PE, each layer's cache is transferred across nodes to PE HBM (again overlapping with computation). After computation, the PE only sends back the newly generated KV-Cache segment, which is merged with the original cache on the DE.
- Decoding and persistence: once the DE buffer holds the complete KV-Cache, decoding starts with an H2D copy, after which the CPU memory is released. Although the buffer adds DRAM pressure, it significantly reduces GPU memory usage and improves Time To First Token (TTFT). During generation, asynchronous persistence is triggered every time a block (e.g., 64 tokens) accumulates.
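The layer-wise overlap in both read paths amounts to double buffering: while layer i computes, layer i+1's cache is already in flight to HBM. A minimal sketch, assuming hypothetical `fetch_layer_cache` and `compute_layer` callbacks (these names are placeholders, not from the paper):

```python
# Illustrative double-buffering sketch of layer-wise transfer/compute
# overlap. The callback names are hypothetical placeholders.
import threading

def prefill_with_overlap(num_layers, fetch_layer_cache, compute_layer):
    """Stream KV-Cache layer by layer from the DRAM buffer to HBM,
    overlapping the copy of layer i+1 with the computation of layer i."""
    if num_layers == 0:
        return
    # Start fetching layer 0 before any computation begins.
    pending = threading.Thread(target=fetch_layer_cache, args=(0,))
    pending.start()
    for layer in range(num_layers):
        pending.join()  # this layer's cache is now resident
        if layer + 1 < num_layers:
            # Kick off the next layer's copy in the background.
            pending = threading.Thread(
                target=fetch_layer_cache, args=(layer + 1,))
            pending.start()
        compute_layer(layer)  # runs while the next copy is in flight
```

In the real system the "fetch" is an RDMA or H2D transfer issued on a separate stream rather than a Python thread, but the control flow, fetch ahead, join, compute, is the same.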
But as noted earlier, "detour" loading creates a new problem: what happens when cache-movement traffic collides with the communication of model computation?
DualPath offers two optimizations:
First, traffic management centered on the Compute Network Interface Card (CNIC): all traffic is forced onto the GPUDirect RDMA path through the paired CNIC.
On InfiniBand or RoCE networks, Virtual Lane / Traffic Class (VL/TC) technology marks inference communication as highest priority and reserves 99% of the bandwidth for it, so cache movement can only "steal" bandwidth in the gaps, guaranteeing no interference.
Second, an adaptive request scheduler: the scheduler monitors each node's disk queue length and token count, and preferentially assigns tasks to nodes with lighter I/O pressure and compute load, fundamentally avoiding congestion on any single NIC or compute node.
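The scheduler's node-selection rule can be sketched in a few lines. The two monitored signals (disk queue length, queued tokens) come from the text above; the lexicographic priority is an assumption for illustration, not necessarily the paper's exact policy:

```python
# Hypothetical node-selection rule for the adaptive scheduler.
# Prioritizing I/O pressure over compute load is an illustrative
# assumption; the paper's actual policy may differ.
def pick_node(nodes):
    """nodes: list of (name, disk_queue_len, queued_tokens) tuples.

    Prefer the node with the least I/O pressure (shortest disk queue),
    breaking ties by the lightest compute load (fewest queued tokens).
    """
    return min(nodes, key=lambda n: (n[1], n[2]))[0]

# The node with the short disk queue wins, despite both being live.
print(pick_node([("PE-1", 12, 8000), ("DE-2", 3, 2000)]))  # -> DE-2
```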
In experiments, DualPath was evaluated on models such as DeepSeek-V3 and Qwen, covering offline rollout and online serving scenarios.
As noted at the start, DualPath raised end-to-end offline throughput by up to 1.87x and online serving throughput by 1.96x on average, while significantly reducing Time To First Token (TTFT) and keeping token-to-token latency (TBT) extremely stable.
In short, DualPath shows that rethinking the data-loading path can effectively break through the current I/O wall of large-model inference.
It puts the decode engine's previously wasted I/O bandwidth to work and, with adaptive scheduling and strict traffic isolation, significantly improves the efficiency of agent LLM inference systems without added hardware cost.
One more thing
The paper's first author, Wu Yongtong, is a doctoral student at Peking University advised by Professor Jin Xin.
His research focuses on systems software and large-model infrastructure (LLM Infrastructure), especially the engineering optimization and large-scale deployment of inference systems.
He is currently with the DeepSeek systems group, helping build the inference infrastructure for the next-generation model and leading performance optimization of large-scale software systems across multiple hardware platforms.
He previously interned at Tencent, the University of Washington, and Microsoft Research Asia.
Reference Links
[1] https://arxiv.org/pdf/2602.21548
[2] https://jokerwyt.github.io/
This article comes from the WeChat official account "Quantum Bit" (author: henry) and is published by 36Kr with authorization.