DeepSeek releases next-generation inference technology, with a significant contribution from a Peking University intern.
DeepSeek has found a new way to break through the inference bottleneck of large models!
According to a Zhidx report on February 27, DeepSeek yesterday released a new inference system design called DualPath, which targets a key weakness of today's large language models in agent applications: the I/O bottleneck of KV-cache storage. By introducing a dual-path loading mechanism, the design substantially raises system throughput and essentially eliminates KV-cache I/O overhead.
The core innovation of DualPath is a new channel running directly from storage to the decoding engine. The KV cache is no longer loaded only by the pre-filling engine; it can instead be loaded into the decoding engine and then forwarded efficiently to the pre-filling side over RDMA on the compute network. This design relieves pressure on the storage side, avoids network congestion, and keeps latency-sensitive tasks free from interference.
Working in concert with the global scheduler, DualPath dynamically balances the load on both sides, further improving resource utilization. In tests on real agent workloads, DualPath raises offline inference throughput by up to 1.87x and online serving throughput by an average of 1.96x.
At scale, the system has been validated on up to 1,152 GPUs, where offline inference scales nearly linearly from a 2P4D deployment (2K agents) to 48P96D (48K agents) with task completion time essentially unchanged.
Notably, as with many of DeepSeek's earlier papers, the first author, Wu Yongtong, is a DeepSeek intern. Wu is a doctoral student at Peking University, advised by Professor Jin Xin, and works mainly on large-model infrastructure. He has been with DeepSeek's systems group since August 2025 and contributed to the research behind DeepSeek-V3.2.
01 Agent workloads expose a pronounced I/O bottleneck that traditional designs address only at high cost
As agent applications spread, multi-round inference has become the norm. Agents interact with the external environment through tools for dozens or even hundreds of rounds, and context accumulates to extreme lengths across rounds. Because each round appends only a short segment, the KV-cache hit rate exceeds 95%, and loading efficiency, rather than computation, now dominates performance.
Existing systems adopt hierarchical pre-filling, pre-filling/decoding separation (PD separation), and external KV-cache storage. The problem is that the storage NIC bandwidth of the pre-filling engines stays saturated while much of the storage NIC bandwidth of the decoding engines sits idle. This imbalance exposes a fundamental inefficiency, uneven utilization of storage network bandwidth, and simply adding bandwidth on the pre-filling side is expensive.
▲ Existing bottlenecks (left) and DualPath (right)
DualPath was proposed to solve these problems. Its core insight is to break with the traditional design in which KV-cache loading must be centered on pre-filling.
Existing systems load through a single path from storage to the pre-filling engine, saturating bandwidth on the pre-filling side while bandwidth on the decoding side idles. DualPath adds a path from storage to decoding: it first loads the KV cache into the idle decoding engine and then transfers it efficiently to the pre-filling engine over RDMA.
This design aggregates the bandwidth of all storage NICs and redistributes network load, fundamentally easing the I/O bottleneck on the pre-filling side.
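The effect of aggregating both sides' storage NICs can be seen with a back-of-envelope model. All bandwidth and size figures below are illustrative assumptions, not numbers from the paper:

```python
# Hypothetical single-path vs dual-path KV-cache loading times.
# All numbers are illustrative assumptions, not figures from the paper.

def load_time_s(kv_bytes: float, bandwidth_gbps: float) -> float:
    """Time to load a KV cache at a given aggregate bandwidth (GB/s)."""
    return kv_bytes / (bandwidth_gbps * 1e9)

prefill_nic_gbps = 25.0          # storage NIC bandwidth per pre-filling engine (assumed)
decode_nic_gbps = 25.0           # storage NIC bandwidth per decoding engine (assumed)
num_prefill, num_decode = 2, 4   # a 2P4D deployment

kv_bytes = 200e9                 # 200 GB of cached KV to load (assumed)

# Single path: only pre-filling-side storage NICs carry KV traffic.
single = load_time_s(kv_bytes, prefill_nic_gbps * num_prefill)

# Dual path: decoding-side storage NICs also load KV, then forward it
# to the pre-filling side over the much faster compute network via RDMA.
dual = load_time_s(kv_bytes, prefill_nic_gbps * num_prefill
                             + decode_nic_gbps * num_decode)

print(f"single-path: {single:.1f}s  dual-path: {dual:.2f}s  "
      f"speedup: {single / dual:.2f}x")
```

With these assumed numbers, the dual path triples the usable storage bandwidth; the real gain depends on the P/D ratio and per-NIC speeds of the deployment.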
This design still faces two major challenges. First, the extra loading path creates complex traffic patterns and may interfere with the collective-communication primitives used in model execution; left unmanaged, it would degrade overall performance.
Second, under dynamic, heterogeneous workloads the system must decide online which loading path each request should use, while keeping both GPUs and NICs load-balanced.
02 Three core components make up DualPath, and analysis shows the new path introduces no bottlenecks
So how did DeepSeek solve these challenges? DualPath builds on two widely used techniques:
(1) PD separation, which splits prompt processing from decoding to improve efficiency.
(2) Hierarchical pre-filling, which avoids the HBM bottleneck on the pre-filling engine and improves GPU utilization.
DualPath consists of three core components. The inference engine is the basic execution unit: each engine manages one GPU and is designated either a pre-filling engine, dedicated to pre-fill computation, or a decoding engine, responsible for decoding and generation.
The traffic manager is embedded in each engine and coordinates all data movement, including memory copies between host and device, KV-cache transfers between pre-filling and decoding engines, and persistent KV-cache reads and writes through the storage NIC. It follows a traffic-management strategy centered on the compute NIC so that KV-cache traffic does not interfere with the model's latency-sensitive collective communication.
The request scheduler is the central decision-maker. It receives client requests and distributes them across engines, dynamically deciding whether each request takes the traditional storage-to-pre-filling path or the new storage-to-decoding path, balancing traffic between the two paths and optimizing load globally.
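The article does not give the scheduler's exact decision rule; a minimal sketch of the kind of heuristic it describes (comparing queued tokens and disk queue depth on the two paths, with a made-up penalty weight) might look like:

```python
from dataclasses import dataclass

@dataclass
class EngineStats:
    queued_tokens: int   # tokens already queued on this engine's storage path
    disk_queue_len: int  # outstanding reads on its storage NIC

def choose_path(prefill: EngineStats, decode: EngineStats,
                queue_weight: float = 1000.0) -> str:
    """Pick the loading path with the lower estimated cost.

    Hypothetical cost model: queued tokens plus a penalty per outstanding
    disk read. The paper balances token counts and disk queue length; the
    exact weights are not published, so queue_weight is an assumption.
    """
    cost_p = prefill.queued_tokens + queue_weight * prefill.disk_queue_len
    cost_d = decode.queued_tokens + queue_weight * decode.disk_queue_len
    return "storage->prefill" if cost_p <= cost_d else "storage->decode"

# Example: the pre-filling path is saturated, so the request is routed
# through the idle decoding engine instead.
path = choose_path(EngineStats(500_000, 32), EngineStats(40_000, 2))
print(path)
```

In the real system this decision is made per request against live NIC and queue telemetry; the sketch only shows the shape of the trade-off.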
In the implementation, DualPath reserves a small DRAM buffer on every pre-filling and decoding engine. On the pre-filling read path, the KV cache of hit tokens is first read from storage into the pre-filling engine's buffer and then streamed layer by layer into the pre-filling engine's HBM, overlapping with KV computation for the non-hit tokens. The complete prompt KV is then sent to the decoding engine's buffer for use in the decoding stage.
On the decoding read path, hit KVs are first loaded into the decoding engine's buffer. While the pre-filling engine runs pre-fill, they are read layer by layer over RDMA and overlapped with computation. Once the non-hit KVs have been computed, they are sent back to the decoding engine and merged with the hit KVs into a complete prompt cache.
On either path, data transfer proceeds in a layered, streaming fashion, relieving HBM capacity pressure and overlapping computation with communication. Before decoding begins, the decoding engine moves the complete KV from its buffer into HBM, releasing the CPU memory once the host-to-device copy finishes. During generation, each time a fixed-size block of tokens accumulates, it is immediately persisted to storage.
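The layer-by-layer overlap above can be illustrated with a toy timing model (layer count and per-layer times are assumptions; a real system overlaps via CUDA streams and RDMA, not a loop like this):

```python
# Toy model of layer-wise streaming: loading layer i+1's KV overlaps
# with computing layer i, instead of loading everything first.

def streamed_prefill(num_layers: int, load_ms: float, compute_ms: float) -> float:
    """Total time when each layer's KV load overlaps the previous layer's compute."""
    t_load_done = 0.0     # when the next layer's KV becomes available
    t_compute_done = 0.0  # when compute finishes the previous layer
    for _ in range(num_layers):
        t_load_done += load_ms                         # loads serialize on the NIC
        t_compute_done = max(t_compute_done, t_load_done) + compute_ms
    return t_compute_done

layers = 61                                  # assumed layer count
serial = layers * (3.0 + 5.0)                # load fully, then compute: no overlap
overlapped = streamed_prefill(layers, load_ms=3.0, compute_ms=5.0)
print(f"serial: {serial:.0f}ms  overlapped: {overlapped:.0f}ms")
```

When per-layer load time is below per-layer compute time, almost the entire loading cost hides behind computation, which is the point of the streaming design.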
To verify that this architecture introduces no new bottlenecks, the paper systematically analyzes compute NIC and DRAM bandwidth. Modeling the traffic between each pre-filling/decoding engine pair, and assuming balanced load and no network congestion, the authors show that within a certain range of P/D (the ratio of pre-filling to decoding nodes), neither the compute NIC, PCIe, nor DRAM becomes a bottleneck.
Under a typical configuration (e.g., 8 GPUs per node and storage bandwidth far below compute bandwidth), the feasible P/D range covers most real deployment ratios, meaning the system can fully exploit the bandwidth of all storage NICs while compute and memory resources remain stable.
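The flavor of that feasibility argument can be sketched for one link, the pre-filling-side compute NIC. The check, the bandwidth figures, and the headroom fraction below are all assumptions for illustration, not the paper's actual model:

```python
# Hypothetical feasibility check: KV traffic forwarded from decoding
# engines must fit in the compute NIC bandwidth left over after model
# collectives. Bandwidths are assumed, not taken from the paper.

def compute_nic_ok(p: int, d: int, storage_gbps: float,
                   compute_gbps: float, headroom: float = 0.5) -> bool:
    """Each decoding engine loads KV at storage_gbps and forwards it over
    the compute network; with D decoders feeding P prefill engines, each
    prefill compute NIC absorbs (D/P) * storage_gbps of KV traffic, and
    must keep `headroom` of its bandwidth free for model communication."""
    forwarded_per_prefill = (d / p) * storage_gbps
    return forwarded_per_prefill <= compute_gbps * (1.0 - headroom)

storage_gbps, compute_gbps = 25.0, 400.0   # assumed per-engine NIC speeds
for p, d in [(2, 4), (48, 96), (1, 16)]:
    print(p, d, compute_nic_ok(p, d, storage_gbps, compute_gbps))
```

Because storage NICs are far slower than compute NICs, the constraint only binds at extreme D/P ratios, which matches the article's claim that the feasible range covers most deployments.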
03 Implementation still faces three major challenges, met with compute-NIC-centric traffic management
Implementing the dual-path architecture in a real system still faces three core challenges. The first is fine-grained data transfer: layered execution relieves HBM capacity pressure but splits the KV into many small blocks, which must move efficiently between storage, host DRAM, and GPU HBM with low software and hardware overhead.
The second is traffic isolation: the new KV transfers can interfere with latency-sensitive collective communication in model execution (such as AllToAll and ReduceScatter/AllGather); without an isolation mechanism, end-to-end inference latency rises directly.
The third is dynamic load balancing: with two read paths in the system, the scheduler must decide dynamically based on disk queue length, GPU load, and request characteristics, or local bottlenecks quickly re-form.
To prevent KV transfers from interfering with model communication, the system adopts a traffic-management mechanism centered on the compute NIC. All traffic into and out of the GPU, including H2D/D2H copies, goes through the compute NIC paired with that GPU via GPUDirect RDMA, so every data stream converges on the compute network and hardware QoS can enforce priority isolation.
In InfiniBand deployments, model-inference communication is mapped to high-priority virtual lanes and KV transfers to low-priority lanes, with weighted round-robin guaranteeing bandwidth for the former. This protects latency-sensitive communication while letting KV traffic soak up idle bandwidth; experiments also show it suits the fine-grained transfer pattern of many small blocks.
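The priority scheme can be modeled in a few lines. InfiniBand enforces this in hardware via virtual lanes and arbitration tables; the function below is only an illustrative software analogue:

```python
# Toy model of strict-priority sharing on one compute NIC: model
# collectives are served first, KV transfers get only the leftover.
# (Real InfiniBand QoS uses virtual lanes and weighted round-robin
# arbitration in hardware; this just shows the resulting allocation.)

def allocate(link_gbps: float, model_demand: float, kv_demand: float):
    """Return (model_share, kv_share): high-priority model traffic is
    fully served up to link capacity; KV traffic consumes the rest."""
    model_share = min(model_demand, link_gbps)
    kv_share = min(kv_demand, link_gbps - model_share)
    return model_share, kv_share

# Model traffic stays protected even when KV demand spikes,
# while KV soaks up all otherwise-idle bandwidth.
shares = allocate(400.0, model_demand=150.0, kv_demand=500.0)
print(shares)
```

This is the property the article describes: latency-sensitive collectives never lose bandwidth to KV traffic, and KV traffic never leaves the link idle.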
At the scheduling level, the system uses a two-tier adaptive mechanism. Inter-engine scheduling selects a pre-filling/decoding engine pair for each request and determines its read path, balancing load by token count and disk queue length. Decoding-engine scheduling proceeds in two stages, cross-group and intra-group, balancing total token counts while respecting HBM capacity constraints to avoid overload.
Intra-engine scheduling acts mainly on the pre-filling engine. It estimates the compute volume of the attention layers to set a per-step compute quota, batches requests FIFO, and splits requests into chunks when necessary so that every GPU's compute time converges, reducing synchronization waits.
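A minimal sketch of quota-based FIFO batching with chunking, as described above. Measuring the quota in raw token counts (rather than the paper's attention-compute estimate) is a simplifying assumption:

```python
from collections import deque

def build_batches(request_tokens, quota):
    """Greedily pack FIFO requests into steps of at most `quota` tokens,
    splitting a request across steps when it exceeds the remaining quota.

    Sketch only: the real scheduler budgets estimated attention compute,
    not raw tokens, and runs online rather than over a fixed list.
    """
    pending = deque(request_tokens)
    batches, current, used = [], [], 0
    while pending:
        tokens = pending.popleft()
        take = min(tokens, quota - used)
        current.append(take)
        used += take
        if tokens > take:                 # push the remainder back for the next step
            pending.appendleft(tokens - take)
        if used == quota:                 # quota filled: close this step
            batches.append(current)
            current, used = [], 0
    if current:
        batches.append(current)
    return batches

print(build_batches([3000, 9000, 2000], quota=4000))
```

Every step carries close to the same amount of work, which is what keeps per-GPU compute times aligned and reduces synchronization bubbles.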
Overall, dual-path loading aggregates storage bandwidth, the theoretical analysis guarantees no new bottlenecks, the compute-NIC-centric design enforces strict traffic isolation, and adaptive scheduling maintains load balance and low latency. Together they form a high-throughput, scalable inference architecture.
04 Experiments show that KV-cache I/O overhead is essentially eliminated, with linear scaling on a thousand-GPU cluster
To verify DualPath's performance gains, DeepSeek ran experiments on a GPU server cluster interconnected with InfiniBand, evaluating three models: DeepSeek V3.2 660B (DS 660B), a 27B scaled-down version of it (DS 27B), and Qwen2.5-32B (Qwen 32B) as a representative dense model.
The results show that DualPath's benefits grow with batch size and maximum effective context length. On DS 660B, DualPath accelerates DeepSeek's internal baseline inference framework by up to 1.87x and comes close to the theoretical upper bound that assumes zero I/O overhead, indicating that KV-cache I/O overhead has been essentially eliminated.
On DS 27B, DualPath improves performance over the same baseline by up to 1.78x.
Varying the append length and generation length, DualPath's advantage is most pronounced in short-token scenarios: longer appends increase GPU compute pressure, while longer generations lengthen the interval between pre-fills and thus reduce KV-cache loading pressure.
Figure 9 shows that as append length grows, the engine without DualPath closes in on DualPath's performance, indicating that the bottleneck gradually shifts to GPU compute. Across append scales, DualPath accelerates the baseline by 1.82x to 1.99x, and the trend when scaling generation length is similar.
Across different pre-filling/decoding ratios, DualPath clearly outperforms the baseline, with an average speedup of 1.64x and a maximum of 2.46x. The baseline engine can use only the pre-filling nodes' storage bandwidth, whereas DualPath uses every node's, confirming that storage bandwidth is the main bottleneck in agent scenarios.
In the online-serving evaluation, DualPath significantly outperforms the baseline in sustainable agent request arrival rate, improving it by 1.67x on DS 27B and 2.25x on DS 660B.
On load balancing, DualPath markedly evens out storage NIC utilization and attention-layer execution times. Compared with round-robin scheduling, its scheduling algorithm improves the storage NIC load-balance index from 1.53 to 1.18, while within the first 5% of the task-execution stage the maximum-to-average attention-layer execution-time ratio is held within 1.06, reducing GPU idle bubbles.
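If the load-balance index is a maximum-to-average ratio (a common definition; the article does not give the paper's exact formula), it is computed as below. The per-NIC sample values are made up, chosen only so the two indices land on the quoted 1.53 and 1.18:

```python
# Illustrative computation of a max-to-average load-balance index:
# 1.0 means perfectly even load across engines.

def balance_index(loads):
    """Ratio of the most-loaded engine to the average load."""
    return max(loads) / (sum(loads) / len(loads))

# Hypothetical bytes served per storage NIC under each scheduler.
round_robin = [153, 80, 95, 72]
adaptive    = [118, 96, 101, 85]

print(f"{balance_index(round_robin):.2f} -> {balance_index(adaptive):.2f}")
```

A lower index means the slowest NIC finishes closer to the average, so no single storage path gates overall loading time.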
For large-scale scalability, DualPath was validated on up to 1,152 GPUs. Offline inference scales nearly linearly from 2P4D (2K agents) to 48P96D (48K agents), with task completion time essentially unchanged.
For online serving, a 44P88D configuration raises throughput 22-fold while maintaining similar latency. Across all experiments the scheduler occupies fewer than 10 CPU cores, showing it is not a performance bottleneck.
Large-scale deployment not only reduces resource fragmentation but also gives more freedom to tune parallelism and the P/D ratio, and it leaves more scheduling headroom to absorb queuing latency under bursts of online requests.
05 Conclusion: an efficiency booster for agent inference arrives, with adaptive mechanisms possibly to come
With the release of the DualPath paper, the industry gains a new approach to large-scale agent inference workloads; for developers and researchers struggling with KV-cache I/O pressure, it is a direction worth watching.
The DeepSeek team also acknowledges, however, that offline inference workloads are highly dynamic, and its next step is to study more adaptive and flexible ways to configure parallelism and the P/D ratio.