On the eve of DeepSeek-V4's release, a "crucial step" comes first, clearing the way for faster agents.
DeepSeek has made a new move.
However, what was released this time is still not the long-awaited DeepSeek-V4.
That doesn't make it any less significant a debut, though. DeepSeek, in collaboration with Tsinghua University and Peking University, has launched a brand-new inference system called DualPath.
More importantly, this system is not designed for ordinary conversations. It targets the core challenges of today's more complex, increasingly popular agent scenarios.
DualPath significantly improves GPU utilization by restructuring how data is loaded, letting agents run more smoothly, and more practically, in real-world workloads with long contexts and many rounds of interaction.
Since it is a joint technical release from three top institutions, the paper is naturally dense with professional terminology and can be a headache to read.
Don't worry, though: this article sticks to plain language, so you can easily understand what DualPath is and what makes it special.
01
Agent Inference: Computing Power Takes a Backseat
You may have noticed that the trend in the AI circle has changed from "large models" to "agents".
In the past, when using large models, the interaction was simple: you input a prompt, the model thought for a few rounds, and then gave you an answer.
In the era of agents, things have become more complicated. The two parties in the interaction are no longer just "humans" and "machines", but also "machines" and "machines". The model not only needs to understand your words but also call the browser, open the code interpreter, and interact with the external environment on its own. The number of interactions has soared from a few times to dozens or even hundreds of times.
In this process, the input and output the agent produces on each tool call are actually quite short, often only a few hundred tokens. The problem is that as the rounds pile up, the context accumulates like a snowball until it becomes a huge mass of hundreds of thousands of tokens.
In other words, agent tasks have a peculiar profile: many rounds, long context, and only a short new appendage each round.
The direct consequence of this pattern is that the KV-Cache hit rate often exceeds 95%.
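To see where a hit rate like that comes from, here is a back-of-envelope sketch. The per-round token count is an illustrative assumption, not a figure from the paper; the function name `cache_hit_rate` is invented for this example.

```python
# Hypothetical sketch: why the KV-Cache hit rate climbs with each agent round.
# tokens_per_round is an assumed, illustrative number.
def cache_hit_rate(rounds, tokens_per_round=500):
    """Fraction of context tokens already cached when round `rounds` runs."""
    total = rounds * tokens_per_round   # full accumulated context
    new = tokens_per_round              # only the latest appendage is uncached
    return (total - new) / total

# After 40 short tool-call rounds, 97.5% of the context is a cache hit.
print(f"{cache_hit_rate(40):.1%}")  # → 97.5%
```

The rate keeps climbing as rounds accumulate, which is exactly the "over 95%" regime the article describes.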
What is KV-Cache? A TV-drama metaphor helps:
Suppose the inference process of a large model is like you watching a TV series that has just reached the 20th episode.
The content of the 20th episode is composed of the plot background of the previous 19 episodes (that is, the context) plus the new plot of the 20th episode (new input).
Without a KV-Cache, it's as if you have amnesia: every time a new episode airs, you must re-watch the previous 19 episodes from start to finish to follow episode 20.
With a KV-Cache, it's as if you have the previous 19 episodes firmly memorized: you only need to watch the new episode to continue seamlessly.
The principle is the same for models with the Transformer architecture.
When an agent completes an interaction and is ready to handle the next task, most of the context it needs has already been calculated in previous interactions. It can simply read the cache, and only a very small amount of new content needs to be recalculated.
So, for a computer, the higher the KV-Cache hit rate the better, because a hit means "less work".
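The drama metaphor maps directly onto code. Below is a toy single-head attention step that shows what the cache actually stores: the K and V projections of every past token, computed once and reused. The dimensions and random weights are placeholders, not anything from DualPath.

```python
import numpy as np

# Toy single-head attention with a KV-Cache. Weights are random placeholders;
# only the cache-append pattern mirrors real Transformer decoding.
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []          # the "memorized episodes"

def step(x):
    """Process one new token, reusing the cached K/V of all earlier tokens."""
    q = x @ Wq
    K_cache.append(x @ Wk)         # K/V computed for the NEW token only
    V_cache.append(x @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)    # attend over every cached token
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return attn @ V

for _ in range(5):                 # five tokens; old K/V are never recomputed
    out = step(rng.standard_normal(d))
print(len(K_cache))  # → 5
```

Each call does work proportional to one new token plus a lookup over the cache, instead of re-encoding the whole history.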
But behind this "less effort", there is a new problem:
A powerful GPU may need less than 1 millisecond to compute a new round of interaction with a few hundred tokens. But first it must fetch the "memory" of hundreds of thousands of tokens, that is, dozens of gigabytes of KV-Cache data.
To "save work" with the KV-Cache, this data has to be hauled from the hard drive or distributed storage into the GPU's video memory.
This is like a top - notch chef who only needs 1 second to cook a dish, but his assistant takes 10 seconds to buy the ingredients.
Therefore, the biggest bottleneck in agent inference is no longer compute, but the I/O speed of KV-Cache data.
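A quick sanity check makes the chef metaphor concrete. Every number here is an assumption for illustration (cache size per token, NIC speed), not a measurement from the paper:

```python
# Back-of-envelope check of the bottleneck claim (illustrative numbers only).
# Assume 300K cached tokens, ~160 KB of KV data per token, and a 400 Gb/s
# storage NIC -- all assumed values, not figures from the DualPath paper.
tokens = 300_000
bytes_per_token = 160 * 1024
cache_bytes = tokens * bytes_per_token           # ≈ 49 GB of KV-Cache

nic_gbps = 400                                   # storage NIC, gigabits/s
transfer_s = cache_bytes * 8 / (nic_gbps * 1e9)  # seconds to move the cache

compute_ms = 1                                   # GPU compute for new tokens
print(f"load ≈ {transfer_s:.2f}s vs compute ≈ {compute_ms}ms")
```

Under these assumptions, loading the cache takes on the order of a second while the actual computation takes a millisecond: the assistant's shopping trip dwarfs the chef's cooking time by roughly three orders of magnitude.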
02
Existing Architecture: PD Separation
To improve inference performance, the architecture commonly adopted in the industry is called "Prefill-Decode separation", abbreviated as PD separation.
Put simply, under this architecture, the GPU cluster is divided into two departments:
One is the prefill engine, which processes large amounts of input text; its work is compute-intensive and well suited to batching. The other is the decode engine, which generates the answer token by token; it is extremely latency-sensitive and memory-bound.
In this division of labor, the prefill engine must continuously load large amounts of KV-Cache data from external storage. Its storage NIC is almost always oversaturated, completely congested.
Meanwhile, although the decode engine is also running normally, its storage NIC sits idle most of the time.
In a warehouse, the incoming goods gate is blocked, and the outgoing goods gate is empty. The entire logistics line is stuck.
In today's era of high computing power costs, leaving the hardware resources in a high - performance chip cluster idle is simply a huge waste.
The most straightforward fix is, of course, to widen the inbound gate, that is, to give the prefill engine more bandwidth. In practice, though, this is both unrealistic and extremely costly.
A smarter approach is to let the outgoing goods gate also help with incoming goods. That is, let the idle decode engine share part of the "data pulling" task.
03
DualPath: Make a Feint to the East and Attack in the West
The research team from DeepSeek, Tsinghua University, and Peking University drew inspiration from research on modern AI data centers.
These architectures, NVIDIA's DGX SuperPOD AI supercomputer being a typical example, generally share an important hardware feature: network isolation.
Each GPU node is typically equipped with two sets of network cards:
One is the compute NIC, used exclusively for GPU-to-GPU communication across cards and nodes. There are usually several of these per node, giving a very large total bandwidth.
The other is the storage NIC, used to read and write the hard drive or distributed storage. There is usually only one per node, so its total bandwidth is comparatively small.
Building on this, the research team set out to exploit the network's full transmission capacity and proposed the idea of dual-path KV-Cache loading.
The old architecture used a single path: the prefill engine pulled KV-Cache data from the hard drive or distributed storage through its own storage NIC.
DualPath adds a second path: the idle decode engine uses its storage NIC to pull KV-Cache data into its own memory, then quickly forwards it to the prefill engine over the high-bandwidth compute network.
Of course, DualPath doesn't make the decode engine help blindly; it monitors the congestion of both gates in real time.
This way, when the inbound gate is jammed and there is momentarily nothing outbound, the outbound gate pitches in on inbound work. The storage-NIC bandwidth of every engine is put to effective use, and the problem of asymmetric bandwidth saturation is solved.
Through rigorous bandwidth analysis, the research team proved that at the prefill-to-decode node ratios common in practice, DualPath can saturate the storage NICs without the compute NICs becoming a new bottleneck, covering most real deployment scenarios.
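The core of that bandwidth argument can be sketched in a few lines. The NIC speeds and the function name are assumptions for illustration; the actual analysis in the paper is more careful than this:

```python
# Minimal sketch of the dual-path idea: effective load bandwidth becomes the
# sum of every engine's storage NIC, as long as the forwarded (detour) traffic
# fits in the compute network's spare capacity. All figures are assumed.
def dualpath_load_bandwidth(prefill_nodes, decode_nodes,
                            storage_nic_gbps=100, compute_spare_gbps=400):
    direct = prefill_nodes * storage_nic_gbps    # the old single path
    detour = decode_nodes * storage_nic_gbps     # decode engines helping
    # Forwarded traffic must cross the compute network into prefill nodes,
    # so cap it by the spare compute bandwidth at those nodes.
    detour = min(detour, prefill_nodes * compute_spare_gbps)
    return direct + detour

# With 1 prefill and 3 decode nodes, load bandwidth quadruples: 100 → 400.
print(dualpath_load_bandwidth(1, 3))  # → 400
```

The `min` is the crux: because compute NICs have far more headroom than storage NICs, the detour traffic rarely hits that cap at realistic node ratios, which is the paper's claim that the compute network does not become the new bottleneck.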
04
Traffic Scheduling and Priority Game
Even though the data takes a much longer detour, actual inference efficiency improves significantly. The idea looks very promising.
However, implementing it in a system that operates at microsecond granularity still poses serious challenges:
The first is the chaos introduced by moving so much extra data:
Letting the decode engine help pull historical memory (the KV-Cache) is a good idea, but it also carries real risk.
During the inference process, the GPU needs to frequently conduct "collective communication" with other GPUs in the cluster to complete data synchronization and result exchange. This communication is extremely sensitive to latency and cannot afford even a slight delay.
If the decode engine starts downloading gigabytes of KV-Cache data, that eruption of traffic can squeeze the network bandwidth. Should the collective communication between GPUs get blocked, inference stalls anyway.
To tame this chaos, the research team installed a "traffic policeman" at the network-card level:
The communication between GPUs must have the highest priority. It has the right to use the VIP channel and must be guaranteed to run normally without congestion at all times.
The task of pulling KV - Cache data only has normal priority. It can only start when there are no vehicles in the VIP channel. As soon as a GPU communication task appears, it has to give way immediately.
This "traffic policeman", played by the compute NIC (CNIC), must fully isolate the two classes of traffic, guaranteeing that the decode engine's data pulls never interfere with inter-GPU collective communication.
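The two-priority rule can be illustrated with a simple priority queue. This is a scheduling sketch only; real NIC-level isolation is done in hardware queues, and the task names here are invented:

```python
import heapq

# Sketch of the two-priority rule: collective communication always dequeues
# before KV-Cache pulls, no matter the arrival order. Illustrative only;
# actual traffic isolation happens in the NIC's hardware queues.
HIGH, NORMAL = 0, 1                # lower number = served first

queue, seq = [], 0
def submit(priority, task):
    global seq
    heapq.heappush(queue, (priority, seq, task))  # seq breaks ties FIFO
    seq += 1

submit(NORMAL, "pull KV-Cache chunk #1")
submit(NORMAL, "pull KV-Cache chunk #2")
submit(HIGH,   "all-reduce")       # arrives last, runs first

order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(order[0])  # → all-reduce
```

The "VIP channel" in the article is exactly this: a collective-communication task preempts the queue regardless of when it arrives, while cache pulls only proceed in the gaps.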
The other is how to dynamically allocate tasks:
People's various needs mean that the inference tasks of agents are always dynamically changing. Sometimes there are many requests, sometimes few; some requests are long, and some are short.
If this "traffic policeman" gives the wrong orders, it backfires, for example by sending the decode engine on a long detour to pull data even when the prefill engine's bandwidth isn't saturated.
How to balance the load and allocate tasks dynamically, in real time, is the mathematical problem this "traffic policeman" must solve.
To this end, the research team designed an adaptive request scheduler to let the system dynamically select the optimal data loading path according to the queue length of the storage network card, GPU computing load, and request characteristics during operation.
Between engines, it monitors not only each GPU's current compute load, that is, the number of tokens waiting to be processed, but also the disk read-queue length of the underlying distributed storage on each node.
In this way, new requests will always be intelligently allocated to the engine with the shortest read queue and the least - busy GPU for loading.
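The routing rule reduces to a shortest-queue selection. The engine records and field names below are invented for illustration; the real scheduler also weighs request characteristics:

```python
# Sketch of the adaptive routing rule: send a new request to the engine with
# the shortest storage read queue, breaking ties by GPU token load.
# Engine records and field names are invented for this illustration.
def pick_engine(engines):
    return min(engines, key=lambda e: (e["read_queue"], e["pending_tokens"]))

engines = [
    {"name": "prefill-0", "read_queue": 12, "pending_tokens": 80_000},
    {"name": "decode-1",  "read_queue": 2,  "pending_tokens": 5_000},
    {"name": "decode-2",  "read_queue": 2,  "pending_tokens": 9_000},
]
print(pick_engine(engines)["name"])  # → decode-1
```

Here the congested prefill engine loses to an idle decode engine, which is precisely when the detour path pays off.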
Inside an engine, multiple GPUs work in lockstep: all of them must finish the current step before any can move on. This is the synchronization imposed by the attention computation.
To keep a GPU holding a short task from idling while it waits for a GPU holding a long one, the system uses a quota-based batching algorithm that splits long tasks into shorter ones. That way, the attention computation time across GPUs is roughly aligned, and they all advance to the next stage as early as possible.
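The splitting step can be sketched as follows. The quota value and function name are illustrative assumptions, not details from the paper:

```python
# Sketch of quota-based splitting: cap per-batch attention work so that long
# and short requests finish in step. The quota value is illustrative.
def split_by_quota(task_lengths, quota):
    """Split each task (measured in tokens) into chunks of at most `quota`."""
    chunks = []
    for length in task_lengths:
        while length > quota:      # peel off quota-sized chunks
            chunks.append(quota)
            length -= quota
        chunks.append(length)      # the (possibly shorter) remainder
    return chunks

# A 900-token task no longer stalls the 300-token tasks beside it.
print(split_by_quota([900, 300, 300], quota=300))  # → [300, 300, 300, 300, 300]
```

After splitting, every chunk costs roughly the same attention time, so the lockstep synchronization barrier no longer leaves short-task GPUs waiting idly.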