Can DeepSeek save China $1 trillion?
Introduction
In the second half of 2026, NVIDIA will deliver its most powerful AI platform to date: the Vera Rubin VR200 NVL72. It packs 72 Rubin GPUs and 36 Vera CPUs into a single rack. Morgan Stanley estimates that the bill of materials for this machine is around $7.8 million.
This figure is astonishing in itself. What deserves even more attention is where the money goes.
Of the $7.8 million, approximately $2 million goes not to the world-famous GPU chips and their compute cores but to memory: high-bandwidth memory (HBM4) and conventional memory (LPDDR5X). In just one year, price increases have pushed the cost of this memory up by 435%.
This is a sign. In increasingly expensive AI machines, money is shifting in large amounts from the "components responsible for calculation" to the "components responsible for memory and storage."
Keep this sign in mind, because DeepSeek, the subject of this article, is doing exactly the opposite. The rising cost of memory is pushing everyone to pay a premium for AI hardware. DeepSeek is instead looking for ways, without compromising competitiveness, to raise the Token output of this expensive hardware roughly fourfold through software-hardware co-design, which is equivalent to saving about 75% of the hardware investment.
This leads to a recent, hotly debated conjecture: can DeepSeek, through its own efforts, save China's AI infrastructure build-out one trillion US dollars?
Is this really possible?
One Trillion US Dollars: The Savings
The NVIDIA price list mentioned earlier represents the largest expense in the current AI infrastructure ledger. Given today's supply and demand, anyone who wants the most advanced AI machines has to accept this bill.
DeepSeek can't change this.
What it can change is another thing: For the same machine and the same $2 million worth of expensive storage hardware, how many Tokens can it actually produce?
This question has become particularly concrete after the release of DeepSeek V4.
What deserves attention in V4 is not just the model itself but the three key strategies it demonstrates. First, keep compressing the "memory" so that long contexts no longer exhaust GPU memory. Second, wake up the "body" on demand so that a huge expert model does not have to be fully mobilized every time. Third, turn repeated calculations into reusable assets so that contexts already computed do not keep burning money.
These techniques share a prominent feature: they rely on software-hardware co-design rather than pure software optimization. That is why some people joke that DeepSeek might become China's largest AI hardware company.
Its model page shows that in a 1-million-Token context scenario, V4-Pro requires only 27% of the single-Token inference compute and 10% of the cache of the previous generation. This article rounds the compute figure to one quarter for the calculations that follow.
Under the traditional approach, this hardware supports one unit of throughput. Through long-context compression, on-demand activation, cache reuse, and inference scheduling, DeepSeek can raise the effective Token output of the same hardware to about four times that level. So the cost isn't "cut"; it's spread out. What used to require four machines can now be done by one. Instead of each Token carrying a full share of the expensive hardware cost, the same hardware cost is now spread across four Tokens.
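The step from "27% of the compute" to "four times the output" and "75% saved" is simple arithmetic. The sketch below works it through for both the reported figure and the rounded one. It assumes, as the article does, that Token output per machine scales inversely with compute per Token; the function names are illustrative.

```python
def throughput_multiplier(compute_fraction: float) -> float:
    """Tokens produced per unit of hardware, relative to a baseline of 1.0."""
    return 1.0 / compute_fraction


def hardware_saved(compute_fraction: float) -> float:
    """Share of hardware no longer needed to produce the same Token output."""
    return 1.0 - compute_fraction


for label, fraction in [("reported 27%", 0.27), ("rounded to one quarter", 0.25)]:
    print(f"{label}: {throughput_multiplier(fraction):.2f}x output, "
          f"{hardware_saved(fraction):.0%} less hardware")
# reported 27%: 3.70x output, 73% less hardware
# rounded to one quarter: 4.00x output, 75% less hardware
```

The rounding moves the result from about 3.7 times to exactly 4 times, which is why the figures that follow are approximations.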
This is where DeepSeek truly shines: It doesn't change NVIDIA's price list, but it does change the output rate of NVIDIA machines in the AI ledger. The significance of this is far greater than a simple API price cut.
The figure of one trillion US dollars is not a baseless assumption.
McKinsey's 2025 report "The cost of compute" gives a specific figure: by 2030, data centers worldwide will need roughly $6.7 trillion in investment to keep up with demand for computing power. Of this, the part dedicated to handling AI workloads accounts for about $5.2 trillion.
In other words, in the next few years, the money that humanity plans to invest in AI hardware will be measured in trillions of US dollars.
A large part of this huge sum will flow to the most advanced and scarce hardware, namely HBM high-bandwidth memory and LPDDR memory. What DeepSeek is doing is systematically reducing the Chinese AI industry's reliance on this expensive hardware. Against spending on this scale, even a modest reduction in that dependency translates into savings that can approach the trillion-dollar level.
As China's daily Token consumption grows from over one hundred trillion today to hundreds or even thousands of trillions, any reduction in the cost per Token will be magnified into a huge difference in infrastructure cost. If the same throughput can truly be achieved with one quarter of the hardware, then in the foreseeable future DeepSeek could plausibly save China's AI infrastructure nearly one trillion US dollars in computing hardware investment.
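The article does not state how much China would otherwise spend, so the trillion-dollar figure can only be checked in reverse: how large would the baseline hardware spend have to be for a given efficiency gain to save $1 trillion? The sketch below solves for that baseline. It assumes savings scale linearly with the compute fraction, which is a simplification, and all names are illustrative.

```python
def savings(baseline_spend: float, compute_fraction: float) -> float:
    """Spend avoided if only `compute_fraction` of the hardware is needed."""
    return baseline_spend * (1.0 - compute_fraction)


def baseline_needed(target_savings: float, compute_fraction: float) -> float:
    """Baseline spend required for the savings to reach `target_savings`."""
    return target_savings / (1.0 - compute_fraction)


TARGET = 1.0  # trillion US dollars

for fraction in (0.25, 0.27):
    needed = baseline_needed(TARGET, fraction)
    print(f"compute fraction {fraction:.2f}: baseline of ${needed:.2f} trillion "
          f"saves ${savings(needed, fraction):.2f} trillion")
# compute fraction 0.25: baseline of $1.33 trillion saves $1.00 trillion
# compute fraction 0.27: baseline of $1.37 trillion saves $1.00 trillion
```

In other words, the conjecture holds if the hardware investment that would otherwise be needed is on the order of $1.3 to $1.4 trillion.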
This is infrastructure arithmetic: whoever can produce more Tokens from the same fixed hardware spend will need to build fewer data centers, buy fewer GPUs, and use less GPU memory. In effect, they get to redistribute the entry tickets to the future of AI.
So, how does DeepSeek achieve this? The answer is that it has made three key improvements to the large-model "machine".
Two "Gas Guzzlers"
A common misconception is that the most expensive part of large models is "thinking", that is, the computation. In fact, it's not.
The two real "gas guzzlers" are the "memory" and the "body". And they consume the most expensive fuel: high-bandwidth memory (HBM), a type of memory that is packaged together with the GPU, extremely fast, and extremely expensive.
Let's start with the memory. When generating text, large models have a rather clumsy characteristic: every time they generate a new Token (roughly, a word or word fragment), they have to look back at all the previous content. This is because the meaning of language is built up layer by layer, and what to say next depends entirely on the context already established.
This is like a simultaneous interpreter. He can't start speaking just based on your last sentence; he has to keep in mind everything you've said before. Only by remembering those contexts can he understand the true meaning of the current sentence. The longer you speak, the more he has to remember.
To avoid recalculating from scratch for each Token (which would be extremely slow and impractical), the model temporarily stores the intermediate results it has already computed. This store is called the KV cache (Key-Value Cache), which can be understood as the model's short-term memory.
The problem is that it keeps growing as the conversation gets longer: every additional Token of context adds another entry, so its size grows in direct proportion to the context length.
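The trade-off can be seen in a toy decoding loop. The sketch below does no real attention math; it only counts how many key/value projections are computed and how many entries the cache holds, with and without caching. The function and its counters are illustrative.

```python
def decode(num_tokens: int, use_cache: bool) -> tuple[int, int]:
    """Return (key/value projections computed, entries held in the cache)."""
    projections = 0
    cache = []  # one (key, value) placeholder per past token
    for step in range(1, num_tokens + 1):
        if use_cache:
            cache.append(("k", "v"))  # project only the newest token
            projections += 1
        else:
            projections += step  # re-project every token seen so far
    return projections, len(cache)


print(decode(1000, use_cache=False))  # (500500, 0)
print(decode(1000, use_cache=True))   # (1000, 1000)
```

Caching turns quadratic recomputation into linear work, and the price is a cache whose length equals the context length.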
Take a specific example. By one estimate for a standard architecture, processing a context of about 120,000 Tokens can require 488GB of high-bandwidth memory for this "memory" alone. The top-of-the-line Rubin GPU that NVIDIA is about to deliver has 288GB of memory per GPU. Storing this "memory" would therefore fill about 1.7 of the most advanced GPUs, which in practice means two, before the model has even started doing real work.
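The article does not say which "standard structure" the 488GB estimate assumes. The sketch below uses the usual back-of-the-envelope formula for a KV cache with standard multi-head attention. The configuration is an assumption chosen for illustration (61 layers, 128 key/value heads of dimension 128, 2 bytes per value, a 128K-Token context); it happens to land on the same number when measured in GiB.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens."""
    # Factor 2: one key vector and one value vector per head, per layer, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 2**30
GPU_MEMORY_GB = 288  # per-GPU capacity quoted in the article

# Assumed configuration, not taken from the article.
size = kv_cache_bytes(tokens=131_072, layers=61, kv_heads=128, head_dim=128)

print(size / GIB)                             # 488.0
print(round(size / GIB / GPU_MEMORY_GB, 2))   # 1.69

# The cache grows linearly: twice the context, twice the memory.
double = kv_cache_bytes(tokens=262_144, layers=61, kv_heads=128, head_dim=128)
print(double / size)                          # 2.0
```

The division by 288 mixes GiB and GB, so treat the 1.69 as a rough ratio rather than an exact one.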
Now, let's talk about the "body". The "body" of a model refers to its parameter weights, which can be roughly understood as the carrier of all its knowledge and abilities. The more capable the model is, the larger its "body" usually is, often with hundreds of billions or even trillions of parameters.
Traditional dense models (Dense Models, which means models that use all parameters to process any input) have a problem: No matter what you ask them, they have to mobilize their entire "body". It's like going to a hospital just to see a dentist, but all the doctors from all departments in the hospital are called in to examine you from head to toe, and then the dentist finally sees you. It's absurd, but you still have to pay the full bill.
This large "body" also has to reside in expensive high-bandwidth memory, always on standby.
The "memory" and the "body", these two "gas guzzlers", have pushed the value distribution of the entire hardware system firmly toward the most expensive, scarce, and supply-constrained hardware. Over the past few decades, the industry's countermeasure has been simple and crude: if there's not enough computing power, add more; if there's not enough memory, add more. As a result, the industry's wealth has become highly concentrated in this most advanced hardware chain, and the fattest profits sit in the scarcest link.
The price of Tokens has been hijacked by the scarcity of a certain type of hardware. And DeepSeek's three improvements are precisely aimed at loosening this stranglehold.
The First Cut: Operating on the "Brain"
The first cut is made on the "memory", and its target is the part of the whole system that people are most reluctant to touch: the attention mechanism (Attention, the core mechanism large models use to understand relationships within the context).
The attention mechanism is the "brain" of the large model. The model can follow the context and pick out the key points of a long conversation because this mechanism constantly weighs the relationships between Tokens. The expensive "memory" described earlier is what each "pulse" of this brain leaves behind.
Wanting to save memory but wary of the risks, almost everyone has chosen to work around this "brain" and make adjustments only at the periphery. From Multi-Query Attention (MQA), proposed in 2019 by Noam Shazeer, one of the original authors of the Transformer, to Grouped-Query Attention (GQA), proposed by Google in 2023 and widely adopted by models such as Llama, the mainstream approach has been to "let multiple query heads share the same memory": in essence, "remember less and make do". The memory savings are remarkable, but they can come at some cost in model quality. The consensus behind this approach has always been "compromise": compression is assumed to damage quality, and the only thing up for negotiation is how much.
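How much "sharing the same memory" saves follows directly from the number of key/value heads that have to be cached. The sketch below compares full multi-head attention, GQA, and MQA on one assumed configuration (61 layers, 128 query heads of dimension 128, 2 bytes per value, a 128K-Token context). The group count of 8 for GQA is also an assumption, and none of these numbers come from the article.

```python
def kv_cache_gib(kv_heads: int, tokens: int = 131_072, layers: int = 61,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB when `kv_heads` key/value heads are stored."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


variants = {
    "MHA (128 KV heads, one per query head)": 128,
    "GQA (8 KV heads, 16 query heads share each)": 8,
    "MQA (1 KV head shared by all query heads)": 1,
}

for name, kv_heads in variants.items():
    print(f"{name}: {kv_cache_gib(kv_heads):.2f} GiB")
# MHA (128 KV heads, one per query head): 488.00 GiB
# GQA (8 KV heads, 16 query heads share each): 30.50 GiB
# MQA (1 KV head shared by all query heads): 3.81 GiB
```

The savings are proportional to how many query heads are forced to share one set of keys and values, which is exactly where the quality trade-off comes from.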
DeepSeek refuses to compromise. It chooses to operate directly on the "brain" and reform the attention mechanism itself.
Its solution is called Multi-head Latent Attention (MLA), which first appeared in DeepSeek-V2 in 2024. To put it as an analogy: other models take notes by transcribing every detail exactly, filling up several notebooks. MLA, on the other hand, first refines the notes into a highly concentrated summary, stores only the summary, and then accurately restores the details when needed. Technically, this is called "low-rank compression": projecting the seemingly complex but actually highly redundant memory into a much more compact space for storage.
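The core idea of low-rank compression can be sketched in a few lines: project each Token's hidden state down to a small latent vector, cache only that vector, and rebuild keys and values from it with up-projection matrices when attention needs them. This is a deliberately simplified toy with random weights and tiny dimensions, not DeepSeek's implementation; it leaves out the per-head structure and the separate handling of positional information.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
TOKENS, HIDDEN, LATENT, KV_DIM = 6, 16, 4, 32  # toy sizes, chosen for illustration

hidden = random_matrix(TOKENS, HIDDEN)   # one hidden state per token
w_down = random_matrix(HIDDEN, LATENT)   # compression ("write the summary")
w_up_k = random_matrix(LATENT, KV_DIM)   # restores keys from the summary
w_up_v = random_matrix(LATENT, KV_DIM)   # restores values from the summary

latent_cache = matmul(hidden, w_down)    # TOKENS x LATENT: the only thing stored
keys = matmul(latent_cache, w_up_k)      # TOKENS x KV_DIM, rebuilt on demand
values = matmul(latent_cache, w_up_v)    # TOKENS x KV_DIM, rebuilt on demand

stored = TOKENS * LATENT                 # numbers kept in the cache
uncompressed = TOKENS * 2 * KV_DIM       # numbers a plain KV cache would keep
print(stored, uncompressed, stored / uncompressed)  # 24 384 0.0625
```

Because the keys and values are defined as projections of the latent vector, and the model is trained that way, rebuilding them from the cache is exact rather than a lossy decompression of something richer.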
How amazing is the effect? The results presented in the DeepSeek-V2 paper show that, compared with the previous-generation model from the same family, V2 is more capable while reducing the training cost by 42.5%, reducing the KV cache by 93.3%, and raising the maximum generation throughput to 5.76 times. In the earlier example where 488GB of memory was consumed, this approach could bring the figure down to just a few gigabytes.
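The "few gigabytes" claim can be sanity-checked with the same kind of estimate. The sketch below assumes that a latent cache stores 512 compressed values plus 64 positional values per Token per layer, and reuses an assumed 61-layer, 128K-Token, 2-bytes-per-value setup whose uncompressed cache is 488 GiB. All of these numbers are assumptions for illustration, and the result is not meant to reproduce the paper's 93.3% figure, which is measured against a different baseline model.

```python
GIB = 2**30

TOKENS = 131_072        # assumed 128K-token context
LAYERS = 61             # assumed layer count
BYTES_PER_VALUE = 2     # assumed 16-bit storage

# Uncompressed baseline: keys and values for 128 heads of dimension 128.
baseline_values_per_token = 2 * 128 * 128
# Latent cache: assumed 512 compressed values plus 64 positional values.
latent_values_per_token = 512 + 64

baseline_gib = baseline_values_per_token * LAYERS * BYTES_PER_VALUE * TOKENS / GIB
latent_gib = latent_values_per_token * LAYERS * BYTES_PER_VALUE * TOKENS / GIB

print(f"baseline: {baseline_gib:.2f} GiB")                # baseline: 488.00 GiB
print(f"latent:   {latent_gib:.2f} GiB")                  # latent:   8.58 GiB
print(f"ratio:    {latent_gib / baseline_gib:.2%}")       # ratio:    1.76%
```

Under these assumptions the cache shrinks from hundreds of gigabytes to single digits, which is consistent with the order of magnitude the article describes.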
But what's truly remarkable is not how much memory is saved, but that it hardly incurs any loss of detail.
Normally, if you compress a book into a one-page summary, you'll never be able to retrieve all the details no matter how you try to restore it. But in the experiments announced by DeepSeek, this compressed memory not only performs no worse than the standard attention mechanism that "transcribes the entire book", but in some cases is even slightly better.
By V4, this approach has been extended to an even more extreme long-context scenario: V4-Pro uses a hybrid attention architecture. In a 1-million-Token context setting, it requires only 27% of the inference compute and 10% of the cache of the previous generation.
To understand how difficult this is, you have to realize that this is like performing surgery on a flying plane. Modifying the attention mechanism means rewriting the most fundamental calculation logic of the model, retraining the entire model, and rebuilding the entire service system that supports its operation. If any part goes wrong, the intelligence of the model will collapse. This is not like changing a valve on a tire; it's like a brain surgery.
And DeepSeek has managed to make the AI even healthier after the surgery than before.
The Second and Third Cuts: Installing Numbered Lockers for the Machine
The first cut tamed the "memory". The second cut is aimed at the large "body".
The idea for this cut is not original to DeepSeek; it follows a well-established path: Mixture of Experts (MoE), which means splitting the model into many "experts" and activating only a few of them each time.
This concept dates back to 1991, and Shazeer et al. brought it to large-scale deep learning in 2017 with the sparsely gated MoE layer. Google's GShard and Switch Transformer subsequently incorporated it into the Transformer. What really popularized it was Mixtral 8x7B, which the French company Mistral released at the end of 2023 with nothing more than a torrent magnet link. It has about 46.7 billion parameters in total but activates only about 12.9 billion for each Token it processes.
Going back to the "hospital where seeing a dentist involves the whole hospital", what MoE does is turn it into a well-organized hospital: when you go to see a dentist, the receptionist sends you straight to the dental department, and the doctors in other departments carry on with their normal business. The hospital can still have a large staff, and the total number of parameters can reach hundreds of billions or even trillions, but only a small part of them is actually activated each time.
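The "receptionist" in this analogy is usually called a router. A minimal sketch of top-k routing follows: the router scores every expert, only the k highest-scoring experts run, and their outputs are blended with softmax weights. The experts here are trivial stand-in functions and the router scores are passed in by hand; in a real model both are learned, so treat every name below as illustrative.

```python
import math


def route(scores, top_k):
    """Pick the top_k experts and turn their scores into softmax weights."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_scores, top_k=2):
    """Run only the selected experts and blend their outputs."""
    weights = route(router_scores, top_k)
    output = sum(w * experts[i](x) for i, w in weights.items())
    return output, sorted(weights)


# Four stand-in "experts"; each just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

output, used = moe_layer(10.0, experts, router_scores=[0.1, 2.0, -1.0, 2.0], top_k=2)
print(output, used)  # 30.0 [1, 3]
```

Experts 0 and 2 are never called for this input, which is the whole point: their parameters exist but cost no computation on this step.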
DeepSeek pushed this approach to a rather radical scale in V3, and it is even more extreme in the V4 era: V4-Pro has 1.6 trillion total parameters and 49 billion activated parameters; V4-Flash has 284 billion total parameters and 13 billion activated parameters. That means the "total body" of the model continues to grow, but the part that actually moves at each step is still kept very small.
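The share of the "body" that moves on each step can be computed directly from the parameter counts quoted in the article. The sketch below does only that division; the dictionary layout is illustrative.

```python
# (total parameters, activated parameters), as quoted in the article.
models = {
    "Mixtral 8x7B": (46.7e9, 12.9e9),
    "V4-Pro": (1.6e12, 49e9),
    "V4-Flash": (284e9, 13e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per Token")
# Mixtral 8x7B: 27.6% of parameters active per Token
# V4-Pro: 3.1% of parameters active per Token
# V4-Flash: 4.6% of parameters active per Token
```

By this measure the V4 models activate a much smaller fraction of their parameters than Mixtral 8x7B did.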
But the real ingenuity of the second cut lies not just in "activating fewer doctors". It also transforms the way the model accesses this "body".
Let's use a more fitting analogy. In the past, large models were like a huge but disorganized storage room: all the items were piled up together, and every time you wanted to retrieve just one thing, you had to open the door and rummage through everything from the very bottom. To make this search fast enough to serve a large number of customers, you had to move the entire storage room to the most expensive "prime location", that is, the high-bandwidth memory.
DeepSeek has transformed this storage room into a cabinet with tens of thousands of numbered compartments. When you need something, you simply open the corresponding compartment by its number without touching the others. This means that you no longer need to keep the entire cabinet in the most expensive location. Most of the compartments that are not currently in use can be stored in much cheaper regular memory (LPDDR) or even cheaper solid-state drives and retrieved quickly when needed. DeepSeek's ecosystem and open-source inference systems like SGLang are continuously exploring such offloading and streaming-load methods.
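The numbered-locker idea can be sketched as a two-tier store: a small, fast tier that holds only recently used experts, and a large, cheap tier that holds everything. The class below is a hypothetical illustration using a least-recently-used eviction policy; it is not the API of SGLang or of any DeepSeek system, and real systems move tensors between memory tiers rather than strings between dictionaries.

```python
from collections import OrderedDict


class TieredExpertStore:
    """Keep hot experts in a small fast tier, fetch the rest from a slow tier."""

    def __init__(self, slow_tier, fast_capacity):
        self.slow_tier = slow_tier      # stands in for LPDDR or SSD
        self.fast = OrderedDict()       # stands in for HBM
        self.fast_capacity = fast_capacity
        self.slow_loads = 0

    def get(self, expert_id):
        if expert_id in self.fast:
            self.fast.move_to_end(expert_id)   # mark as most recently used
            return self.fast[expert_id]
        weights = self.slow_tier[expert_id]    # open one numbered compartment
        self.slow_loads += 1
        self.fast[expert_id] = weights
        if len(self.fast) > self.fast_capacity:
            self.fast.popitem(last=False)      # evict the least recently used
        return weights


slow = {i: f"weights-{i}" for i in range(1000)}
store = TieredExpertStore(slow, fast_capacity=4)

for expert_id in [7, 42, 7, 7, 42, 900, 7]:
    store.get(expert_id)

print(store.slow_loads, list(store.fast))  # 3 [42, 900, 7]
```

Seven requests touched only three of the thousand compartments, so the fast tier never needed to hold more than a handful of experts.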
At this point, the synergy between the first two cuts becomes apparent: The first cut reduces the "memory", and the second cut numbers the "body" and only retrieves the necessary part. Together, these two cuts ensure that the part of the machine that actually needs to occupy the most expensive memory at any given moment is minimized.
The third cut takes the logic of "retrieving by number" to the extreme: it also tries to save on the "computation" itself. Some results can be precomputed, stored in numbered compartments, and retrieved directly when needed instead of being recalculated each time. It's like someone who has memorized the multiplication table: they don't need to count on their fingers to work out 7 times 8; they can just say 56. This is equivalent to replacing the extremely expensive "hard calculation" (chip operations) with the extremely cheap "lookup" (memory reads).
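The multiplication-table analogy translates almost literally into code. The sketch below precomputes a table once and then answers every later question by lookup, using a counter to show that no further computation happens. It illustrates the general compute-versus-lookup trade only; it does not describe how DeepSeek's models store or index precomputed results.

```python
stats = {"computed": 0}


def multiply_the_hard_way(a, b):
    """Stand-in for an expensive computation; counts how often it runs."""
    stats["computed"] += 1
    return a * b


# Pay the cost once: fill the numbered compartments.
TABLE = {(a, b): multiply_the_hard_way(a, b)
         for a in range(1, 10) for b in range(1, 10)}


def lookup(a, b):
    """Answer from memory without recomputing."""
    return TABLE[(a, b)]


answers = [lookup(7, 8) for _ in range(1000)]

print(answers[0])         # 56
print(stats["computed"])  # 81
```

A thousand questions cost 81 computations in total, all of them paid up front.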
In V4, this cut has a more direct commercial expression: the price for cache hits has been cut significantly, and long-context reuse is built directly into the pricing system. Avoiding repeated calculation is not only technically possible but commercially rewarded.
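What a lower cache-hit price means for a bill can be sketched with made-up numbers. The prices below are hypothetical placeholders, not DeepSeek's actual rates, and the workload (50 requests that share a 100,000-Token document and each add 1,000 new Tokens) is likewise invented for illustration.

```python
MISS_PRICE = 1.0   # hypothetical price per million input Tokens, cache miss
HIT_PRICE = 0.1    # hypothetical price per million input Tokens, cache hit


def input_cost(hit_tokens: int, miss_tokens: int) -> float:
    """Input cost of one request, in the same (hypothetical) currency units."""
    return (hit_tokens * HIT_PRICE + miss_tokens * MISS_PRICE) / 1_000_000


PREFIX, NEW, REQUESTS = 100_000, 1_000, 50

# Without reuse, every request pays full price for the shared prefix.
without_reuse = REQUESTS * input_cost(0, PREFIX + NEW)

# With reuse, only the first request misses on the prefix.
with_reuse = input_cost(0, PREFIX + NEW) + (REQUESTS - 1) * input_cost(PREFIX, NEW)

print(f"{without_reuse:.2f}")  # 5.05
print(f"{with_reuse:.2f}")     # 0.64
```

Under these placeholder prices, reusing the shared context cuts the input bill to roughly one eighth, which is the kind of incentive the pricing change creates.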
Looking at the three cuts together, they are not three isolated things but a progressive set of the same logic: Transforming a chaotic mess that has to be rummaged through into a system where everything can be retrieved precisely by number. The "memory" is minimized, only the necessary part of the "body" is activated, and if a calculation can be looked up, it won't be recalculated. Each cut reduces the machine's reliance on the most expensive hardware, and when all three cuts are combined, the consumption of the most advanced hardware for running the same task is only a fraction of what it used to be.