Jensen Huang said it was a "disaster": DeepSeek ran successfully on Huawei chips.
The API pricing for DeepSeek V4 has been released: input for the Flash version costs $1 per million tokens, and for the Pro version $12. Over the same period, API calls to GPT-5.4 and Claude Opus 4.6 cost roughly 50 times as much as DeepSeek V4.
That is not a modest discount; it is a 50-fold gap, a figure so large the two prices hardly seem to come from the same competition.
The price itself, however, is not the point. Looking back over the previous three generations: V2 trained at 1/70 the cost of GPT-4 Turbo, V3 at 1/14 of GPT-4, and R1 at 1/20 of GPT-4o. DeepSeek has drawn a steep cost-decline curve. Even Altman himself has said that the cost of AI drops by a factor of 10 every 12 months, a pace even more dramatic than Moore's Law.
V4 also brings a bigger variable: its technical report states plainly that the model was verified in parallel on Huawei Ascend NPUs and NVIDIA GPUs, making it the first frontier large model natively adapted to the Ascend platform. Jensen Huang said on a podcast that this is "catastrophic". NVIDIA's moat is not the raw compute of its GPUs but the software ecosystem in which CUDA is the default starting point.
Fifty times cheaper, and it runs on Ascend. How on earth did DeepSeek achieve this? And what does it mean when this efficiency-first path is followed to its end?
A 50-fold price difference
The API pricing for DeepSeek V4 has been released: the Flash version charges $1 per million input tokens and $2 per million output tokens; the Pro version charges $12 and $24. On a cache hit, the Flash input price drops to $0.2 per million tokens.
Over the same period, API calls to GPT-5.4 and Claude Opus 4.6 cost roughly 50 times as much as V4.
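To make the gap concrete, here is a back-of-the-envelope cost calculation using the prices above. The per-request token counts and the flat 50x multiplier for the closed models are illustrative stand-ins for the article's claim, not an official price sheet.

```python
# Cost of one API call at the prices quoted above (USD per million tokens).
PRICES = {  # model: (input price, output price)
    "V4 Flash":             (1.0,  2.0),
    "V4 Flash (cache hit)": (0.2,  2.0),
    "V4 Pro":               (12.0, 24.0),
}

def request_cost(model, input_tokens, output_tokens):
    """USD cost for a single call with the given token counts."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1e6

# Example: a long-context call with 200k input tokens and 4k output tokens.
for model in PRICES:
    print(f"{model:22s} ${request_cost(model, 200_000, 4_000):.2f}")

# The article's ~50x multiplier applied to V4 Pro as a rough competitor proxy:
print(f"{'~50x competitor':22s} ${50 * request_cost('V4 Pro', 200_000, 4_000):.2f}")
```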
V4 is fully open source under the MIT license, and its release came one day before OpenAI launched its new Agent feature. The intent to go head to head is hard to miss.
The pricing of V4, however, is not an isolated price cut. Looking back at the previous three generations, DeepSeek has drawn a steep cost-decline curve.
In early 2024, V2's training cost fell to 1/70 of GPT-4 Turbo's, thanks to the combined innovation of the MLA architecture and a sparse MoE architecture. By the end of that year, V3 trained for $5.6 million, 1/14 of GPT-4's $78 million. R1 followed at $6 million, 1/20 of GPT-4o's roughly $120 million training spend.
Three generations, each cutting cost by roughly an order of magnitude. This is not a one-off promotion; it is a curve.
A year earlier, on the day R1 was released, NVIDIA's market value evaporated by nearly $600 billion in a single day. The "DeepSeek Moment" became a memory anchor for the entire tech world. V4 takes that story one step further.
Of course, the curve is not without controversy. Demis Hassabis, head of Google DeepMind, bluntly called DeepSeek's cost data "under-reported and somewhat misleading", saying the company "only announced the cost of the final training stage, which is only a small part of the total cost". The analysis firm SemiAnalysis further estimated that DeepSeek's hardware spending far exceeds $500 million, and that the $6 million figure in the paper covers only the GPU cost of the pre-training run.
Yet even if DeepSeek's hardware investment exceeds $500 million, that is capital expenditure including chip procurement, while GPT-4o's roughly $120 million refers to the compute cost of a single training run; the two are not on the same scale. And even counting the tens of billions of dollars of compute infrastructure behind OpenAI, DeepSeek still holds an order-of-magnitude advantage in per-run training cost. The focus of the controversy proves the conclusion: even if the cost is understated, it is still astonishingly cheap.
This is not DeepSeek's story alone. From GPT-4 to GPT-4o, OpenAI's per-token price also fell by about 150 times. Even after that drop, DeepSeek's API remains roughly 95% cheaper than OpenAI's.
Altman himself wrote plainly in a February 2025 essay that the cost of using a given level of AI falls by about a factor of 10 every 12 months. Moore's Law changed the world by doubling every 18 months; the decline in AI cost, he noted, is "even more powerful".
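A quick calculation shows why the comparison holds; the two rates below are exactly those quoted above, compounded over the same horizon.

```python
# Compound the two rates Altman compares: AI cost falls ~10x every 12 months,
# while Moore's Law doubles transistor density every 18 months.
def compound(rate, period_months, horizon_months):
    return rate ** (horizon_months / period_months)

horizon = 36  # three years
ai_cost_drop = compound(10, 12, horizon)  # 10**3 = 1000x cheaper
moore_gain = compound(2, 18, horizon)     # 2**2  = 4x denser

print(f"Over {horizon} months: AI cost falls {ai_cost_drop:.0f}x, "
      f"Moore's Law gains {moore_gain:.0f}x")
```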
When your competitor personally helps prove your narrative, it is no longer just a narrative. The AI industry is living through its own Moore's Law, and DeepSeek is its most radical executor.
What mechanism drives this law? Why have costs kept falling across three generations? The answer lies in DeepSeek's technical path.
Spreading from algorithms to chips
Only 2,048 H800s were used to train V3. Training clusters for models of this class often run to tens of thousands of cards, but DeepSeek used those 2,048 to train a model comparable to GPT-4, relying on a technique that, at the time, no one dared use in large-scale training: FP8 mixed precision.
NVIDIA's Transformer Engine has long supported FP8 training, but before V3 no open-source large model had actually run FP8 through the training stage. DeepSeek was the first to take the plunge. Using a fine-grained quantization strategy that quantizes activations in 1x128 tiles and weights in 128x128 blocks, it cut compute costs substantially without sacrificing model quality.
It is not about the quantity of weapons but about knowing how to use them: 2,048 cards doing the work that takes others tens of thousands.
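A minimal NumPy sketch of the tile-wise scaling idea, under stated assumptions: absolute-max scaling per tile and uniform rounding stand in for real FP8 casting, which the production kernels do in hardware through the Transformer Engine.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def quantize_tiles(x, tile_rows, tile_cols):
    """One scale per tile, as in the 1x128 (activation) / 128x128 (weight)
    scheme described above. Simulated in float32: uniform rounding stands
    in for FP8's non-uniform grid, so this is a sketch, not a kernel."""
    rows, cols = x.shape
    scales = np.empty((rows // tile_rows, cols // tile_cols), np.float32)
    q = np.empty_like(x)
    for i in range(0, rows, tile_rows):
        for j in range(0, cols, tile_cols):
            tile = x[i:i + tile_rows, j:j + tile_cols]
            s = np.abs(tile).max() / FP8_E4M3_MAX  # per-tile scale factor
            scales[i // tile_rows, j // tile_cols] = s
            q[i:i + tile_rows, j:j + tile_cols] = np.round(tile / s)
    return q, scales  # dequantize with q * scale, tile by tile

acts = np.random.randn(4, 256).astype(np.float32)    # activations: 1x128 tiles
wts = np.random.randn(256, 256).astype(np.float32)   # weights: 128x128 blocks
qa, sa = quantize_tiles(acts, 1, 128)
qw, sw = quantize_tiles(wts, 128, 128)
```

The point of the fine granularity is that a single outlier value only poisons the scale of its own tile, not the entire tensor, which is what makes the low-precision format tolerable during training.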
V4 takes this path a big step further by reworking the attention mechanism itself.
The core lies in two new attention structures. CSA (Compressed Sparse Attention) compresses the KV cache of every 4 tokens into 1 entry, then uses a filter called the Lightning Indexer to select only the 512 most relevant blocks from the compressed set for computation.
HCA (Hierarchical Compressed Attention) is more radical still, with a 128x compression ratio; it skips the selection step and computes over everything to capture global structure. The two attention types alternate across layers and, combined with a sliding window that keeps the raw KV of the most recent 128 tokens, sharply reduce the inference cost of one-million-token contexts.
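A sketch of the CSA selection path under labeled assumptions: mean-pooling stands in for the report's (presumably learned) compression, and a plain dot-product score stands in for the Lightning Indexer.

```python
import numpy as np

def csa_select(keys, query, block=4, top_k=512, window=128):
    """Pool every `block` KV entries into one compressed entry, score the
    compressed entries, and attend only over the top_k winners plus the
    raw sliding window of the most recent tokens, as described above."""
    n, d = keys.shape
    usable = (n // block) * block
    compressed = keys[:usable].reshape(-1, block, d).mean(axis=1)  # 4:1

    scores = compressed @ query           # indexer stand-in: relevance score
    keep = np.argsort(scores)[-top_k:]    # 512 most relevant blocks

    selected = compressed[keep]           # sparse global context
    recent = keys[-window:]               # uncompressed local window
    return np.concatenate([selected, recent])

keys = np.random.randn(200_000, 64).astype(np.float32)  # long-context cache
query = np.random.randn(64).astype(np.float32)
print(csa_select(keys, query).shape)  # (640, 64): a sliver of 200k entries
```

Attention then runs over 640 entries instead of 200,000, which is where the long-context savings come from.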
DeepSeek states it outright: "From now on, a one-million-token context will be the standard for all official DeepSeek services." A million-token context used to be a number to show off at product launches; now it is the default parameter.
When a technology becomes cheap enough to be the default option, it stops being a competitive advantage and becomes part of the infrastructure.
The results are directly reflected in the benchmark scores.
For V4-Pro, with 1.6 trillion parameters and 49B activated, the compute needed to process a new token under a one-million-token context is only 27% of V3.2's, and the KV cache is only 10% the size. Overall resource consumption drops to roughly a quarter.
And the top-end Pro Max? It scored 57.9 on the knowledge benchmark SimpleQA, 20 points above the best open-source model; a perfect 120/120 on the Putnam 2025 math competition; and a rank of 23rd among human participants on Codeforces. It topped three completely different task types at once.
A quarter of the compute, and first place. This is not mere cost reduction and efficiency gain; it looks like operating under a different set of physical laws.
However, the most notable variable of V4 is not at the algorithm level.
Section 3.1 of the V4 technical report states: "We have verified this fine-grained expert parallel scheme on both the NVIDIA GPU and Huawei Ascend NPU platforms." The two platforms appear side by side in the verification conclusion. This is not the language of "compatibility adaptation"; it is the posture of "native support".
The core of the scheme is to split MoE communication and computation into finer grains and schedule them in "waves", speeding up general inference by 1.50 to 1.73 times and long-tail small-batch reinforcement learning by up to 1.96 times. Ascend has gone from fallback option to peer option.
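The arithmetic behind wave scheduling can be sketched in a few lines. The timings below are invented; only the overlap pattern, communication for the next wave hidden behind computation of the current one, reflects the scheme described above.

```python
# Toy timeline: chop a MoE batch into waves so the all-to-all communication
# of wave i+1 runs while wave i computes. Costs per wave are hypothetical.
COMM, COMPUTE = 2.0, 3.0  # ms per wave

def serial(waves):
    # No overlap: every wave pays communication plus computation in turn.
    return waves * (COMM + COMPUTE)

def wave_pipelined(waves):
    # Only the first wave's communication is exposed; after that, each step
    # costs whichever of comm/compute is longer, since they run concurrently.
    return COMM + waves * max(COMM, COMPUTE)

for waves in (1, 4, 16):
    s, p = serial(waves), wave_pipelined(waves)
    print(f"{waves:2d} waves: serial {s:5.1f} ms, pipelined {p:5.1f} ms, "
          f"speedup {s / p:.2f}x")
```

With these made-up numbers the speedup approaches (COMM + COMPUTE) / max(COMM, COMPUTE), about 1.67x, as the wave count grows, the same ballpark as the reported 1.50 to 1.73x.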
The migration was not easy. According to an engineer close to DeepSeek, the most time-consuming part of porting V4 from CUDA to CANN was not rewriting operators but aligning precision: making the same model produce identical numerical results on NVIDIA and Ascend hardware took round after round of debugging.
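The article does not describe DeepSeek's actual test harness, but a parity check of this general shape is what "aligning precision" implies: run the same layer on both backends and flag elementwise divergence. Tolerances here are illustrative.

```python
import numpy as np

def check_parity(ref_out, test_out, atol=1e-3, rtol=1e-3):
    """Compare one layer's output across two backends, elementwise."""
    diff = np.abs(ref_out - test_out)
    bound = atol + rtol * np.abs(ref_out)
    bad = diff > bound
    if bad.any():
        worst = np.unravel_index(np.argmax(diff - bound), diff.shape)
        print(f"{bad.sum()} / {diff.size} elements out of tolerance; "
              f"worst at {worst}: ref={ref_out[worst]:.6f} "
              f"vs test={test_out[worst]:.6f}")
    else:
        print("outputs aligned within tolerance")

# Hypothetical usage: ref_out from the CUDA path, test_out from the CANN path.
ref = np.random.randn(4, 4096).astype(np.float32)
check_parity(ref, ref + np.float32(1e-5) * np.random.randn(4, 4096).astype(np.float32))
```

Bitwise-identical results are harder still: they require matching accumulation order and intermediate precision across the two stacks, which is why the debugging reportedly took so long.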
DeepSeek had already hit problems training on the 910C: gradient synchronization timed out on the 1,024-card cluster, and the old CANN version lacked key operators, leaving stability short. The 950PR targets exactly these gaps: inter-chip bandwidth has tripled, and CANN Next ships with FlashAttention and PagedAttention operators built in.
True technical migration is not swapping one brand of chip for another; it is making two entirely different hardware systems produce the same mathematical results. DeepSeek has paved the road, and the threshold for those who follow is much lower.
Huawei's strategy is equally clear. The Ascend 950PR delivers 2 PFLOPS at FP4 precision and 2 TB/s of inter-chip interconnect bandwidth. CANN Next is positioned not as a from-scratch rebuild but as a drop-in replacement: its new SIMT programming model closely parallels CUDA, letting developers keep their CUDA programming habits and ultimately compile optimized programs for Ascend.
In a podcast interview with Patel, Jensen Huang revealed what NVIDIA really fears. It is not that China can build good models; it is that good models stop using CUDA as the default optimization starting point.
NVIDIA's moat has never been the raw compute of its GPUs but the software ecosystem in which CUDA has been the de facto standard for nearly two decades. Almost every mainstream AI framework, operator library, and open-source model optimizes for CUDA first. DeepSeek's native Ascend adaptation punctures the very start of that chain: there is now at least one real, operable non-CUDA path, verified by a top-tier model.
When the world's best open-source model proves out a complete non-CUDA path, the two-decade ecosystem barrier shows its first crack. The efficiency path has spread from algorithms to chips, reaching exactly the spot NVIDIA fears most.
Computing power becomes like water and electricity
In its research note after the V4 release, CITIC Construction Investment drew a dividing line: R1 answered the question "Can China build world-class models?", while V4 answers two more specific ones: "Can it keep evolving under the compute blockade?" and "Can large models become profitable enterprise-grade products?"
The academic world has already answered the first. In September 2025, the R1 paper made the cover of Nature after eight experts reviewed it item by item, the first mainstream large model in the world to pass peer review at a top-tier journal. "Can China do it?" is settled.
The second question is what V4 really needs to answer.
The tech giants are fighting for the market in the most traditional way. Over the 2026 Spring Festival, ByteDance, Alibaba, and Tencent spent nearly 10 billion yuan acquiring users: Qianwen put 3 billion yuan into "milk tea gift packages", Doubao appeared on the CCTV Spring Festival Gala, and Yuanbao handed out 1 billion yuan in cash red envelopes.
According to QuestMobile data, as of February 2026, Doubao had 103 million active users, Qianwen had 32.45 million, and DeepSeek had 24.77 million, ranking third.
DeepSeek's situation, however, differs from the giants'. Its daily active users soared from 120 million to about 200 million, growth of over 67% in half a year, while its compute grew only about 8.3%. Average daily compute costs exceed 10 million yuan, and there have been three large-scale outages this year, all during evening peak hours.
67% user growth against 8.3% compute growth: that gap is why DeepSeek must take the efficiency path, and why V4 must run on Ascend.
The financing signals are shifting too. At the height of DeepSeek's popularity in early 2025, Liang Wenfeng turned away every investment institution. He once proposed a return-cap clause modeled on the investment agreement between OpenAI and Microsoft; no institution accepted it, and he has not met with investors since.
On April 17 of the following year, DeepSeek was reported to be seeking financing at a valuation of at least $10 billion; five days later, Reuters reported that Alibaba and Tencent were in talks to invest, with the valuation raised above $20 billion. An investor close to DeepSeek put it this way: "This is not a target you can get into just by bidding high enough. In Liang Wenfeng's screening criteria, money is the least important factor."
Turning everyone away one year, courted by everyone the next. What changed is not Liang Wenfeng's attitude but DeepSeek's position: it has moved from technical validation to the commercialization inflection point.
The chain reaction from DeepSeek's "chip swap" to Ascend is spreading. Alibaba, ByteDance, and Tencent have placed bulk orders for the Ascend 950PR, totaling hundreds of thousands of chips, and the concentrated procurement has pushed the chip's price up 20% in recent weeks. When industry leaders vote with their feet for the non-CUDA path, the efficiency route stops being one company's choice and becomes an industry consensus.
The commercialization data also confirms the inflection point. Zhipu's 2025 revenue was 724 million yuan, up 132% year on year, and the annual recurring revenue of its MaaS API platform reached 1.7 billion yuan, up 60-fold. Large models are turning from a cash-burning story into a profitable business.
When someone in the industry starts making money, the "AI bubble" narrative has to change.
Miller's judgment in Barron's offers another lens. The gap between the United States and China, he argues, lies not in talent or innovation but in the compute poured into training. That is classic stock logic: whoever has more cards wins.
DeepSeek, however, runs on incremental logic: make each card produce more. The successful run of V4 on the Ascend platform is the clearest proof of that logic.