
DeepSeek's cost-cutting secrets revealed: two tricks that squeeze inference deployment to the limit, leaving the computing power for internal AGI research.

QbitAI, 2025-07-05 10:33
DeepSeek's own share has fallen to just 16%, with the traffic effectively handed to third-party hosting services for free.

On the 128th day since DeepSeek R1's debut, it has thoroughly disrupted the entire large model market!

First, it single-handedly drove down the price of reasoning models: when OpenAI updated o3 in June, the price was cut to just 20% of o1's.


Second, usage of DeepSeek models hosted on third-party platforms has skyrocketed, up nearly 20x since the models were first released, benefiting a large number of cloud providers.


However, the market share of DeepSeek's own website and API has kept declining, missing out on the steady growth that AI products overall enjoyed in the first half of the year.


The data above comes from a report by SemiAnalysis that examines DeepSeek's impact on the AI model race and the current distribution of market share.


Unveiling DeepSeek's Cost-Reduction Secrets

DeepSeek was indeed a sensation when it first launched, but four months on, the picture has become more delicate.

The data show that traffic to DeepSeek's own website and API has fallen rather than grown, and its market share keeps sliding.

By May, only 16% of the tokens generated by DeepSeek models across the web came from DeepSeek's own platform.


Traffic to the web-based chatbot has also dropped sharply, even as the web versions of other major large models soared over the same period.


Both DeepSeek V3 and R1 have since been updated; they are more capable than the January versions and cheaper. So why are users leaving?

Behind this "flowers bloom inside the wall, but the fragrance is enjoyed outside" phenomenon, with DeepSeek's models thriving everywhere except on its own platform, there is quite a story.

SemiAnalysis points out that, to keep costs as low as possible, DeepSeek has made significant compromises on service quality.

On DeepSeek's official platform, users often wait several seconds before the first word appears, a delay captured by the time-to-first-token (TTFT) metric.
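
For readers who want to reproduce this kind of measurement, the sketch below times how long a streaming request takes to return its first token. It assumes an OpenAI-compatible endpoint; the base URL, model name, and API key are placeholders for whichever provider is being tested, not values taken from the report.

```python
import time

from openai import OpenAI  # pip install openai

# Placeholder endpoint and credentials; point these at the provider under test.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

def time_to_first_token(prompt: str, model: str = "deepseek-chat") -> float:
    """Seconds between sending the request and receiving the first content chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,   # streaming lets us observe the first token on its own
        max_tokens=64,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

print(f"TTFT: {time_to_first_token('Say hello.'):.2f} s")
```

Averaging this over a batch of prompts per provider gives the kind of first-token-latency comparison described here.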

Other platforms, by contrast, are generally pricier but respond much faster, some with almost zero latency.

On platforms such as Parasail or Friendli, $3-4 buys a quota of one million tokens with almost no latency.

For users who want a larger, more stable provider, Microsoft Azure charges 2.5 times DeepSeek's official price, but its latency is a full 25 seconds lower.

Put another way, even at comparable latency, DeepSeek's official service is not the cheapest provider of DeepSeek models.


If bubble size in this chart represents the context window, another of DeepSeek's price-performance trade-offs comes into view.

With limited inference compute, DeepSeek offers only a 64K context window, one of the smallest among mainstream model providers.

In programming scenarios that require reading an entire codebase, 64K is simply not enough, and users have to turn to third-party platforms.

At the same price, platforms such as Lambda and Nebius can provide more than 2.5 times the context window.
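
As a rough illustration of why 64K falls short for whole-codebase work, the sketch below estimates a repository's token count and compares it against that limit. DeepSeek ships its own tokenizer, so using tiktoken's cl100k_base here is only a stand-in assumption and the counts are approximate.

```python
import pathlib

import tiktoken  # pip install tiktoken; cl100k_base is a rough proxy, not DeepSeek's own tokenizer

CONTEXT_LIMIT = 64_000
enc = tiktoken.get_encoding("cl100k_base")

def estimate_repo_tokens(root: str, exts: tuple[str, ...] = (".py", ".md")) -> int:
    """Sum the approximate token counts of matching source files under root."""
    total = 0
    for path in pathlib.Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            total += len(enc.encode(path.read_text(errors="ignore")))
    return total

tokens = estimate_repo_tokens(".")
print(f"~{tokens} tokens; fits in a 64K context: {tokens < CONTEXT_LIMIT}")
```

Even a mid-sized repository easily blows past 64K, which is why such workloads end up on providers with larger windows.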


DeepSeek also batches many user requests together for processing. That lowers the cost per token, but it also lengthens each user's wait.
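
To make that trade-off concrete, here is a toy back-of-the-envelope model, not DeepSeek's actual scheduler, in which a fixed per-batch overhead gets amortized across requests while arriving requests wait for the batch to fill. All the numbers are invented purely for illustration.

```python
def per_request_stats(batch_size: int,
                      batch_overhead_ms: float = 200.0,  # assumed fixed cost per batched forward pass
                      per_token_ms: float = 10.0,        # assumed marginal cost per generated token
                      arrival_gap_ms: float = 50.0,      # assumed gap between arriving requests
                      tokens_per_request: int = 100) -> tuple[float, float]:
    """Return (effective compute cost per token in ms, extra queueing delay in ms)."""
    # Amortizing the fixed overhead over more requests lowers the per-token cost...
    cost_per_token = per_token_ms + batch_overhead_ms / (batch_size * tokens_per_request)
    # ...but the earliest arrival has to wait for the rest of the batch to show up.
    extra_wait = (batch_size - 1) * arrival_gap_ms
    return cost_per_token, extra_wait

for b in (1, 8, 64):
    cost, wait = per_request_stats(b)
    print(f"batch={b:3d}  cost/token ~ {cost:.3f} ms  added wait = {wait:.0f} ms")
```

Larger batches push the per-token cost toward its floor while the added wait grows linearly, which is the direction DeepSeek appears to have turned the dial: throughput over responsiveness.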

The Second Half of the Large Model Competition: Enhancing the Intelligence of Each Token

To be clear, these cost-cutting strategies are all deliberate decisions by DeepSeek.

For now, DeepSeek seems largely uninterested in the user experience, in making money from users, or in serving large token volumes through chat apps or API services. Its focus is on reaching AGI.

These optimization strategies show that DeepSeek devotes as little compute as possible to external inference services, reserving the bulk of its resources for internal R&D.

At the same time, paired with its open-source strategy, DeepSeek lets other cloud services host its models, winning influence and cultivating an ecosystem at once.

After all, the AI race ultimately comes down to a race for computing resources.

Under DeepSeek's influence, Claude has also started slowing down to ease its compute shortage, though it still tries to preserve the user experience for the sake of revenue.

Since the release of Claude 4 Sonnet, its output speed has dropped by 40%, yet it remains much faster than DeepSeek.

In addition, Claude is designed to produce more concise responses: to answer the same question, DeepSeek and Gemini may use three times as many tokens.


All signs indicate that large model providers are improving their models in multiple dimensions.

It is no longer just about raising the ceiling of a model's intelligence, but about increasing the intelligence each token delivers.

Reference link: [1]https://semianalysis.com/2025/07/03/deepseek-debrief-128-days-later/#speed-can-be-compensated-for

This article is from the WeChat official account "QbitAI" (量子位), which focuses on cutting-edge technology. It is published by 36Kr with authorization.