When it comes to saving money, I have nothing but admiration for Liang Wenfeng.
The most common complaint about DeepSeek is that its servers keep crashing. From now on, however, lag and downtime may become a thing of the past.
The reason is that Liang Wenfeng has co-authored a paper titled "DSpark: Speculative Decoding and Semi-Autoregressive Generation Based on Confidence Scheduling". Following DeepSeek's naming tradition, DSpark should be read as D·Spark, not DS·park.
This is the 12th paper Liang Wenfeng has co-authored since "DeepSeek LLM" was published in 2024. The DSpark paper also echoes the master's thesis he wrote in 2010.
DSpark is like fitting DeepSeek with an accelerator. For users, the experience becomes fast, stable, and crash-free.
For responses of the same quality, generation is 60% to 80% faster. A response that used to take 10 seconds now comes out in five or six.
The most crucial thing is that during peak hours, DeepSeek will no longer keep "spinning".
So how good is DSpark? Let me explain.
What exactly is DSpark, and what old problems of DeepSeek does it solve?
When a large model generates text, it is essentially playing a word-guessing game. Every time it writes a word, it has to go back over all the words written so far and compute what the next word should be.
For each new word, the AI runs the whole process again. To write 100 words, it has to reprocess its own earlier output 99 times. In academic terms, this kind of "self-regression" is called autoregressive generation.
The whole process is like the present self waiting on the past self: until the previous step is finished, the next step cannot begin.
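The loop described above can be sketched in a few lines. This is only an illustration: `toy_next_word` and `SCRIPT` are hypothetical stand-ins for a real model, which would compute the next word from the context instead of reading it from a list.

```python
from typing import Callable, List, Tuple

# Hypothetical "knowledge" of the toy model; a real model computes this.
SCRIPT = ["the", "cat", "sat", "on", "the", "mat"]


def toy_next_word(context: List[str]) -> str:
    """Stand-in for a large model: reads the whole context, returns one word."""
    return SCRIPT[len(context)] if len(context) < len(SCRIPT) else "<eos>"


def generate(next_word: Callable[[List[str]], str],
             max_words: int) -> Tuple[List[str], int]:
    """Autoregressive generation: one model call per word, strictly in order."""
    words: List[str] = []
    model_calls = 0
    for _ in range(max_words):
        word = next_word(words)   # the model sees everything written so far
        model_calls += 1
        if word == "<eos>":
            break
        words.append(word)        # only now can the next step start
    return words, model_calls


words, model_calls = generate(toy_next_word, max_words=10)
```

Six words cost seven sequential model calls here (the last one only returns the end marker), and none of those calls can start before the previous one has finished.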
So in the past few years, the industry has been thinking about the same thing: Can the model guess a series of words at once?
This idea is the core mechanism in the DSpark paper: speculative decoding.
It works as follows. Take a model that runs fast but performs only moderately well and use it as a draft model. Let it quickly guess the next several words, then hand the whole run of guesses to the large model to verify in one pass.
The large model takes a quick look. Every consecutive correct guess is kept. At the first wrong guess, the large model writes the correct word itself, and the draft model resumes guessing from there.
This guarantees that everything in the output has been approved by the large model, while still running faster than producing one word at a time.
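Here is a minimal sketch of one draft-then-verify round, under simplifying assumptions: both models are hypothetical toy functions that always pick a single "best" word, and verification is written as a loop for readability, whereas a real system would check all drafted positions in one batched pass of the large model.

```python
from typing import Callable, List

NextWord = Callable[[List[str]], str]

# Hypothetical toy models: the draft disagrees with the target on the 4th word.
TARGET_TEXT = ["the", "cat", "sat", "on", "the", "mat"]
DRAFT_TEXT = ["the", "cat", "sat", "in", "the", "mat"]


def target(context: List[str]) -> str:
    return TARGET_TEXT[len(context)] if len(context) < len(TARGET_TEXT) else "<eos>"


def draft(context: List[str]) -> str:
    return DRAFT_TEXT[len(context)] if len(context) < len(DRAFT_TEXT) else "<eos>"


def speculative_round(target: NextWord, draft: NextWord,
                      context: List[str], k: int) -> List[str]:
    """Return the words to append after one draft-and-verify round."""
    # 1. The cheap draft model guesses k words in a row.
    guesses: List[str] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))

    # 2. The large model checks the guesses from left to right.
    accepted: List[str] = []
    for guess in guesses:
        correct = target(context + accepted)
        if guess != correct:
            accepted.append(correct)  # first wrong guess: target writes it itself
            break
        accepted.append(guess)        # correct guess: keep it as is
    return accepted


new_words = speculative_round(target, draft, context=[], k=4)
```

In this toy run the first three guesses are kept and the fourth is replaced by the target's own word, so one verification round yields four approved words instead of one.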
It is generally believed in the industry that there are two types of speculative decoding.
The first is the "honest" approach. The draft model also guesses one word at a time, rereading the preceding text before each new guess. The advantage is higher draft quality. The disadvantage is that drafting is slow, so the overall speed is barely better than letting the large model write by itself.
The second approach is to guess all the subsequent words in one shot. It is fast, but when guessing each word it does not take the full preceding sentence into account. It only looks at the previous word.
This leads to the situation that at the beginning, it's okay, but as the guessing progresses, the output quality will become lower.
The paper calls this phenomenon "suffix attenuation": The accuracy rate of the first word is okay, but it drops significantly for the second word. By the fifth or sixth word, it's basically just random guessing.
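A little arithmetic shows why this matters. The sketch below assumes, which the article does not state, that each number is the chance a drafted word is right given that every word before it was right. The four accuracies are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def expected_accepted(accuracies: Sequence[float]) -> float:
    """Expected number of drafted words kept by the large model.

    A word only counts if every earlier word was also accepted, so the
    chance of reaching position i is the product of the accuracies so far.
    """
    expected = 0.0
    still_alive = 1.0
    for p in accuracies:
        still_alive *= p          # chance the run of correct guesses survives
        expected += still_alive
    return expected


four_words = expected_accepted([0.9, 0.8, 0.6, 0.3])   # about 2.18 words
two_words = expected_accepted([0.9, 0.8])              # about 1.62 words
```

With these illustrative numbers, doubling the draft from two words to four adds only about half a word of accepted output, while the large model still has to check all four positions.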
The core idea of DSpark is called semi-autoregressive generation. Simply put, it combines the two approaches above.
Step one: Guess all the subsequent words at an extremely fast speed. After guessing, go back and check to see if there are any grammatical errors or typos.
Step two: DSpark assigns a "reliability score" to each word. For example, the first word gets 90 points, the second 80, the third 60, and the fourth 30. But there is a catch. Once the scores are assigned, DSpark knows which words are likely wrong, yet correcting them itself would mean falling back to the original autoregressive method and throwing away the hard-won efficiency gain.
So DSpark proposes a different method. It measures in advance how fast the large model runs at different batch sizes, then ranks the draft words of every request in descending order of reliability score.
It first takes the highest-scoring batch across all requests and hands it to the large model for verification.
This step is fast because the batch is small. Then it asks itself: is it worth adding the second batch? Adding it costs the large model a little more time, but about 80% of the words in this batch are correct, so it yields hundreds more correct results. Dividing the extra time by the extra correct words gives an efficiency value. If the trade is worth it, the batch is added. The same calculation is repeated for the third batch, where the accuracy is 60%, and so on.
The cut-off depends on how busy the server currently is. When the server is idle, all the drafts are sent for verification to collect as many correct results as possible.
When the large model is very busy, only the first few high-scoring batches are sent. The words that are likely to be wrong are held back so they do not get in the way, and the time saved goes to serving more users.
The whole process is called confidence-scheduled verification.
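The scheduling decision can be sketched as a greedy loop. This is a simplified reading of the description above, not the paper's algorithm: the function name, the chunk size, the timing model `step_time_ms`, and the two thresholds are all hypothetical, and the sketch treats each score as an independent chance of being correct.

```python
from typing import Callable, Sequence


def choose_verify_count(confidences: Sequence[float],
                        step_time_ms: Callable[[int], float],
                        max_ms_per_correct_word: float,
                        chunk: int = 100) -> int:
    """Return how many of the highest-scoring draft words to verify."""
    ranked = sorted(confidences, reverse=True)
    chosen = min(chunk, len(ranked))      # the top batch is always verified
    while chosen < len(ranked):
        batch = ranked[chosen:chosen + chunk]
        extra_time = step_time_ms(chosen + len(batch)) - step_time_ms(chosen)
        extra_correct = sum(batch)        # expected number of correct words
        if extra_correct <= 0 or extra_time / extra_correct > max_ms_per_correct_word:
            break                         # not worth the GPU time, stop here
        chosen += len(batch)
    return chosen


# Hypothetical pre-measured profile: 5 ms fixed cost plus 0.01 ms per word.
def step_time_ms(batch_size: int) -> float:
    return 5.0 + 0.01 * batch_size


# 400 draft words across all requests, scored 0.9 / 0.8 / 0.6 / 0.3.
scores = [0.9] * 100 + [0.8] * 100 + [0.6] * 100 + [0.3] * 100

idle_count = choose_verify_count(scores, step_time_ms, max_ms_per_correct_word=0.05)
busy_count = choose_verify_count(scores, step_time_ms, max_ms_per_correct_word=0.015)
```

With a generous threshold (an idle server) all 400 drafted words are verified. With a strict threshold (a busy server) the loop stops after the 0.8 batch and verifies only 200, leaving the low-score words out of the large model's batch.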
There have been many acceleration schemes before, but they share a common problem: they are extremely fast when tested with a single user and fall apart under high concurrency.
Currently, DeepSeek lags and crashes during the evening peak hours.
The root cause is that during peak hours there are many user requests, and the pressure on the GPU's batch processing is extremely high. Yet the previous MTP-1 speculative decoding scheme wastes a large amount of computing power verifying tokens that were probably guessed wrong.
These tokens are little more than random guesses by the draft model, and the large model rejects them at a glance. But the rejection itself has already consumed precious GPU cycles.
Effective throughput drops sharply. Requests pile up, the queue grows, and what users see is lag or even a page that fails to load.
After the deployment of DSpark, this problem should be alleviated.
Measured data shows the difference. Under a strict low-latency requirement, for example V4-Flash guaranteeing each user 120 words per second, the previous MTP-1 system can support hardly any concurrency before breaking down, while DSpark still maintains a throughput more than six times higher.
In the more common medium-load scenario, where each user must see 80 words per second, the total throughput of a single GPU rises with DSpark from 10,000 tokens per second to 15,100 tokens per second, an increase of 51%.
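The 51% figure, and what it means for the number of users one GPU can serve, can be checked with a few lines of arithmetic. The user counts below rest on an assumption the article does not make explicit: that "80 words per second" means 80 tokens per second per user.

```python
def percent_gain(before: int, after: int) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) * 100 / before


def users_served(gpu_tokens_per_s: int, per_user_tokens_per_s: int) -> int:
    """Whole users one GPU can serve at the required per-user speed."""
    return gpu_tokens_per_s // per_user_tokens_per_s


gain = percent_gain(10_000, 15_100)        # throughput gain in percent
users_before = users_served(10_000, 80)    # users per GPU before DSpark
users_after = users_served(15_100, 80)     # users per GPU with DSpark
```

Under that assumption, the same GPU goes from serving 125 concurrent users to 188 at the same per-user speed.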
How much can the cost be reduced, and will the quality of the answers be sacrificed?
In the AI industry, training cost is a one-time expense, while inference cost is ongoing.
What does that mean? When you train a large model, whether you spend hundreds of millions or billions of dollars, the money is spent once.
Inference is different. After the model is launched, every time a user asks a question, the GPU has to run once. This cost occurs 24/7, and the more users there are, the more times it runs. It never stops.
This means that whoever can reduce inference cost can make money. Conversely, no matter how powerful the model is, if inference cost cannot be controlled, the larger the model scales, the faster its maker goes bankrupt.
With the same number of GPUs, DSpark can increase the generation speed of each user by 60% to 85% without changing the hardware at all.
A response that used to take 10 seconds now comes out in five or six seconds.
DeepSeek also presents a very extreme scenario. When a hot event occurs and a large number of users flood in at the same time, if the previous system cannot handle it, either users will give up due to long queues or the system will crash directly. Expanding the capacity takes time, and you can't just add GPUs immediately.
DSpark uses dynamic scheduling. When the load is high, it automatically shortens the verification length so that it does not occupy critical batch-processing capacity. In this way, it can absorb traffic spikes without expanding capacity.
Then another question arises. Although it's faster, will DeepSeek cut corners? Will the quality of the answers decline?
The answer is zero loss.
This follows from the mathematics of speculative decoding itself. The rejection sampling mechanism guarantees that the probability distribution of every token finally output is exactly the same as when the large model writes one word at a time. So, mathematically, the quality cannot decline.
The original DSpark paper states: "the acceptance rule preserves the target distribution exactly, speculative decoding accelerates generation without any quality loss." In other words, because the acceptance rule leaves the target distribution untouched, the speed-up costs nothing in output quality.
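The claim can be checked on a toy example. The sketch below uses the standard acceptance rule from the speculative sampling literature: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the leftover mass max(0, p - q). That DSpark uses exactly this rule is an assumption here, and the three-token vocabulary and both distributions are invented for illustration.

```python
from fractions import Fraction
from typing import Dict

Dist = Dict[str, Fraction]


def speculative_output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the emitted token when the draft samples from q
    and the target distribution is p (same vocabulary, q > 0 everywhere)."""
    # Draft proposes t with prob q[t]; it is accepted with prob min(1, p[t]/q[t]).
    accepted = {t: q[t] * min(Fraction(1), p[t] / q[t]) for t in p}
    reject_prob = 1 - sum(accepted.values())

    # On rejection, the target resamples from the normalised leftover mass.
    leftover = {t: max(Fraction(0), p[t] - q[t]) for t in p}
    total_leftover = sum(leftover.values())

    out: Dist = {}
    for t in p:
        resampled = reject_prob * leftover[t] / total_leftover if total_leftover else Fraction(0)
        out[t] = accepted[t] + resampled
    return out


# Hypothetical target (p) and draft (q) distributions over three tokens.
p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(1, 5), "b": Fraction(1, 2), "c": Fraction(3, 10)}

result = speculative_output_distribution(p, q)
```

Even though the draft distribution is quite different from the target, `result` comes out identical to `p`, computed with exact fractions rather than sampling. A poor draft model lowers the acceptance rate and therefore the speed, but under this rule it does not change what the large model ends up saying.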
Moreover, the paper has conducted offline accuracy tests in three fields: mathematical reasoning, code generation, and daily conversations, and there is no statistically significant difference compared with the original model.
After online deployment, there has been no user feedback indicating a decline in the quality of the answers.
The draft model itself is very small, accounting for less than 10% of the total computation. It does add some load to the server, but that load is negligible next to the measured 51% improvement.
DeepSeek has always been known for its low cost. After reducing the inference cost by 40%, DeepSeek has more room for price cuts.
Its API pricing was already the lowest in the industry. Now that the cost is further reduced, the token price may also decrease. It is even possible to further increase the quota for free users.
More crucially, this time DeepSeek not only released the model weights but also open-sourced the entire DeepSpec training framework.
DeepSpec is a unified training toolbox specifically used to train draft models for speculative decoding. That is to say, you can use this set of tools to train draft models for your own models such as Qwen3 and Gemma.
This has further lowered the baseline of the inference cost for the entire industry.
A 16-year commitment to saving costs
In 2010, Liang Wenfeng was a master's student at Zhejiang University. His master's thesis was titled "Research on Target Tracking Algorithm Based on Low-cost PTZ Cameras".
Today the title seems very "Liang Wenfeng".
At the time, laboratories working on computer-vision target tracking typically used industrial cameras costing tens of thousands of yuan, with high precision and fine controllability. Liang Wenfeng did not buy them. He used ordinary consumer-grade pan-tilt-zoom cameras that cost only a few hundred yuan.
His argument was that algorithms can make up for a gap in hardware. By optimizing a tracking algorithm he developed himself, he got the cheap cameras to track almost as accurately as the expensive equipment.
16 years have passed, and Liang Wenfeng is still obsessed with using algorithms to save costs on hardware. It can be said that he has remained true to his original intention.
Why do other large-model companies do everything they can to improve performance, while DeepSeek wants to save money? Because the money is Liang Wenfeng's own.
After DeepSeek completed its financing, foreign media reported that in the nearly three years since its founding, DeepSeek had been funded entirely by the profits of Magic Square Quant (High-Flyer), the quantitative fund Liang Wenfeng founded, and had repeatedly turned down outside investment.
Magic Square Quant posted an average return of 56.55% in 2025, with annual revenue of about 8.6 billion yuan. Liang Wenfeng holds 85% of the shares, and his annual dividend runs into the billions. His personal assets are estimated at between 50 billion and 100 billion yuan. In the first financing round of more than 50 billion yuan launched this year, Liang Wenfeng personally put in 20 billion yuan, about 40% of the total, making him the largest single investor.
Money from external investors does not go directly into the DeepSeek entity. It is first injected into a limited partnership in which Liang Wenfeng serves as general partner. External investors become limited partners, with only the right to receive returns and to access financial information, and no voting rights. All shares are locked for five years, during which transfer and withdrawal are prohibited.
At DeepSeek, Liang Wenfeng plays the roles of investor, manager, and researcher simultaneously.
Every penny of cost saved goes directly into Liang Wenfeng's own pocket.
Faced with the choice between "buy 100 more GPUs" and "let the team do engineering optimization", most people would choose the former. It is fast, OpenAI and Anthropic have already shown the way, and since the money belongs to investors, there is no reason to feel the pain.
Liang Wenfeng chooses the latter, because he knows better than anyone how many tokens a GPU has to process to pay for itself.
Combining these three roles in one person creates a closed decision loop that is extremely rare in the AI industry.
The researcher proposes that "costs can be saved", the manager decides that "costs should be saved", and the investor confirms that "he is willing to save costs even if he has to pay for it himself". There is no reporting up the hierarchy and no cross-departmental coordination.
DSpark is the latest product of this decision chain.
This article is from the WeChat official account "Letter AI", author: Miao Zheng. Republished by 36Kr with permission.