Just now, DeepSeek V4 updated DSpark, boosting inference speed by 80%
DeepSeek V4 has just received an update.
A new speculative decoding framework, DSpark, has been launched, and DeepSpec, the full-stack speculative decoding codebase that supports it, was open-sourced at the same time.
DeepSeek-V4-Pro-DSpark is not a model with a new architecture. It adds a speculative decoding module on top of DeepSeek-V4-Pro. The update is about engineering rather than an advance in the model's own capabilities.
DSpark has been deployed on live production traffic for DeepSeek-V4 (Flash and Pro), where it significantly speeds up inference.
Technical report: "DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation"
Link to the technical report: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf
DSpark aims to address the latency and throughput bottlenecks that LLM inference faces in production, especially under high concurrency. In short, it combines high-throughput parallel drafting with adaptive, load-aware verification.
Speculative decoding is a technique that accelerates large language model inference without changing the model's output distribution. The core idea is to have a lightweight "draft model" propose several candidate tokens in advance, which the target model then verifies and accepts as a batch. This turns serial token-by-token generation into parallel batch verification and significantly reduces end-to-end latency.
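To make the "without changing the output distribution" point concrete, here is a minimal sketch of the standard speculative sampling verification rule. It is a generic illustration, not code from DSpark or DeepSpec. The function names and the use of plain Python lists as probability distributions are assumptions made for this example.

```python
import random

def sample(weights, rng):
    """Draw one index in proportion to the (unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Generic speculative-sampling verification (illustrative only).

    draft_tokens: k token ids proposed by the draft model
    draft_probs:  k distributions q_i that the draft tokens were sampled from
    target_probs: k + 1 distributions p_i from the target model
                  (the last one is used for the extra "bonus" token)
    Returns the accepted prefix plus exactly one token chosen by the target.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0).
        # This correction is what keeps the output distributed as p.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take one more token from the target.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out
```

Each call emits at least one token, so the worst case matches ordinary decoding in tokens per target pass, while a fully accepted draft of k tokens yields k + 1 tokens from a single verification pass.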
Building on this, DSpark's first innovation is a semi-autoregressive generation architecture. It keeps the high-throughput advantage of a parallel draft model and adds a lightweight serial module that models the dependencies between tokens within a block. This mitigates the decay in acceptance rate that parallel draft models tend to suffer at later positions.
The second innovation is hardware-aware, confidence-scheduled verification. Earlier speculative decoding schemes typically sent every generated draft token for verification. Under high load, tail tokens that are very likely to be rejected waste scarce batch compute. DSpark adds a confidence head that estimates each token's survival probability. Combined with a hardware-aware prefix scheduler, the system dynamically chooses a verification length for each request based on the engine's real-time throughput characteristics, spending compute only on the tokens with the highest expected return.
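One way to picture confidence-based prefix scheduling is the sketch below. It is an illustrative assumption, not the algorithm from the DSpark report. It supposes that each draft token comes with a conditional acceptance confidence, that a token's survival probability is the product of the confidences up to its position, and that the engine can afford a fixed number of verification slots (budget) for the whole batch. All names are hypothetical.

```python
import heapq

def survival_probs(confidences):
    """Cumulative product: probability that token i and all earlier ones survive."""
    out, s = [], 1.0
    for c in confidences:
        s *= c
        out.append(s)
    return out

def schedule_prefix_lengths(batch_confidences, budget):
    """Greedily spend `budget` verification slots on the draft tokens with
    the highest survival probability across the batch.

    Survival probabilities never increase along a request, so the chosen
    tokens always form a prefix of each request's draft.
    Returns the verification length chosen for each request.
    """
    survivals = [survival_probs(c) for c in batch_confidences]
    lengths = [0] * len(survivals)
    # Max-heap (via negated keys) holding the next unscheduled token per request.
    heap = [(-s[0], r) for r, s in enumerate(survivals) if s]
    heapq.heapify(heap)
    while heap and budget > 0:
        _, r = heapq.heappop(heap)
        lengths[r] += 1
        budget -= 1
        nxt = lengths[r]
        if nxt < len(survivals[r]):
            heapq.heappush(heap, (-survivals[r][nxt], r))
    return lengths
```

Under these assumptions, the expected number of accepted draft tokens in a prefix equals the sum of its survival probabilities, so the greedy choice maximizes expected accepted tokens for a given budget. In a real system the budget would presumably come from the engine's measured throughput rather than being a constant.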
To run on production infrastructure, DSpark's scheduler works asynchronously so that it stays compatible with zero-overhead scheduling (ZOS) and continuous CUDA graph replay. It uses the historical predictions of the previous two steps to determine the current dynamic truncation length, which hides scheduling latency and avoids GPU pipeline stalls while preserving the target model's output distribution exactly.
In tests spanning mathematical reasoning, code generation, and everyday conversation, DSpark significantly outperforms both the state-of-the-art autoregressive draft model (Eagle3) and the parallel draft model (DFlash). For example, on Qwen3 target models (4B, 8B, 14B), its average acceptance length is 26.7% to 30.9% higher than Eagle3's and 16.3% to 18.4% higher than DFlash's.
Compared with the single-token production baseline (MTP-1) deployed in the previous generation, and at the same overall throughput, DSpark increases per-user generation speed by 60%-85% on the Flash model and 57%-78% on the Pro model.
DeepSpec was open-sourced alongside DSpark. It is a full-stack codebase for training and evaluating speculative decoding draft models, and it serves as the open-source infrastructure behind DSpark and other cutting-edge algorithm implementations. It includes data preparation tools, draft model implementations, training code, and evaluation scripts.
DeepSpec divides the overall process into three stages: data preparation, training, and evaluation. The three stages need to be run in sequence, and the output of the previous stage will be the input of the next stage.
In the data preparation stage, you download the prompt data, regenerate the answers with the target model using an inference engine, and build the target cache. Note that with the default Qwen/Qwen3-4B configuration the target cache can reach about 38 TB, so check your storage capacity before starting.
The training stage is started by running bash scripts/train/train.sh. This script calls train.py and starts one worker for each visible GPU. Users select an algorithm and target model configuration from the config/ directory by specifying config_path. Training settings can also be adjusted by overriding config_path and target_cache_dir, and by using --opts to modify individual configuration fields.
On the hardware side, DeepSpec's default configuration and scripts target a single node with 8 GPUs. With fewer GPUs, users need to reduce the devices listed in CUDA_VISIBLE_DEVICES accordingly.
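As a rough illustration of the GPU note above, the sketch below builds an environment that exposes only a chosen subset of GPUs. The helper name training_env is hypothetical. Whether scripts/train/train.sh honors a CUDA_VISIBLE_DEVICES value inherited from the environment, or sets the variable itself and must be edited, is an assumption to verify against the repository.

```python
import os

def training_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only `gpu_ids`.

    CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env

# Possible usage on a 4-GPU machine (assumes the script reads the inherited variable):
# import subprocess
# subprocess.run(["bash", "scripts/train/train.sh"],
#                env=training_env([0, 1, 2, 3]), check=True)
```

Since the script starts one worker per visible GPU, exposing four devices this way would be expected to start four workers.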
The evaluation stage is started by running bash scripts/eval/eval.sh. The evaluation script uses the trained draft model checkpoint to measure draft-token acceptance on multiple speculative decoding benchmark tasks. The evaluation datasets currently listed in the project are GSM8K, MATH500, AIME25, HumanEval, MBPP, LiveCodeBench, MT-Bench, Alpaca, and Arena-Hard-v2, covering task types such as mathematical reasoning, code generation, dialogue, and general Q&A.
In terms of algorithms, DeepSpec currently has three built-in draft models: DSpark, DFlash, and Eagle3. In terms of target model series, the project currently supports Qwen3 and Gemma.
By open-sourcing DeepSpec, DeepSeek consolidates speculative decoding engineering practices that were previously scattered across research teams into a reproducible, extensible, standardized toolchain. Researchers and engineers who want to accelerate inference for their own large models can train customized draft models directly on a mature framework and skip much of the repetitive infrastructure work.
Reference links:
https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf
https://github.com/deepseek-ai/DeepSpec
This article is from the WeChat official account "Machine Intelligence" (ID: almosthuman2014), written by Zenan and Yang Wen. It is published by 36Kr with authorization.