A paper signed by Liang Wenfeng: DeepSeek's most powerful open-source agent model makes a big splash.
DeepSeek aims to bring open-source models back to the top tier.
According to a December 2nd report by Zhidongxi, DeepSeek released two new models last night: DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. They are DeepSeek's most powerful models to date, achieving top performance among open-source models worldwide on benchmarks spanning reasoning, agents, and other fields.
DeepSeek stated that the standard DeepSeek-V3.2 has reached the level of GPT-5 and trails only Gemini-3.0-Pro slightly on public reasoning benchmarks. Compared with Kimi-K2-Thinking, V3.2's output length is significantly shorter, notably reducing computational overhead and user waiting time.
The long-thinking enhanced version, DeepSeek-V3.2-Speciale, incorporates the theorem-proving ability of DeepSeek-Math-V2 and shows strong instruction-following, mathematical-proof, and logical-verification capabilities. Its performance on mainstream reasoning benchmarks is comparable to Gemini-3.0-Pro.
DeepSeek-V3.2 also leads the open-source field. According to data from the model-evaluation platform Artificial Analysis, before DeepSeek-V3.2 was added, the open-source model with the highest intelligence rating in the industry was Kimi-K2-Thinking.
On benchmarks where both DeepSeek-V3.2 and Kimi-K2-Thinking have published results under identical test settings, DeepSeek-V3.2 outperforms Kimi-K2-Thinking.
Comparison of benchmark results between DeepSeek-V3.2 and Kimi-K2-Thinking. Data source: official channels
DeepSeek-V3.2 is also the first DeepSeek model to integrate thinking into tool use, supporting tool invocation in both thinking and non-thinking modes.
In agent evaluations, DeepSeek-V3.2 reaches the highest level among current open-source models, significantly narrowing the gap between open-source and closed-source models. Notably, V3.2 received no special training on the tools in these test sets, which suggests strong generalization in real-world application scenarios.
In addition, DeepSeek-V3.2-Speciale won gold medals at IMO 2025 (International Mathematical Olympiad), CMO 2025 (Chinese Mathematical Olympiad), the ICPC World Finals 2025 (International Collegiate Programming Contest World Finals), and IOI 2025 (International Olympiad in Informatics). Its ICPC and IOI results matched the second-place and tenth-place human contestants, respectively.
On highly complex tasks, the Speciale model significantly outperforms the standard version, but it consumes far more tokens and incurs higher costs. For now, DeepSeek-V3.2-Speciale is available for research use only, does not support tool invocation, and has not been optimized for everyday conversation and writing tasks.
DeepSeek's official web version, app, and API have all been updated to the release version of DeepSeek-V3.2. The Speciale version is currently offered only as a temporary API service for community evaluation and research. The DeepSeek-V3.2 series has been open-sourced, and the technical report was released at the same time.
It's worth mentioning that the technical report's author list contains many familiar names, including Liang Wenfeng, founder and CEO of DeepSeek, and Chen Deli, a researcher who recently represented DeepSeek at the Wuzhen World Internet Conference.
Technical report:
https://modelscope.cn/models/deepseek-ai/DeepSeek-V3.2/resolve/master/assets/paper.pdf
Open-source links:
DeepSeek-V3.2
https://modelscope.cn/models/deepseek-ai/DeepSeek-V3.2
DeepSeek-V3.2-Speciale
https://modelscope.cn/models/deepseek-ai/DeepSeek-V3.2-Speciale
01. Is the gap between open-source and closed-source models widening? DeepSeek identifies three reasons
Why has the gap between open-source and proprietary models kept widening over the past few months? This is a question the DeepSeek team has been pondering.
The DeepSeek team believes three main factors limit open-source models' ability on complex tasks.
First, architecturally, open-source models still rely mainly on the original attention mechanism, which severely restricts long-sequence processing efficiency. This inefficiency poses significant obstacles to large-scale deployment and to an effective post-training stage.
Second, in resource allocation, open-source models invest insufficient compute in the post-training stage, which limits their performance on difficult tasks.
Finally, in agent applications, open-source models lag significantly behind proprietary models in generalization and instruction following, which hinders their effectiveness in real-world deployment.
To address these key limitations, DeepSeek first introduced DSA (DeepSeek Sparse Attention), an efficient sparse-attention mechanism designed to significantly reduce computational complexity. This architecture resolves the efficiency bottleneck while maintaining model performance even in long-context scenarios.
Second, DeepSeek developed a stable and scalable reinforcement-learning protocol that permits large-scale compute expansion in the post-training stage. Notably, the post-training compute budget allocated under this framework exceeds 10% of the pre-training cost, which is rare in the industry and unlocks the model's advanced capabilities.
Third, DeepSeek proposed a novel process to promote generalized reasoning in tool-use scenarios. The R&D team implemented a cold-start phase following the DeepSeek-V3 approach, unifying reasoning and tool use in a single trajectory.
They then scaled up agent-task synthesis, generating more than 1,800 distinct environments and 85,000 complex prompts. This extensively synthesized data drives the reinforcement-learning process, significantly enhancing the model's generalization and instruction following in agent contexts.
02. Built on the final version of DeepSeek-V3.1, DSA makes model computation smarter
DeepSeek-V3.2 uses exactly the same architecture as the previously released experimental version, DeepSeek-V3.2-Exp. Compared with the last release in the DeepSeek-V3.1 series, DeepSeek-V3.1-Terminus, the only architectural change in DeepSeek-V3.2 is the introduction of DSA through continued training.
When processing a token, the traditional attention mechanism must compute against all preceding tokens, which is very time-consuming on long texts. The idea behind DSA is to quickly select the most important tokens first, then run detailed computation only on those.
This selection is performed by a lightning indexer, which computes index scores between the query token and the preceding tokens to decide which tokens should be selected. Because the lightning indexer has a small number of heads and can be implemented in FP8, its computational efficiency is excellent.
Once index scores are obtained for each query token, the fine-grained token-selection mechanism retrieves only the key-value entries with the top-k index scores and computes the attention output over them.
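The mechanics can be pictured with a short sketch. Below is a minimal, single-head PyTorch illustration; the tensor names, shapes, and the top_k value are our assumptions for exposition, not DeepSeek's actual implementation (which uses multiple indexer heads and FP8 kernels).

```python
import torch

def sparse_attention_step(q, keys, values, idx_q, idx_keys, top_k=2048):
    """Illustrative DSA step for one query token (names are assumptions).

    q:        (d,)       query vector for the current token
    keys:     (T, d)     cached keys for all previous tokens
    values:   (T, d)     cached values
    idx_q:    (d_idx,)   lightweight indexer query (small dim; FP8 in practice)
    idx_keys: (T, d_idx) lightweight indexer keys
    """
    # 1) Lightning indexer: cheap scores over all previous tokens.
    index_scores = idx_keys @ idx_q               # (T,)

    # 2) Fine-grained selection: keep only the top-k scoring tokens.
    k = min(top_k, index_scores.numel())
    top_idx = index_scores.topk(k).indices        # (k,)

    # 3) Full attention restricted to the selected key-value entries.
    sel_k, sel_v = keys[top_idx], values[top_idx]
    attn = torch.softmax(sel_k @ q / keys.shape[-1] ** 0.5, dim=-1)
    return attn @ sel_v                           # (d,)
```

The indexer still scans every previous token, but with a tiny head dimension, so the expensive softmax attention in step 3 touches only k entries instead of all T.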
Training of DeepSeek-V3.2 starts from the DeepSeek-V3.1-Terminus base checkpoint, whose context length has already been extended to 128K.
During continued pre-training, the model first undergoes a "dense warm-up": full attention is kept unchanged, and only the indexer is trained to mimic the distribution of the original attention.
It then enters the sparse training stage, where the real token-selection mechanism is switched on and the entire model is optimized jointly. Through this gradual transition, the model can migrate smoothly from dense attention to a sparse structure without a significant drop in performance.
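As a rough illustration of the warm-up stage, the indexer can be trained to match the frozen dense-attention distribution with a KL-divergence objective. The sketch below is our reading of that idea, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def indexer_warmup_loss(dense_attn, index_scores):
    """Dense warm-up sketch: align the trainable indexer's score
    distribution with the frozen model's dense attention distribution.

    dense_attn:   (T,) attention weights over previous tokens (target).
    index_scores: (T,) raw scores from the lightning indexer (trainable).
    """
    log_pred = F.log_softmax(index_scores, dim=-1)
    # KL(dense_attn || softmax(index_scores)); only the indexer receives gradients.
    return F.kl_div(log_pred, dense_attn, reduction="sum")
```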
In capability evaluations, DeepSeek-V3.2-Exp showed results comparable to or better than its predecessors on standard benchmarks, human-preference evaluations, and multiple long-context tasks.
Whether on Chatbot Arena Elo scores or long-sequence tests such as AA-LCR and Fiction.liveBench, the results indicate that model quality was not sacrificed by introducing sparse attention; instead, the model gained clear advantages in long-sequence reasoning.
In terms of actual inference cost, DSA reduces the model's core attention complexity from quadratic to roughly linear growth, so the savings become more pronounced as sequence length increases. Although the indexer itself still processes global information, its overhead is far lower than that of the original MLA.
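A back-of-the-envelope calculation shows why the gap widens with length. Assuming, for illustration, a fixed selection budget of k = 2048 tokens:

```python
# Dense attention touches all L previous tokens per query (O(L^2) total);
# DSA's expensive path touches only the top-k selected tokens (O(L*k)).
k = 2048  # illustrative selection budget, not an official figure
for L in (8_192, 32_768, 131_072):
    ratio = (L * L) / (L * k)  # dense cost / sparse cost
    print(f"L = {L:>7}: dense attention is ~{ratio:.0f}x the sparse cost")
# Prints 4x at 8K, 16x at 32K, and 64x at 128K context.
```

The indexer's own full-length scan remains, but with few heads and FP8 it is a small constant factor, consistent with the caveat above.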
Combined with engineering optimizations, DeepSeek-V3.2 achieves significant end-to-end acceleration on H800 GPUs, and a specialized masking pattern further improves efficiency in short-context scenarios. Overall, DeepSeek-V3.2 breaks through the performance bottleneck of long-context inference while preserving its capabilities.
DeepSeek-V3.2 achieves significant end-to-end acceleration on H800 GPUs
03. Training six specialist models to let the model generate its own post-training data
The post-training stage of DeepSeek-V3.2 follows continued pre-training. Its goal is to further shape a large but unrefined base model into a final version with reasoning, tool-use, agent-task, and alignment capabilities.
The whole process continues the approach of DeepSeek-V3.2-Exp, still training efficiently on sparse attention. Post-training relies mainly on two routes: expert distillation and hybrid reinforcement learning. Combining the two lets the model achieve stable, balanced capability gains across different fields.
The core idea of expert distillation is to have specialized expert models learn different tasks and then aggregate those experts' capabilities into a single unified model.
The team first starts from the same DeepSeek-V3.2 base checkpoint and trains dedicated models for six classes of specialist tasks: mathematics, programming, logical reasoning, general agents, agentic coding, and agentic search. Each expert produces data in two modes, thinking and direct answering, and is strengthened with large-scale RL to ensure it reaches a high level in its own domain.
These experts are then responsible for generating high-quality domain data to train the unified large model. Experiments show that the large model distilled from expert data already performs very close to each individual expert, and subsequent RL fine-tuning essentially eliminates the remaining gap.
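Structurally, the data flow reads like the hypothetical outline below; the names, schema, and interfaces are ours, invented for illustration, not DeepSeek's pipeline.

```python
from dataclasses import dataclass
from typing import Callable

DOMAINS = ("math", "coding", "logic", "general agent", "agentic coding", "agentic search")
MODES = ("thinking", "direct")  # each expert emits both answer styles

@dataclass
class Sample:
    domain: str
    mode: str
    prompt: str
    response: str

def build_distillation_set(
    experts: dict[str, Callable[[str, str], str]],  # domain -> generate(prompt, mode)
    prompts: dict[str, list[str]],                  # domain -> domain-specific prompts
) -> list[Sample]:
    """Collect expert generations as SFT data for the unified model;
    a final RL pass then closes the residual gap to each expert."""
    data = []
    for domain in DOMAINS:
        for prompt in prompts.get(domain, []):
            for mode in MODES:
                data.append(Sample(domain, mode, prompt, experts[domain](prompt, mode)))
    return data
```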
The hybrid reinforcement-learning stage continues to use the GRPO (Group Relative Policy Optimization) algorithm, merging the training of reasoning, agents, and human alignment into a single stage and thereby avoiding the catastrophic forgetting common in multi-stage training.
Reasoning and agent tasks rely mainly on rule-based rewards, length penalties, and language-consistency rewards, while general tasks are scored by a generative reward model against task-specific rubrics. The advantage of this design is that the model is not biased toward any single task type and maintains a stable balance of capabilities overall.
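GRPO's defining trick is to normalize each sampled response's reward against its own group, removing the need for a learned value model. Here is a minimal sketch of that advantage computation (standard GRPO, without DeepSeek's additional stability improvements):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G responses sampled for one prompt,
    e.g. rule-based scores plus penalties, or generative-reward-model scores.
    Each response is judged relative to its own group; no critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```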
To keep reinforcement learning stable under large-scale compute, the team made multiple improvements to GRPO, enabling the large model to maintain good convergence during long, high-intensity training.
In post-training, DeepSeek-V3.2 focuses on the problem of how to combine thinking mode with tool use. To keep the model from repeatedly re-reasoning across multiple rounds of tool invocation, the team designed a new context-management mechanism: the thinking trajectory is cleared only when a new user message arrives, and appending tool outputs does not cause the reasoning content to be discarded.
Meanwhile, the tool-invocation history is fully retained, ensuring the model can keep using its existing reasoning to complete subsequent actions. Early in training, because the reasoning data and agent data come from different sources, the model needs a cold-start method to splice the "thinking while using tools" mode together. Therefore, the team designed specific system
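The context rule described above can be captured in a few lines. This is a minimal sketch assuming a simple message schema (dicts with a "role" of "user", "assistant", "thinking", or "tool"), which is our invention for illustration:

```python
def update_context(history: list[dict], new_message: dict) -> list[dict]:
    """Apply the clearing rule: a new user turn drops prior thinking
    trajectories, while tool outputs never trigger clearing, so reasoning
    stays usable across many tool calls within one user turn."""
    if new_message["role"] == "user":
        # Clear old thinking; tool calls and their outputs are kept in full.
        history = [m for m in history if m["role"] != "thinking"]
    history.append(new_message)
    return history
```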