
The DeepSeek-V3.2 series is open-sourced, and its performance is directly comparable to that of Gemini-3.0-Pro.

QbitAI (量子位), 2025-12-02 08:52
Open-source models are scoring big again, courtesy of DeepSeek.

A surprise attack!

On the third anniversary of ChatGPT's release, DeepSeek suddenly launched two models:

  • DeepSeek-V3.2
  • DeepSeek-V3.2-Speciale

The former focuses on balanced practicality and is suitable for daily Q&A, general Agent tasks, and tool invocation in real application scenarios.

Its reasoning ability reaches GPT-5's level, slightly below Gemini-3.0-Pro.

The latter emphasizes deep reasoning, and its performance on reasoning benchmarks is comparable to Gemini-3.0-Pro's.

It also won gold medals at IMO 2025, CMO 2025, ICPC World Finals 2025, and IOI 2025 all at once.

Notably, at the ICPC it matched the second-ranked human contestant, and at the IOI the tenth-ranked human contestant.

Specifically, DeepSeek-V3.2 focuses on balancing reasoning ability and output length while reducing computational overhead.

DeepSeek's official Weibo post stated, "The DeepSeek-V3.2 model has reached the highest level among current open-source models in Agent evaluations."

Other information about this model is as follows:

  • Its reasoning ability is comparable to that of GPT-5;
  • Compared with Kimi-K2-Thinking, it significantly shortens output length, reducing users' waiting time;
  • It is DeepSeek's first model to "integrate thinking into tool invocation," supporting tool calls in both thinking and non-thinking modes;
  • Trained on large-scale Agent data spanning over 1,800 environments and more than 85,000 complex instructions, it generalizes strongly.

The following figure shows the scores of DeepSeek-V3.2 and other models on various Agent tool invocation evaluation sets.

(The team particularly emphasizes that DeepSeek-V3.2 received no special training on the tools in these test sets.)

DeepSeek-V3.2-Speciale is a long-thinking enhanced version of DeepSeek-V3.2, integrating the theorem-proving ability of DeepSeek-Math-V2.

In terms of instruction following, mathematical proof, and logical verification, DeepSeek-V3.2-Speciale excels and is recommended for highly complex mathematical reasoning, programming competitions, and academic research tasks.

Note, however, that this version has not yet been optimized for everyday conversation and writing.

Moreover, it is only for research use and does not support tool invocation.

On highly complex tasks, the Speciale model significantly outperforms the standard version, but it consumes far more tokens and incurs higher costs.

Currently, both the DeepSeek App and the web version have been updated to the official DeepSeek-V3.2; the Speciale version is, for now, available only via the API.

Simultaneously with the model release, the technical report has also been published.

The paper reveals some striking technical details:

A new sparse attention mechanism, DSA, sharply reduces computational complexity; the compute spent on reinforcement learning exceeds 10% of pre-training compute; and there is a brand-new large-scale Agent task-synthesis pipeline...

Let's take a detailed look at the specific situation.

DSA, an Efficient Sparse Attention Mechanism: Long Context Is No Longer a Burden

The biggest architectural innovation of DeepSeek-V3.2 is the introduction of the DSA (DeepSeek Sparse Attention) mechanism.

The computational complexity of traditional attention mechanisms when processing long sequences is O(L²), which severely restricts the deployment efficiency of the model and the scalability of subsequent training.

DSA reduces the computational complexity to O(L·k), where k is much smaller than L.

Meanwhile, DSA significantly accelerates inference in long - context tasks without significant performance loss.

It supports FP8 precision and is compatible with the MLA (Multi-head Latent Attention) architecture, making it training-friendly.

How is this achieved?

DSA mainly consists of two components: one is called the lightning indexer, and the other is the fine-grained token selection mechanism.

The lightning indexer quickly computes relevance scores between the query token and historical tokens, then selects only the top-k most relevant tokens for attention computation.

The team specifically chose the ReLU activation function to improve throughput.
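The indexer-plus-selection flow can be sketched in a few lines. This toy NumPy version (shapes and names are invented illustrations, not DeepSeek's implementation, which involves FP8 and MLA machinery the paper describes) only shows why the cost becomes O(L·k): the expensive softmax attention runs over just the k selected tokens, while the indexer does one cheap ReLU-scored pass over all L.

```python
import numpy as np

def lightning_index_scores(q, keys):
    # ReLU-activated relevance score of one query against all L historical
    # tokens; ReLU is the activation the team chose for throughput.
    return np.maximum(keys @ q, 0.0)  # shape (L,)

def sparse_attention(q, keys, values, k=4):
    # Attend over only the top-k indexed tokens: O(L*k) instead of O(L^2).
    scores = lightning_index_scores(q, keys)
    topk = np.argsort(scores)[-k:]             # k most relevant positions
    logits = keys[topk] @ q / np.sqrt(len(q))  # attention only on k tokens
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[topk]                    # weighted sum of k values

rng = np.random.default_rng(0)
L, d = 64, 8
out = sparse_attention(rng.normal(size=d), rng.normal(size=(L, d)),
                       rng.normal(size=(L, d)))
print(out.shape)  # (8,)
```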

When continuing to train DeepSeek-V3.1-Terminus, the team adopted a two-stage strategy.

The first stage is a dense warm-up, where dense attention is maintained and only the lightning indexer is trained, so that it learns to align with the distribution of the main attention.

This stage only took 1000 steps and processed 2.1 billion tokens.

In the second stage, the sparse mechanism was introduced: each query token selects 2,048 key-value pairs, and training ran for 15,000 steps over a total of 943.7 billion tokens.

The actual test results are quite impressive —

On sequences of length 128K, the inference cost of DeepSeek-V3.2 is several times lower than that of V3.1-Terminus.

Tests on the H800 cluster show that at a sequence length of 128K, the cost per million tokens drops from $0.7 to about $0.2 in the prefill stage, and from $2.4 to $0.8 in the decoding stage.
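Taken at face value, those reported figures imply roughly a 3.5× cheaper prefill and 3× cheaper decoding at 128K. A quick arithmetic check:

```python
# Per-million-token costs at 128K context reported in the article (USD).
prefill_before, prefill_after = 0.7, 0.2
decode_before, decode_after = 2.4, 0.8

prefill_speedup = prefill_before / prefill_after
decode_speedup = decode_before / decode_after
print(round(prefill_speedup, 1), round(decode_speedup, 1))  # 3.5 3.0
```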

Post-training Compute Exceeds 10% of Pre-training

It is worth noting that the DeepSeek team has invested heavily in reinforcement learning this time.

The paper clearly states that the computational budget for RL training exceeded 10% of the pre-training cost, which is quite rare among open-source models.

DeepSeek notes in the technical report that open-source models typically under-invest compute in the post-training stage, which limits their performance on difficult tasks.

To address this, the team developed a stable and scalable RL protocol that allows the post-training computational budget to exceed 10% of the pre-training cost, unlocking the model's advanced capabilities.

Let's elaborate —

To stably scale up the RL computing scale, the team made several improvements based on the GRPO (Group Relative Policy Optimization) algorithm.

First, there is an unbiased KL estimate, which corrects the original K3 estimator and eliminates its systematic error.

The original estimator could give unbounded gradient weights in some cases, leading to unstable training.
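To see where that instability comes from: with the widely used K3 estimator k3 = r − 1 − log r, where r = π_ref/π_θ at a sampled token, the gradient with respect to log π_θ carries a weight of (1 − r), which blows up whenever the current policy assigns the sampled token a tiny probability. The snippet below only illustrates this failure mode; the specific corrected estimator DeepSeek uses is not reproduced here.

```python
def k3_grad_weight(pi_theta, pi_ref):
    # d(k3)/d(log pi_theta) = 1 - r, with r = pi_ref / pi_theta:
    # bounded near r = 1, but unbounded as pi_theta -> 0.
    r = pi_ref / pi_theta
    return 1.0 - r

# On-policy token (r = 1): the gradient weight vanishes, as it should.
print(k3_grad_weight(0.5, 0.5))
# Token made very rare by the current policy: the weight explodes.
print(k3_grad_weight(1e-6, 0.5))
```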

Second, there is an off-policy sequence masking strategy.

In actual training, to improve efficiency, a large batch of rollout data is usually generated and then split into multiple mini-batches for gradient updates. This approach inherently introduces off-policy behavior.

By computing the KL divergence between the sampling policy and the current policy, the team masks out negative-sample sequences that have drifted too far, preventing them from interfering with training.
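A minimal sketch of such a masking rule (the threshold, field names, and per-sequence KL values are all invented for illustration): positive samples survive even when stale, but negative samples whose sampling policy has drifted too far from the current policy are dropped from the update.

```python
def mask_stale_negatives(sequences, kl_threshold=0.5):
    # Keep a rollout sequence unless it is BOTH a negative sample and
    # "stale" (sampling policy too far, in KL, from the current policy).
    kept = []
    for seq in sequences:
        stale = seq["kl_to_current"] > kl_threshold
        if stale and seq["advantage"] < 0:
            continue  # masked out: stale negative samples destabilize updates
        kept.append(seq)
    return kept

batch = [
    {"id": 0, "advantage": +1.2, "kl_to_current": 0.9},  # stale positive: kept
    {"id": 1, "advantage": -0.7, "kl_to_current": 0.9},  # stale negative: masked
    {"id": 2, "advantage": -0.3, "kl_to_current": 0.1},  # fresh negative: kept
]
kept_ids = [s["id"] for s in mask_stale_negatives(batch)]
print(kept_ids)  # [0, 2]
```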

The team also specifically designed a Keep Routing operation for the MoE model.

Differences between the implementations of the inference framework and the training framework can cause the same input to activate different experts, producing sudden jumps in parameter space. By saving the routing path used during inference and forcing the same path during training, consistent parameter optimization is ensured.
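The idea can be sketched with a toy gate (the real Keep Routing operates inside DeepSeek's MoE layers; everything here is an illustrative assumption): record which experts each token was routed to during rollout, then force those same indices in the training forward pass, so small numeric differences in the gate logits cannot flip the expert choice.

```python
import numpy as np

def route(gate_logits, k=2, saved_path=None):
    # Rollout (saved_path=None): pick top-k experts and return their indices
    # so they can be stored. Training: replay the recorded indices instead
    # of re-deciding from the (possibly slightly different) logits.
    if saved_path is None:
        experts = np.argsort(gate_logits)[-k:]
    else:
        experts = saved_path
    weights = np.exp(gate_logits[experts])
    weights /= weights.sum()
    return experts, weights

logits_rollout = np.array([0.1, 2.0, 1.9, -1.0])
path, _ = route(logits_rollout)                   # rollout records the path
# Tiny numeric drift in the training framework would swap experts 1 and 2...
logits_train = np.array([0.1, 1.9, 2.0, -1.0])
forced, _ = route(logits_train, saved_path=path)  # ...but replay keeps it fixed
print(sorted(path.tolist()), sorted(forced.tolist()))  # [1, 2] [1, 2]
```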

In specific training, the team adopted an expert distillation strategy.

First, specialized models were trained for each task, including six fields: mathematics, programming, general logical reasoning, general Agent tasks, Agent programming, and Agent search. Each field supports both thinking and non - thinking modes.

Then, these expert models are used to generate domain - specific data to train the final model.

Breakthroughs in Agent Capabilities

In addition, the breakthroughs of the new model in Agent tasks are also eye - catching.

This time, the team found a way to give the model both reasoning and tool-use capabilities.

In terms of thinking-context management, the team found that the DeepSeek-R1-style strategy of discarding reasoning content at the start of every new turn wastes an enormous number of tokens.

So, a new management mechanism was designed:

Historical reasoning content is discarded only when a new user message arrives; if only tool-related messages are added, the reasoning content is retained. And even when reasoning traces are deleted, the tool invocation history and results stay in the context.
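That retention rule is easy to pin down in code. A toy sketch (the message schema and role names are invented, not DeepSeek's API): reasoning is purged only on a new user turn, while tool messages and results always survive.

```python
def update_context(context, message, reasoning_trace=None):
    # Drop stored reasoning only when a *user* message arrives; tool
    # messages leave it intact, and tool calls/results always remain.
    if message["role"] == "user":
        context = [m for m in context if m["role"] != "reasoning"]
    context.append(message)
    if reasoning_trace is not None:
        context.append({"role": "reasoning", "content": reasoning_trace})
    return context

ctx = []
ctx = update_context(ctx, {"role": "user", "content": "book a flight"},
                     "plan: search flights")
ctx = update_context(ctx, {"role": "tool", "content": "3 flights found"},
                     "pick cheapest")
roles_after_tool = [m["role"] for m in ctx]   # reasoning retained after a tool turn
ctx = update_context(ctx, {"role": "user", "content": "make it business class"})
roles_after_user = [m["role"] for m in ctx]   # reasoning dropped, tool history kept
print(roles_after_tool)
print(roles_after_user)
```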

During the cold-start phase, the DeepSeek-V3.2 team adopted a clever prompt design.

Through a carefully designed system prompt, the model is taught to naturally insert tool invocations during the reasoning process.

For example, on programming-competition problems, the system prompt explicitly requires the model to think first and then answer, marking the reasoning path with special tags.

The most striking part is that the team developed an automated environment-synthesis pipeline, generating 1,827 task-oriented environments and 85,000 complex prompts.

Taking travel planning as an example, the model needs to plan a three - day itinerary under various constraints, including non - repeating cities, adjusting restaurant and attraction budgets according to hotel prices, and other complex logic.

Although it is difficult to find a solution that meets all constraints in a huge combinatorial space, verifying whether a given solution meets the constraints is relatively simple. This characteristic of "difficult to solve but easy to verify" is very suitable for RL training.
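The "hard to solve, easy to verify" property means the verifier itself can serve as the RL reward signal. A toy version with invented constraints (the article's real environments are far richer): checking a candidate plan is a single cheap pass, even though finding one is a combinatorial search.

```python
def satisfies_constraints(itinerary, budget):
    # Verify a candidate 3-day plan: no repeated cities, total spend
    # within budget. Constraints are illustrative stand-ins.
    cities = [day["city"] for day in itinerary]
    if len(cities) != 3 or len(set(cities)) != 3:
        return False
    total = sum(d["hotel"] + d["food"] + d["tickets"] for d in itinerary)
    return total <= budget

plan = [
    {"city": "Kyoto", "hotel": 120, "food": 40, "tickets": 30},
    {"city": "Osaka", "hotel": 100, "food": 50, "tickets": 20},
    {"city": "Nara",  "hotel": 90,  "food": 35, "tickets": 25},
]
print(satisfies_constraints(plan, budget=600))             # True
print(satisfies_constraints(plan + plan[:1], budget=600))  # False: repeats a city
```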

In terms of code Agents, the team mined millions of issue-PR pairs from GitHub. After strict screening and automatic environment construction, tens of thousands of executable software problem-solving environments were successfully built, covering multiple languages such as Python, Java, and JavaScript.

The search Agent uses a multi - Agent pipeline to generate training data. It first samples long - tail entities from large - scale network corpora, and then generates high - quality data through steps such as problem construction, answer generation, and verification.

The evaluation results show that DeepSeek-V3.2 achieves a 73.1% solve rate on SWE-bench Verified and 46.4% accuracy on Terminal-Bench 2.0, significantly surpassing existing open-source models.

On tool-use benchmarks such as MCP-Universe and Tool-Decathlon, DeepSeek-V3.2 also performs close to closed-source models.

These improvements show that the model can generalize its reasoning strategies to Agent scenarios never seen during training.