On ChatGPT's third anniversary, it took a heavy hit from DeepSeek. The 23-page technical report holds all the secrets behind the climb to the top of the open-source field.
On the third anniversary of ChatGPT's birth, DeepSeek presents a "birthday gift."
Just now, DeepSeek released two models in one go: DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. These two models not only come close to GPT-5 and Gemini-3.0-Pro in reasoning ability, but, more importantly, they solve a problem that has long troubled open-source models:
How to make AI both capable of deep thinking and proficient in using tools?
A summary of the new models:
- DeepSeek-V3.2 (Standard Edition): Focuses on cost-effectiveness and daily use. Its reasoning ability reaches the level of GPT-5. It has shorter output, faster speed, and lower cost than Kimi-K2-Thinking, and for the first time it realizes "thinking while using tools." The official website, app, and API have all been upgraded to this version, which is suitable for daily Q&A, writing, and agent tasks.
- DeepSeek-V3.2-Speciale (Ultimate Enhanced Edition): Aims at exploring the upper limit of AI capabilities. Its performance is comparable to Gemini-3.0-Pro. It won gold medals at the 2025 IMO, IOI, and ICPC (ranking 10th among human contestants at the IOI and 2nd at the ICPC). Only a temporary API is provided. It has a long thinking chain, high token consumption, and high cost; it does not support tool calls and is not optimized for everyday conversation. The service ends on December 15, 2025.
The weights of both models have been open-sourced on HuggingFace and ModelScope. You can download them for local deployment.
Slow and clumsy? DeepSeek-V3.2 introduces new breakthrough technology
In the past few months, there has been an obvious trend in the AI circle: Closed-source models are getting faster and faster, while open-source models are a bit left behind. After analysis, the DeepSeek team found that open-source models have three core bottlenecks when dealing with complex tasks: architecture issues, resource allocation, and agent capabilities.
In response to these three problems, DeepSeek has come up with three major tricks this time.
If you have used some AI models to process extremely long documents, you may have noticed that the speed becomes slower and slower, or even the system freezes. This is the fault of the traditional attention mechanism.
The logic of the traditional attention mechanism is: each word must calculate its correlation with all previous words, so the longer the document, the more computation is required. It's like chatting in a 1000-person WeChat group where, before speaking, you have to check one by one whether each of those 1000 people is the one you're looking for - obviously exhausting.
The DSA (DeepSeek Sparse Attention) mechanism introduced this time takes a different approach: it doesn't attend to every word, only to the truly important parts.
Its core is something called the "Lightning Indexer."
This indexer will quickly assign a score to each word, and then only select the words with the highest scores to calculate attention. It's like in a group of 1000 people, you first use the search function to filter out those with "Zhang" in their names, and then find the Zhang San you want from these 50 people. The efficiency immediately improves.
What's even smarter is that the Lightning Indexer uses very little computational resources and supports FP8 precision calculation (a low-precision but efficient calculation method), so it itself will not become a new performance bottleneck.
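The idea of "score first, then attend to only the top-k tokens" can be sketched in a few lines of Python. This is a toy illustration only: the real lightning indexer is a small learned module running in FP8, the selection happens inside an optimized attention kernel, and all names here are mine.

```python
import numpy as np

def lightning_index_scores(query, keys):
    """Cheap relevance score per previous token (a stand-in for the
    lightning indexer; the real one is a small learned FP8 module)."""
    return keys @ query  # dot-product similarity, shape (seq_len,)

def sparse_attention(query, keys, values, k=4):
    """Attend only to the top-k tokens chosen by the indexer,
    instead of to all previous tokens."""
    scores = lightning_index_scores(query, keys)
    topk = np.argsort(scores)[-k:]                   # indices of the k best tokens
    sel_k, sel_v = keys[topk], values[topk]
    logits = sel_k @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                         # softmax over only k tokens
    return weights @ sel_v                           # mixes k values, not seq_len

# toy usage: 1000 "previous tokens", but attention touches only 4 of them
rng = np.random.default_rng(0)
d, n = 16, 1000
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = sparse_attention(q, K, V, k=4)
print(out.shape)  # (16,)
```

The point of the sketch is the cost structure: the full softmax runs over k selected tokens rather than all n, so per-token work stops growing with document length once the cheap scoring pass is done.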
What's the actual effect? V3.2 supports a context length of 128K, which is equivalent to the length of a medium-length novel, but the processing speed and efficiency have been greatly improved. And according to the official tests in various scenarios, the performance of the DSA version is not inferior to the traditional attention mechanism at all, and is even better in some scenarios.
V3.2 is based on the previous version, V3.1-Terminus, and DSA is introduced through continuous training. The whole process is divided into two stages, and the same data distribution as when V3.1-Terminus was extended to 128K is used, ensuring a smooth transition of the model's capabilities.
In addition, having a good architecture is not enough; training also needs to keep up.
Another gap between open-source models and closed-source models is that open-source models invest too little computational resources in the later stage of training. It's like building a house. When the budget is used up, the decoration is done casually, and finally, you find there are problems everywhere when you move in.
The technical report shows that DeepSeek's computational budget in the post-training stage exceeds 10% of the pre-training cost. But spending money is also a skill. DeepSeek has built a "stable and scalable reinforcement learning training framework," which has two characteristics.
One is stability. Reinforcement learning training itself is not very stable, and problems such as training crashes and performance fluctuations are prone to occur. DeepSeek's framework can keep the training stable under large-scale computing, which is itself a technological breakthrough.
The other is scalability. This framework allows the computational budget in the post-training stage to greatly exceed the traditional approach, thereby unlocking the advanced capabilities of the model.
The specific training process is divided into two steps.
The first step is "expert distillation." They trained dedicated expert models in six major professional fields, such as mathematics, programming, logical reasoning, and agent tasks. Each expert model is trained under large-scale reinforcement learning computing, and training data is generated for both the "thinking mode" (long-chain thinking) and the "non-thinking mode" (direct answering).
After the expert models are trained, they are used to generate the training data for the final model. The experimental results show that the performance of the model trained with this expert-distilled data is only slightly lower than that of the corresponding expert model, and this gap can be smoothed out in the subsequent reinforcement learning training.
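The data-generation loop described above might look like this. Purely illustrative: the experts are real RL-trained models, and `dummy_expert` and all function names are my stand-ins.

```python
# 4 of the 6 professional domains named in the report
DOMAINS = ["math", "coding", "logic", "agent"]

def build_distillation_set(experts, prompts_by_domain):
    """Each domain expert answers its own prompts twice: once in
    'thinking mode' (long chain of thought) and once in 'non-thinking
    mode' (direct answer). Both go into the final model's training set."""
    dataset = []
    for domain in DOMAINS:
        expert = experts[domain]
        for prompt in prompts_by_domain[domain]:
            dataset.append({"prompt": prompt, "mode": "thinking",
                            "response": expert(prompt, think=True)})
            dataset.append({"prompt": prompt, "mode": "non-thinking",
                            "response": expert(prompt, think=False)})
    return dataset

# toy stand-in for a trained expert model
def dummy_expert(prompt, think):
    return ("<think>...</think> " if think else "") + f"answer to {prompt}"

experts = {d: dummy_expert for d in DOMAINS}
prompts = {d: [f"{d} question 1", f"{d} question 2"] for d in DOMAINS}
data = build_distillation_set(experts, prompts)
print(len(data))  # 16 samples: 4 domains x 2 prompts x 2 modes
```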
The second step is "hybrid reinforcement learning training." DeepSeek continues to use GRPO (Group Relative Policy Optimization) as the main training algorithm, integrating inference tasks, agent tasks, and human preference alignment tasks into a single reinforcement learning stage.
The advantage of this unified training is that it can improve the performance in different task domains and avoid the common "catastrophic forgetting" problem in traditional multi-stage training. You can understand it as: AI will not forget old skills while learning new ones.
In reasoning and agent tasks, they use rule-based outcome rewards, output-length penalties, and language-consistency rewards to guide learning. For general tasks, a generative reward model is used, with evaluation criteria defined for each prompt.
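In GRPO, a group of responses is sampled for each prompt, and each response's advantage is its reward normalized against the group's own mean and standard deviation, with no separate value network. A toy version of that plus the reward shaping just described (every coefficient here is invented for illustration):

```python
import statistics

def rule_reward(answer, gold, length, lang_ok, max_len=2048,
                len_coef=0.1, lang_coef=0.1):
    """Toy rule-based reward: correctness, an output-length penalty,
    and a language-consistency bonus (coefficients are made up)."""
    r = 1.0 if answer == gold else 0.0
    r -= len_coef * max(0.0, length / max_len - 1.0)  # penalize overlong outputs
    r += lang_coef * (1.0 if lang_ok else 0.0)
    return r

def grpo_advantages(rewards):
    """Group-relative advantage: normalize each sampled response's reward
    by the mean and std of its own group (no critic/value network)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mu) / sigma for r in rewards]

# a group of 4 sampled answers to one prompt whose gold answer is "42"
rewards = [rule_reward(a, "42", n, ok) for a, n, ok in
           [("42", 300, True), ("41", 500, True), ("42", 4096, True), ("x", 200, False)]]
adv = grpo_advantages(rewards)
print([round(x, 2) for x in adv])
```

Correct, concise, language-consistent answers end up with positive advantage; wrong or bloated ones go negative, which is the signal that drives the policy update.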
V3.2 is a stable version obtained after thousands of steps of training under this hybrid reinforcement learning. The Speciale version is even more radical. It is only trained on inference task data, reduces the output length penalty, and introduces the dataset and reward mechanism of DeepSeekMath-V2 to further enhance the mathematical proof ability.
As a result, the reasoning ability of V3.2 directly catches up with GPT-5, and the Speciale version gets even closer to Gemini-3.0-Pro because it removes the limit on thinking length.
Thinking + tool call: AI learns to "think and do at the same time"
There used to be an embarrassing problem with the previous DeepSeek models: After entering the "thinking mode," they could not call tools such as search and code execution. It's like a person's hands stop moving after falling into deep thought. This obviously does not conform to the way we solve complex problems.
In reality, when we encounter difficult problems, we often look up information while thinking, analyze and verify at the same time. Thinking and action are intertwined. AI should be the same.
The DeepSeek team found that if they directly replicated the strategy of DeepSeek-R1 (discarding the previous inference content after receiving the second-round message), it would seriously reduce the Token usage efficiency. This method would force the model to repeat the entire reasoning process from the beginning every time it calls a tool, resulting in a waste of resources.
They specifically designed a "thinking context management mechanism" for tool call scenarios.
The core logic is: Only when the user sends a new message will the historical inference content be cleared. If only tool-related information (such as tool output results) is appended, the previous inference content will be retained, allowing the reasoning process to continue.
At the same time, when the inference content is removed, the tool call history and the results returned by the tools will still be retained in the context, ensuring that the model can still make judgments based on the existing information in subsequent reasoning.
In this way, AI can: think for a while, call tools (such as search and run code), continue thinking after seeing the results, and then call tools again, and so on. And the historical inference content will be retained, so there is no need to start thinking from the beginning again every time a tool is called.
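The described policy boils down to one rule, which might look like this in simplified Python (the message roles and structure here are my assumptions for illustration, not DeepSeek's actual API format):

```python
def update_context(context, message):
    """Keep reasoning across tool calls; clear it only on a new user turn.
    Tool calls and tool results are always retained."""
    if message["role"] == "user":
        # a fresh user message starts a new reasoning episode
        context = [m for m in context if m["role"] != "reasoning"]
    context.append(message)
    return context

ctx = []
for msg in [{"role": "user",      "content": "plan a trip"},
            {"role": "reasoning", "content": "need hotel prices..."},
            {"role": "tool",      "content": "hotel A: 760 yuan"},
            {"role": "reasoning", "content": "fits budget, check food..."},
            {"role": "user",      "content": "actually make it 2 days"}]:
    ctx = update_context(ctx, msg)

roles = [m["role"] for m in ctx]
print(roles)  # ['user', 'tool', 'user']
```

Note that the tool result survives the reset while the old reasoning does not, matching the behavior described above: later turns can still use tool outputs without replaying the whole chain of thought.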
The official example is very vivid: Planning a complex three-day trip that needs to meet various budget constraints, rating requirements, and non-repetition principles. For example, on the second day, if a luxury hotel (over 800 yuan) is booked, the total cost of lunch and dinner cannot exceed 350 yuan, the ratings of the restaurants must be above 4.0, and the afternoon scenic spot tickets must be less than 120 yuan. If a mid-range to high-end hotel (500 to 800 yuan) is booked, at least one restaurant's rating must reach 4.0, and the scenic spot tickets must be less than 180 yuan.
This kind of task requires AI to repeatedly query hotel, restaurant, and scenic spot information, and at the same time perform logical reasoning and constraint checks. V3.2 can think while searching and finally give a perfect answer.
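The day-2 constraints in that example are easy to express as code, which is exactly what makes the task easy to verify. The thresholds come from the article; the function itself is my illustration, not anything DeepSeek ships:

```python
def day2_plan_ok(hotel_price, meal_total, ratings, ticket_price):
    """Check the day-2 booking constraints from the example (prices in yuan)."""
    if hotel_price > 800:                 # luxury hotel
        return (meal_total <= 350         # lunch + dinner within budget
                and all(r >= 4.0 for r in ratings)
                and ticket_price < 120)
    if 500 <= hotel_price <= 800:         # mid-to-high-end hotel
        return (any(r >= 4.0 for r in ratings)
                and ticket_price < 180)
    return True  # the article states no constraints for cheaper hotels

print(day2_plan_ok(900, 340, [4.2, 4.5], 100))   # True
print(day2_plan_ok(900, 340, [4.2, 3.8], 100))   # False: one rating below 4.0
print(day2_plan_ok(650, 400, [3.9, 4.1], 150))   # True
```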
However, note that some agent frameworks (such as Roo Code or Terminus) simulate tool interactions through user messages. Due to their context management method, they may not be able to fully utilize the advantages of the inference content retention mechanism. For such systems, the official recommends using the "non-thinking mode" first.
The thinking mode of V3.2 already supports Claude Code and can be used in command-line tools. However, clients such as Cline and RooCode that use non-standard tool calls are not yet fully compatible; keep this in mind when using them.
In the process of realizing "thinking + tool call," DeepSeek also made a clever design called "cold start." Considering that there are two types of data on hand (one is non-agent data with a reasoning process, and the other is agent task data without a reasoning process), they combined the two through carefully designed prompts.
They believe that the model already has a strong ability to understand instructions. Through clear instructions, the model can naturally integrate tool execution into the reasoning process. This enables "tool use" to seamlessly integrate into the "reasoning process" and achieve the integration of capabilities in the cold start stage.
Large-scale agent tasks: Let AI train itself
In terms of improving the capabilities of large models, DeepSeek has taken a different path - not having humans teach AI, but letting AI train itself.
They built a large-scale agent task pipeline, creating more than 1800 virtual environments and more than 80,000 tasks. These tasks share a common feature: they are difficult to answer but easy to verify. That is, the questions are very complex, but it's easy to tell whether an answer is correct. In this way, AI can grind through problems, revise them, and review its work without limit, continuously strengthening its reasoning ability.
On this pipeline, different agents play different roles: Some are responsible for mining knowledge from the Internet and raising questions; some are responsible for generating various answers; and some are responsible for verifying whether the answers are correct. Only the data that passes the verification will enter the training set. This makes the model smarter and smarter during training and prevents it from learning in the wrong direction.
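That division of labor might be sketched like this. It's a deliberately tiny toy: "questions" are integers, "solving" means guessing a factorization, and verification is a single multiplication, which captures the "hard to answer, easy to verify" property. Nothing here is DeepSeek's actual code.

```python
import random

def build_training_data(questions, solver, verifier, n_samples=4):
    """One agent proposes questions, one samples candidate answers,
    one verifies; only verified pairs enter the training set."""
    accepted = []
    for q in questions:
        for _ in range(n_samples):
            ans = solver(q)
            if verifier(q, ans):          # cheap check filters out bad data
                accepted.append((q, ans))
                break                     # one verified answer per question
    return accepted

def solver(n):
    """'Answering' = guessing a factor pair (hard-ish to produce)."""
    d = random.randint(2, n - 1)
    return (d, n // d)

def verifier(n, pair):
    """Verification = one multiplication (trivially cheap to check)."""
    return pair[0] * pair[1] == n

random.seed(1)
data = build_training_data([15, 21, 35], solver, verifier, n_samples=50)
print(data)
```

Because the verifier is cheap and reliable, the solver can be as noisy as it likes: wrong guesses simply never reach the training set, which is what keeps the self-training loop from drifting in the wrong direction.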
What's even more hardcore is in the code field. DeepSeek directly grabs real Issues and fix patches from GitHub, and lets agents build test environments, install dependencies, and run test cases to ensure that bug fixes are really effective and do not introduce new problems. After rounds of automated tempering, the model has obtained practical capabilities in multiple programming languages.
Finally, the most amazing part is the general agent. It can not only solve problems but also automatically generate tasks, tools, and verification logic. Given a task type, such as travel planning, it will collect data by itself, generate tools, increase the difficulty, and iterate on solutions until a complete task system is formed. Eventually, it creates thousands of environments and tasks, truly realizing a world where AI generates data to train AI.
In a nutshell: DeepSeek has changed the training process from "humans feeding data" to "AI creating data, verifying data, and becoming stronger with data." This not only improves the model's logical ability but also gives AI a feature that did not exist before - self-evolution.
How amazing are the test results?