The gap between open-source and closed-source models is widening: This is the harsh truth revealed by the DeepSeek paper.
On December 2nd, DeepSeek released its V3.2 technical report. In it, the company did something rare: it stated plainly that the performance gap between open-source and closed-source large models is not narrowing but widening.
This is a sober judgment grounded in extensive measurement data.
The gap is widening, and this is a fact
In 2024, as open-source models such as DeepSeek, Qwen, and GLM were released one after another, the community was filled with optimism. Talk of an "eight-month time difference" circulated widely, and many believed open-source was catching up with closed-source. In 2025, however, the situation changed.
DeepSeek wrote bluntly in the paper's introduction: "There has been an obvious divergence in the past few months. Although the open-source community has continued to make progress, the performance of closed-source proprietary models is improving significantly faster. As a result, the gap between the two is not narrowing but widening, and closed-source systems are showing an increasingly strong advantage on complex tasks."
This observation is supported by data. The paper compared DeepSeek V3.2, GPT-5, and Gemini 3.0 Pro across multiple benchmarks. On MMLU-Pro (a multi-disciplinary knowledge test), DeepSeek V3.2 scored 85.0, GPT-5 scored 87.5, and Gemini 3.0 Pro reached 90.1. On GPQA Diamond (graduate-level science questions), their scores were 82.4, 85.7, and 91.9 respectively.
The gap is starker on HLE (Humanity's Last Exam, an extremely difficult text-reasoning benchmark): DeepSeek V3.2 scored 25.1, GPT-5 scored 26.3, and Gemini 3.0 Pro reached 37.7. A gap like that can no longer be described as "close".
It is worth noting that DeepSeek V3.2 is currently the strongest open-source model and leads most open-source comparisons. Even so, an obvious gap remains between it and top closed-source models, especially in scenarios that demand deep reasoning and complex task handling.
Why is the gap widening? Three structural problems
Through systematic analysis, the paper identified three key defects that limit the ability of open-source models in complex tasks. These are not superficial problems but deep-seated structural dilemmas.
The first problem lies at the architectural level.
Open-source models generally rely on the traditional vanilla attention mechanism, which is extremely inefficient at processing long sequences.
The paper notes that this architectural dependence "severely limits long-sequence efficiency and poses a substantial obstacle to scalable deployment and effective post-training." While closed-source models are already exploring more efficient attention mechanisms, open-source models are still running on an architecture from five years ago, which is a huge disadvantage in itself.
The second problem is the gap in resource investment, especially in the post-training stage.
Post-training is the key step that turns a model from one that can talk into one that can think: through reinforcement learning, the model learns to reason, use tools, and follow complex instructions. The paper revealed that DeepSeek V3.2's post-training compute budget exceeded 10% of its pre-training cost. Keep in mind that pre-training is itself an extremely expensive investment, while the post-training budget of most open-source models may be under 1%. This gap in resource investment translates directly into a generational difference in performance.
The third problem is the lag in AI Agent capabilities.
In real-world application scenarios, the generalization and instruction-understanding abilities of open-source models lag significantly. The paper cited three key agent benchmarks: on MCP-Mark, DeepSeek V3.2 scored 45.9 versus 51.0 for Gemini 3.0 Pro; on MCP-Universe, the two scored 80.3 and 87.9; on Tool-Decathlon, the gap is wider still. These numbers reflect the shortfall of open-source models in complex multi-turn interaction, tool calling, and long-horizon planning.
The paper concluded: "Open-source models show an obvious lag in generalization and instruction-following ability, which hinders their effectiveness in real deployment." It is an honest, brutal verdict.
DeepSeek's response: A fundamental change in the technical route
Having diagnosed the problems, DeepSeek chose not to simply stack parameters or pile on data; instead, it made fundamental technical changes along three core dimensions.
At the architectural level, DeepSeek introduced the DSA (DeepSeek Sparse Attention) mechanism.
The computational complexity of traditional attention is O(L²): double the sequence length and the compute quadruples. DSA uses a "Lightning Indexer" to quickly score the importance of every token, then lets only the top-k most important tokens (k = 2048 in the paper) participate in the attention computation, cutting the complexity from O(L²) to O(L×k).
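To make the idea concrete, here is a minimal, self-contained sketch of top-k sparse attention with a cheap indexer, written in PyTorch. It is not DeepSeek's implementation: the indexer here is a plain dot product in a small projection space, and the function names are illustrative only.

```python
import torch
import torch.nn.functional as F

def lightning_indexer_scores(q_lite, k_lite):
    # Cheap importance estimate: a dot product in a small projection
    # space, far cheaper than full attention when d_lite << d_model.
    # q_lite, k_lite: (L, d_lite) -> (L, L) scores.
    return q_lite @ k_lite.T

def sparse_attention(q, k, v, q_lite, k_lite, top_k=2048):
    # q, k, v: (L, d); q_lite, k_lite: (L, d_lite)
    L, d = q.shape
    scores = lightning_indexer_scores(q_lite, k_lite)        # still L x L, but lightweight
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    k_eff = min(top_k, L)
    topk_idx = scores.topk(k_eff, dim=-1).indices            # (L, k_eff)
    k_sel, v_sel = k[topk_idx], v[topk_idx]                  # (L, k_eff, d)
    attn = torch.einsum("ld,lkd->lk", q, k_sel) / d ** 0.5   # O(L*k) full attention
    # Guard against non-causal picks in rows with fewer than k_eff valid positions.
    valid = topk_idx <= torch.arange(L).unsqueeze(1)
    attn = attn.masked_fill(~valid, float("-inf"))
    return torch.einsum("lk,lkd->ld", F.softmax(attn, dim=-1), v_sel)

# Toy usage: a 4096-token sequence attends to at most 256 tokens per query.
L, d, d_lite = 4096, 64, 16
q, k, v = (torch.randn(L, d) for _ in range(3))
q_lite, k_lite = torch.randn(L, d_lite), torch.randn(L, d_lite)
print(sparse_attention(q, k, v, q_lite, k_lite, top_k=256).shape)  # (4096, 64)
```

Note that the indexer itself still scans all L×L pairs, but in a tiny projection space; the expensive full-dimensional attention runs over only k keys per query, which is where the O(L×k) saving comes from.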
This improvement is not just a theoretical optimization. The paper's measurements show that at a 128K context length, DSA sharply reduces inference cost with almost no loss in quality. More strikingly, on AA-LCR (a long-text reasoning benchmark) and Fiction.liveBench (a fiction-comprehension test), V3.2 actually outperformed V3.1 with traditional attention. DSA is not only faster; in some scenarios it is also better.
At the resource investment level, DeepSeek made an extraordinary decision.
The paper states it plainly: "In recent months, performance improvement has been consistently correlated with an expanding RL training budget, which has exceeded 10% of the pre-training cost." Such a figure is extremely rare in the open-source community. Concretely, DeepSeek trained specialist models in six major domains, including mathematics, programming, reasoning, and agentic tasks, each with its own large-scale reinforcement-learning run. In the continued pre-training stage, the model was trained on 943.7B tokens at a 128K context length, and then mixed-trained with the GRPO (Group Relative Policy Optimization) algorithm to integrate three task types: reasoning, agentic tasks, and human alignment.
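For readers unfamiliar with GRPO, its core trick is that it needs no learned value model: each sampled response is scored against the other responses drawn for the same prompt. Below is a simplified sketch of that group-relative advantage computation, the textbook GRPO recipe rather than code from the paper, with illustrative 0/1 rewards.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size), one scalar reward per sampled
    # response. Each response's advantage is its reward standardized
    # against the group sampled for the SAME prompt; no value network.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 8 responses to one math prompt, rewarded 1 if the final answer
# checks out and 0 otherwise. Correct answers get positive advantage.
rewards = torch.tensor([[1., 0., 0., 1., 0., 0., 0., 1.]])
print(grpo_advantages(rewards))
```

These advantages are then plugged into a PPO-style clipped policy-gradient objective; dropping the value model is what makes the method cheap enough to scale across many domains.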
In terms of Agent ability enhancement, DeepSeek developed a systematic task synthesis process.
They synthesized more than 1,800 diverse environments and 85,000 complex prompts covering a wide range of real-world scenarios: 24,667 code-agent tasks, 50,275 search-agent tasks, 4,417 general-agent tasks, and 5,908 code-interpreter tasks. The data is not randomly generated: in a cold-start stage the model learns a unified pattern of reasoning and tool use, and in a scale-up stage that pattern is used to systematically generate high-quality training scenarios.
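The paper does not publish this pipeline, but the two-stage shape the article describes can be sketched in a few lines. Every name below is hypothetical: real environments and verifiers would be far richer than these string templates, which merely stand in for patterns established during the cold-start stage.

```python
import json
import random

# Hypothetical task families mirroring the four categories in the article;
# the templates stand in for patterns learned during cold start.
TASK_FAMILIES = {
    "code_agent":       "Fix the failing unit test in repository '{env}'.",
    "search_agent":     "Answer the research question using the corpus '{env}'.",
    "general_agent":    "Complete the multi-step task in environment '{env}'.",
    "code_interpreter": "Solve the problem by writing and running code in '{env}'.",
}

def synthesize_tasks(family, environments, n, seed=0):
    # Scale-up stage: stamp out n training prompts for one family by
    # pairing its template with sampled environments.
    rng = random.Random(seed)
    template = TASK_FAMILIES[family]
    tasks = []
    for _ in range(n):
        env = rng.choice(environments)
        tasks.append({"family": family, "env": env,
                      "prompt": template.format(env=env)})
    return tasks

envs = [f"synthetic-env-{i:04d}" for i in range(1800)]  # ~1,800 environments
print(json.dumps(synthesize_tasks("search_agent", envs, n=3), indent=2))
```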
The effect is remarkable. On agent benchmarks, DeepSeek V3.2 significantly narrowed the gap with closed-source models, reaching an 80.3% success rate on MCP-Universe. That still trails Gemini's 87.9%, but it is the best result among open-source models. The paper concluded: "DeepSeek V3.2 has become a cost-effective choice in agent scenarios, significantly narrowing the performance gap between open-source and cutting-edge closed-source models."
The paper ends on a thought-provoking note: "If Gemini 3.0 proves the potential of scaling continued pre-training, DeepSeek V3.2-Speciale proves the scalability of reinforcement learning in a large-scale context environment." The implication is clear: closed-source giants have the resources to pile on pre-training, but open-source can find its own way, achieving comparable results with fewer resources through a more efficient architecture and more scientific post-training.
This may be the only way for open-source AI to survive: not competing head-on on resources, but competing on innovation in the technical route. This time at least, DeepSeek has shown the path is viable.
Paper link: https://arxiv.org/html/2512.02556v1#S5
Compiled by: Zhou Huaxiang
This article is from the WeChat official account "Silicon Star GenAI", written by the Large-Model Mobile Group, and published by 36Kr with authorization.