DeepSeek V4 leads open-source models in code capabilities, with prices as low as 1% of competitors'.
After the open-source preview of DeepSeek V4 launched, the first wave of third-party leaderboard results has been released.
Multiple evaluations show that DeepSeek V4's performance, especially on code tasks, has entered the top tier of open-source models. At the same time, its combination of a million-token context and low prices further lowers the barrier to entry for developers.
According to third-party evaluations, the evaluation platform Arena.ai said on X that V4 Pro (thinking mode) is "a significant leap compared to DeepSeek V3.2". In its code arena, the model ranked 3rd among open-source models and 14th overall. Another evaluator, Vals AI, said V4 "overwhelmingly" topped the open-weight models on its Vibe Code Benchmark, defeating closed-source models such as Gemini 3.1 Pro and delivering roughly a 10x performance leap over the previous-generation V3.2.
On pricing, V4-Flash's output price is $0.28 per million tokens, more than 99% lower than Claude Opus 4.7's, while V4-Pro's output price of $3.48 makes it one of the cheapest options among cutting-edge models of its class. The comparison table shows Flash in the lowest price range among small-scale models, and Pro likewise in the lower range of cutting-edge large-scale models.
Discussion of the actual experience has started to diverge. Many users on X said its cost-performance ratio "shatters the market". DeepSeek itself, however, remained cautious in its own materials, stating that on knowledge and reasoning it is close to closed-source systems but still about 3 to 6 months behind. It also warned that, "limited by high-end computing power", throughput of the Pro service is constrained, and that a price cut is expected in the future.
Third-party evaluations: Leading in code capabilities, closing in on the top of the overall rankings
Shortly after the release of OpenAI's GPT-5.5, the preview version of DeepSeek-V4 officially launched and was open-sourced at the same time. It comprises V4-Pro, with 16 trillion total parameters (49B active), and V4-Flash, with 284 billion total parameters (13B active). Both models support an ultra-long context window of 1 million tokens and are released under the MIT license.
On the day of V4's release, the model evaluation platform Arena.ai announced that DeepSeek V4 Pro (thinking mode) ranked 3rd among open-source models and 14th overall in its code arena, and characterized this release as "a significant leap compared to DeepSeek V3.2". Arena.ai also tested V4 Flash; both models support the 1-million-token context.
The evaluation results from Vals AI are even more striking. The platform said that DeepSeek V4 "overwhelmingly" became the number-one open-weight model on its Vibe Code Benchmark, not only surpassing the second-ranked Kimi K2.6 but also defeating closed-source cutting-edge models such as Gemini 3.1 Pro.
Vals AI particularly emphasized that V4 achieved roughly a 10x performance leap over V3.2: "V3.2 only scored 5 points on this benchmark, and this is not a typo." In the Vals comprehensive index ranking, V4 finished 2nd, only 0.07% behind the top-ranked Kimi K2.6.
The community's response has been very positive. On X, user Sigrid Jin called it a new "shocking moment" and noted that "now you can run a model similar to GPT-5.4 at home". He wrote:
"Sorry, GPT - 5.5. DeepSeek V4 is the new shocking moment. It defeated GPT - 5.4 in high - intensity mode in the code arena."
User Ejaaz said:
"China is leading in AI. They have caught up. DeepSeek V4 Flash is 99% cheaper than Opus 4.7, only costing $0.28 per million tokens, and it ranked first in the code arena. This is not a typo."
Some users expressed reservations. X user Michael Anti said after trying it that V4 Flash's actual experience did not surpass the already well-polished V3.2, and called the upgrade disappointing for existing users.
Official self-assessment: Cautious wording, with the smallest gap in code and agent tasks
DeepSeek has always been measured when commenting on its own performance. Official documentation shows that on knowledge and reasoning tasks, V4-Pro has surpassed mainstream open-source models and is close to closed-source systems such as Gemini, but remains roughly 3 to 6 months behind the most advanced cutting-edge models. On agent and code tasks, its performance approaches, and in places exceeds, that of Claude Sonnet.
On internal usage, DeepSeek said V4 has become the main model for agentic coding among its employees. Evaluation feedback indicates the experience is better than Claude Sonnet 4.5 and that delivery quality is close to Opus 4.6 in non-thinking mode, though a gap remains versus Opus 4.6 in thinking mode.
In math, STEM, and competition-level code evaluations, V4-Pro has surpassed all publicly evaluated open-source models, including Moonshot AI's Kimi K2.6 Thinking and Zhipu's GLM-5.1 Thinking, with results comparable to top closed-source models.
Blogger Simon Willison pointed out in his evaluation write-up that V4-Pro (16 trillion parameters) is currently the largest known open-weight model, exceeding Kimi K2.6 (11 trillion), GLM-5.1 (754 billion), and DeepSeek V3.2 (685 billion), giving enterprise users interested in local deployment a new option.
He also shared the pelican illustrations generated by the different models:
[image: pelican drawn by DeepSeek-V4-Flash]
[image: pelican drawn by DeepSeek-V4-Pro]
Pricing: As low as 1% of competitors, with further cuts possible in the second half of the year
DeepSeek's pricing strategy is the part of this release the market cares about most. V4-Flash's input/output prices are $0.14/$0.28 per million tokens, lower than OpenAI's GPT-5.4 Nano ($0.20/$1.25) and Gemini 3.1 Flash-Lite ($0.25/$1.50), making it the cheapest of the current small-scale models.
V4-Pro's input/output prices are $1.74/$3.48, also lower than Gemini 3.1 Pro ($2/$12), GPT-5.4 ($2.50/$15), Claude Sonnet 4.6 ($3/$15), and Claude Opus 4.7 ($5/$25).
Price comparison data compiled by blogger Simon Willison shows V4-Pro as the lowest-cost option among large-scale cutting-edge models and V4-Flash as the lowest-cost among small-scale models, cheaper even than OpenAI's GPT-5.4 Nano.
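For a concrete sense of the gap, here is a minimal sketch that applies the list prices quoted above to a sample workload; the 10M-input/2M-output token volumes are illustrative assumptions, not figures from DeepSeek.

```python
# Per-million-token (input, output) prices in USD, as quoted in this article.
PRICES = {
    "DeepSeek V4-Flash": (0.14, 0.28),
    "DeepSeek V4-Pro":   (1.74, 3.48),
    "Gemini 3.1 Pro":    (2.00, 12.00),
    "GPT-5.4":           (2.50, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7":   (5.00, 25.00),
}

def workload_cost(model: str, m_in: float, m_out: float) -> float:
    """USD cost for m_in million input tokens and m_out million output tokens."""
    p_in, p_out = PRICES[model]
    return m_in * p_in + m_out * p_out

# Hypothetical monthly workload: 10M input tokens, 2M output tokens.
baseline = workload_cost("Claude Opus 4.7", 10, 2)
for model in PRICES:
    cost = workload_cost(model, 10, 2)
    print(f"{model:18s} ${cost:7.2f}  ({cost / baseline:5.1%} of Opus 4.7)")
```

On this mix, V4-Flash comes out around 2% of the Opus 4.7 bill, consistent with the "99% cheaper" framing on output price.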
DeepSeek attributes its ability to price this low to extreme efficiency optimization in ultra-long-context scenarios. Official data shows that at 1 million tokens, V4-Pro's per-token inference compute is only 27% of V3.2's and its KV cache only 10%; for V4-Flash the figures drop to 10% and 7% respectively.
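The KV-cache number is the one that matters most at a 1-million-token context, since KV memory, not model weights, typically dominates per-session serving cost at such lengths. A back-of-envelope sketch: the layer and head dimensions below are hypothetical placeholders (DeepSeek has not published V4's serving configuration); only the 10% and 7% ratios come from the article.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, bytes_per_elem: int = 2) -> float:
    """Memory for the K and V tensors of one sequence, in GiB (bf16 default).
    The leading 2 accounts for storing both keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem / 2**30

# Hypothetical baseline config for a V3.2-class model at a 1M-token context.
baseline = kv_cache_gib(n_layers=61, n_kv_heads=8, head_dim=128, n_tokens=1_000_000)
print(f"baseline 1M-token KV cache: {baseline:.1f} GiB per session")
print(f"at the quoted 10% ratio (V4-Pro):   {0.10 * baseline:.1f} GiB")
print(f"at the quoted  7% ratio (V4-Flash): {0.07 * baseline:.1f} GiB")
```

At that scale, a 10x KV-cache cut is roughly the difference between serving one 1M-token session per accelerator and serving ten.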
Notably, DeepSeek stated in its pricing notes that "limited by high-end computing power, the service throughput of Pro is currently very limited; it is expected that after Ascend 950 super-nodes ship at scale in the second half of the year, the price of Pro will be significantly reduced", suggesting further room for price cuts.
Technical architecture: Hybrid attention breaks the long-context bottleneck, compatible with domestic computing power
The core technical innovation of DeepSeek-V4 is its novel "CSA (Compressed Sparse Attention) + HCA (Heavily Compressed Attention)" hybrid attention architecture, which aims to solve a long-standing industry pain point: traditional attention scales quadratically with context length, making ultra-long contexts impractical in memory and compute terms.
CSA compresses every 4 tokens into an information block and retrieves the most relevant content via sparse lookup, sharply cutting computation while preserving mid-sequence detail. HCA condenses massive amounts of information into framework-level blocks and focuses on global logic.
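DeepSeek has not published the CSA/HCA internals, so the numpy toy below only illustrates the compress-then-retrieve pattern described above: pool every 4 tokens into one block summary, score blocks against the query, and run exact attention only inside the top-scoring blocks. The function name, the mean-pooling compressor, and all sizes are assumptions; only block=4 matches the article's "every 4 tokens".

```python
import numpy as np

def csa_attend(q, keys, values, block=4, topk=8):
    """Toy compressed-sparse attention for a single query vector q.

    1) compress: mean-pool every `block` consecutive tokens into one block key;
    2) retrieve: score blocks against q and keep the `topk` best;
    3) attend: exact softmax attention over tokens in the selected blocks only.
    """
    n, d = keys.shape
    n_blocks = n // block
    block_keys = keys[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    top = np.argsort(block_keys @ q)[-topk:]                 # best-matching blocks
    idx = (top[:, None] * block + np.arange(block)).ravel()  # their token indices
    scores = keys[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values[idx]

rng = np.random.default_rng(0)
keys = rng.normal(size=(4096, 64))
vals = rng.normal(size=(4096, 64))
out = csa_attend(rng.normal(size=64), keys, vals)  # touches 32 of 4096 tokens
```

In this toy, the cost per query drops from 4096 raw key comparisons to 1024 block scores plus 32 exact scores; HCA would sit above this, keeping only framework-level summaries for global reasoning.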
In addition, V4 introduces mHC, a manifold-constrained hyper-connection (upgrading the traditional residual connection to constrain signal propagation to a stable manifold), and the Muon optimizer (replacing the traditional AdamW, and compatible with large MoE models and low-precision training). Official data shows that full-pipeline engineering optimization delivers up to nearly 2x inference acceleration.
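The exact mHC formulation is not public, but Muon is an openly published optimizer. Here is a minimal single-matrix sketch of its core update, following the public reference implementation (momentum, then Newton-Schulz orthogonalization of the update); the hyperparameters are illustrative, and DeepSeek's MoE/low-precision variant may differ.

```python
import torch

@torch.no_grad()
def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)
    using the odd quintic iteration from the public Muon implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update for a 2-D weight matrix: accumulate momentum,
    replace the update with its orthogonalized form, then apply it."""
    momentum.mul_(beta).add_(grad)
    param.add_(newton_schulz(momentum), alpha=-lr)

# Usage on a single hidden-layer weight (Muon targets 2-D matrices;
# embeddings, norms, and biases typically stay on AdamW).
W = torch.randn(512, 512)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
```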
On compatibility with domestic computing power, DeepSeek-V4 has completed full verification of a fine-grained expert-parallel optimization scheme on the Huawei Ascend NPU platform, achieving speedups of 1.50 to 1.73x on general inference workloads. DeepSeek officially stated that V4 is the world's first trillion-parameter-level model to complete training on domestic computing power.
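The Ascend-specific scheme has not been detailed, but "fine-grained expert parallelism" generally means sharding a MoE layer's experts across devices so each device computes only the tokens routed to the experts it hosts. Below is a toy single-process simulation of that dispatch pattern; all shapes and the top-1 router are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, n_devices = 16, 8, 8, 4
per_device = n_experts // n_devices  # 2 experts hosted per device

tokens = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))           # routing projection
experts = rng.normal(size=(n_experts, d_model, d_model))   # one weight per expert

expert_ids = (tokens @ router_w).argmax(axis=1)            # top-1 routing

out = np.zeros_like(tokens)
for device in range(n_devices):
    for e in range(device * per_device, (device + 1) * per_device):
        mask = expert_ids == e
        # In a real deployment an all-to-all collective ships these tokens
        # to `device`; here we just index into the shared arrays.
        out[mask] = tokens[mask] @ experts[e]

print("tokens per expert:", np.bincount(expert_ids, minlength=n_experts))
```

In practice, the "fine-grained" part is about overlapping that all-to-all communication with expert computation, which is exactly where platform-specific tuning such as the Ascend work pays off.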