DeepSeek V3.1 Base suddenly went live and beat Claude 4, with strikingly impressive programming ability. The whole internet is now waiting for R2 and V4.
Just last night, DeepSeek quietly launched the brand-new V3.1 version through official channels, expanding the context length to 128K.
The newly open-sourced V3.1 model has 685B parameters and supports multiple precision formats, from BF16 to FP8.
Based on the public information available and hands-on tests by the Chinese blogger karminski3, the highlights of the V3.1 update are as follows:
Programming ability: outstanding. According to Aider test data circulating in the community, V3.1 ranks first among open-source models.
Performance breakthrough: V3.1 scored 71.6% on the Aider programming benchmark, surpassing Claude Opus 4, while also reasoning and responding faster.
Native search: it adds support for native "search tokens", which points to better search support.
Architectural innovation: the "R1" label has been removed from the online model; analysts expect DeepSeek to adopt a "hybrid architecture" going forward.
Cost advantage: a complete programming task costs only $1.01, about one-sixtieth the cost of proprietary systems.
It is worth noting that the official group announcement emphasized the expansion to 128K context, even though the previous V3 version already supported it.
Enthusiasm for this wave of updates is running high.
Even before the model card was published, DeepSeek V3.1 had already climbed to fourth place on Hugging Face's trending list.
DeepSeek's follower count on Hugging Face has surpassed 80,000.
Seeing this, netizens are looking forward to the release of R2 even more!
Hybrid reasoning, outperforming Claude 4 in programming
The most visible change this time is that DeepSeek removed the "R1" from the "Deep thinking (R1)" button on its official app and web version.
Meanwhile, compared with V3-Base, DeepSeek V3.1 adds four special tokens:
<|search▁begin|> (id: 128796)
<|search▁end|> (id: 128797)
<think> (id: 128798)
</think> (id: 128799)
There is speculation that this hints at merging reasoning and non-reasoning models into one.
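As a quick sanity check, here is a minimal sketch of how one could look up these tokens and their IDs with the Hugging Face tokenizer. The repo id "deepseek-ai/DeepSeek-V3.1" is our assumption; adjust it to wherever the weights are actually published.

```python
# Minimal sketch: verify the four new special tokens and their IDs.
# The repo id "deepseek-ai/DeepSeek-V3.1" is an assumption; adjust as needed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-V3.1", trust_remote_code=True
)

for token in ["<|search▁begin|>", "<|search▁end|>", "<think>", "</think>"]:
    print(f"{token!r} -> id {tokenizer.convert_tokens_to_ids(token)}")
```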
On programming, according to results shared by netizens, DeepSeek V3.1 scored 71.6% on the Aider Polyglot multi-language programming benchmark, beating both Claude 4 Opus and DeepSeek R1 in one go.
Moreover, a run costs only about $1, making it the SOTA among non-reasoning models.
The starkest contrast: V3.1's programming score is about 1 percentage point higher than Claude 4's, at roughly 1/68 of the cost.
On the SVGBench benchmark, V3.1 is second only to GPT-4.1-mini, far ahead of DeepSeek R1.
On MMLU multi-task language understanding, DeepSeek V3.1 holds its own against GPT-5. However, a gap remains between V3.1 and GPT-5 in programming, graduate-level benchmark Q&A, and software engineering.
According to one netizen's hands-on test, in the physics test that simulates a small ball falling freely inside a hexagon, DeepSeek V3.1's understanding has improved markedly.
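For readers who want to try this kind of test themselves, below is a minimal sketch of the underlying physics task (not the netizen's actual prompt or the model's output): a ball under gravity bouncing elastically off the six walls of a regular hexagon, with simple Euler integration.

```python
# Minimal sketch of the "ball falling inside a hexagon" physics task.
import math

R = 1.0                                   # hexagon circumradius
apothem = R * math.sqrt(3) / 2            # distance from center to each wall
normals = [(math.cos(math.radians(30 + 60 * k)),
            math.sin(math.radians(30 + 60 * k))) for k in range(6)]

pos = [0.0, 0.5]                          # ball starts above center
vel = [0.3, 0.0]
g, dt, ball_r = -9.8, 0.002, 0.05

for _ in range(5000):                     # ~10 simulated seconds
    vel[1] += g * dt                      # gravity (simple Euler integration)
    pos[0] += vel[0] * dt
    pos[1] += vel[1] * dt
    for nx, ny in normals:                # elastic bounce off each wall
        vn = vel[0] * nx + vel[1] * ny
        if pos[0] * nx + pos[1] * ny > apothem - ball_r and vn > 0:
            vel[0] -= 2 * vn * nx         # reflect velocity about the wall normal
            vel[1] -= 2 * vn * ny

print(f"final position: ({pos[0]:.3f}, {pos[1]:.3f})")
```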
First-hand test
We put V3.1 through a hands-on test right away. First, the headline feature of this update: context length.
Assuming roughly 0.8–1.25 Chinese characters per token, 128K tokens correspond to about 100,000–160,000 Chinese characters.
That is equivalent to about 1/6–1/8 of the full main text of "A Dream of Red Mansions" (roughly 800,000–1,000,000 characters), or an extremely long doctoral dissertation or multi-volume academic monograph.
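A quick back-of-envelope check of that arithmetic (the characters-per-token range is the assumption stated above):

```python
# Back-of-envelope check of the context-window arithmetic above.
tokens = 128 * 1024                        # 128K-token context window
low, high = 0.8, 1.25                      # assumed Chinese characters per token
print(round(tokens * low), round(tokens * high))    # ~104858 to 163840 characters

novel = 900_000                            # "A Dream of Red Mansions": ~0.8-1.0M chars
print(f"{tokens * low / novel:.2f} to {tokens * high / novel:.2f}")  # ~0.12 to 0.18
```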
Our hands-on test roughly matched this estimate: DeepSeek told us it could only read about 9% of the novel, i.e., roughly one-tenth.
Since the full text is far too long, we clipped out the first three chapters for it to summarize. How does this summary look?
In the 128K-context test, DeepSeek-V3.1's output speed is much faster than before; some engineering-level optimizations have clearly been made.
In this update, DeepSeek emphasized its context support.
So let's put some pressure on DeepSeek-V3.1 and have it generate as much content as possible around the character "dream", trying to push toward the context limit.
In the end, however, the model stopped after generating only about 3,000 characters.
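This kind of stress test is easy to reproduce through DeepSeek's OpenAI-compatible API; a minimal sketch follows. The model name "deepseek-chat", the prompt wording, and the max_tokens value are our assumptions, not the article's exact setup.

```python
# Minimal sketch of the long-output stress test via DeepSeek's
# OpenAI-compatible API. Model name, prompt, and max_tokens are assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{
        "role": "user",
        "content": "Write as much continuous content as possible on the theme of 'dream'.",
    }],
    max_tokens=8192,  # request a long completion to probe the output limit
)

text = response.choices[0].message.content
print(len(text), "characters generated")
```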
Next, let's look at the reasoning ability.
On the classic question of which is larger, 9.11 or 9.9, it answers correctly no matter which way the question is asked.
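For reference, the numerically correct comparison that this trick question targets:

```python
# 9.9 equals 9.90 as a decimal, so it is greater than 9.11
# (models sometimes mis-read "11" > "9", version-number style).
print(9.9 > 9.11)  # True
```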
One of the most noticeable things about this update is how much faster it feels.
Finally, let's look at programming ability.
DeepSeek's previous release, R1-0528, focused mainly on programming.
Let's see whether V3.1 improves on it further this time.
We can