
DeepSeek V3.1 Base was suddenly launched. It outperformed Claude 4 in programming, and the whole internet is waiting for R2 and V4.

新智元 · 2025-08-20 11:48
The new DeepSeek V3.1 is officially live: it has a 128K context length, its programming capability outperforms Claude 4 Opus, and a complete programming task costs as little as $1.

Just last night, DeepSeek quietly launched a brand-new V3.1 version, extending the context length to 128K.

The newly open-sourced V3.1 model has 685B parameters and supports multiple precision formats, from BF16 to FP8.

Based on public information and hands-on tests by the Chinese expert karminski3, the highlights of this V3.1 update are as follows:

Programming ability: outstanding. In the Aider benchmark results circulating in the community, V3.1 ranks first among open-source models.

Performance breakthrough: V3.1 scored a high 71.6% on the Aider programming benchmark, surpassing Claude Opus 4, while also inferring and responding faster.

Native search: it adds native "search tokens", which should mean better built-in search support.

Architectural innovation: the "R1" label has been removed from the online model. Analysts expect DeepSeek to adopt a "hybrid architecture" going forward.

Cost advantage: each complete programming task costs only $1.01, roughly one-sixtieth of the cost of proprietary systems.

It's worth noting that the official group chat emphasized that 128K context was already supported in the previous V3 version.

People are extremely enthusiastic about this wave of updates.

Even before the model card was released, DeepSeek V3.1 had already reached fourth place on the Hugging Face trending list.

The number of DeepSeek fans has exceeded 80,000.

Seeing this, netizens are even more looking forward to the release of R2!

Hybrid inference, outperforming Claude 4 in programming

The most obvious change this time is that DeepSeek removed the "R1" from "Deep thinking (R1)" in the official app and web versions.

Meanwhile, compared with V3-base, DeepSeek V3.1 adds four special tokens:

<|search▁begin|> (id: 128796)

<|search▁end|> (id: 128797)

<think> (id: 128798)

</think> (id: 128799)

This has fueled speculation that V3.1 may integrate reasoning and non-reasoning models into one.
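The four new tokens and their ids can be captured in a small lookup table; below is a minimal sketch, where the ids come from the list above and `wrap_thinking` is a hypothetical helper for illustration, not a DeepSeek API:

```python
# Special tokens reported for DeepSeek V3.1, with ids from the list above
SPECIAL_TOKENS = {
    "<|search▁begin|>": 128796,
    "<|search▁end|>": 128797,
    "<think>": 128798,
    "</think>": 128799,
}

def wrap_thinking(reasoning: str) -> str:
    """Wrap a reasoning trace in the think tags, as a hybrid model might emit it."""
    return f"<think>{reasoning}</think>"

# The four ids are contiguous, consistent with being appended to the V3-base vocabulary
assert sorted(SPECIAL_TOKENS.values()) == list(range(128796, 128800))
```

If the speculation about a hybrid architecture holds, a single model could switch between emitting and omitting the `<think>…</think>` span depending on whether deep reasoning is requested.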

In terms of programming, results posted by netizens show that DeepSeek V3.1 scored 71.6% on the Aider Polyglot multi-language programming benchmark, beating both Claude 4 Opus and DeepSeek R1 in one stroke.

Moreover, a run costs only about $1, making it the SOTA among non-reasoning models.

The most striking contrast: V3.1's programming score is 1 percentage point higher than Claude 4's, at a cost 68 times lower.
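The cost claim can be sanity-checked with simple arithmetic; a quick sketch using only the article's own figures ($1.01 per task, a 68x cost gap):

```python
v31_cost_per_task = 1.01  # dollars per complete Aider task, per the figure above
cost_ratio = 68           # "68 times lower", per the comparison above

# Implied per-task cost of the proprietary competitor under these figures
implied_competitor_cost = v31_cost_per_task * cost_ratio
assert round(implied_competitor_cost, 2) == 68.68  # roughly $69 per task
```

Note that the implied competitor cost is only as reliable as the two input figures; it is a back-of-the-envelope check, not a published price.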

On the SVGBench benchmark, V3.1 is second only to GPT-4.1-mini, far exceeding DeepSeek R1.

On MMLU multi-task language understanding, DeepSeek V3.1 is on par with GPT-5. However, a gap remains between V3.1 and GPT-5 in programming, graduate-level benchmark Q&A, and software engineering.

A netizen's hands-on test shows that in a physics simulation of a small ball free-falling inside a hexagon, DeepSeek V3.1's understanding has improved significantly.

First-hand test

We conducted an actual test on V3.1 immediately. First, let's focus on the key point of this model update: the context length.

Assuming that for Chinese, 1 token ≈ 1–1.3 Chinese characters, then 128K tokens ≈ 100,000–160,000 Chinese characters.

This is equivalent to 1/6–1/8 of the entire text of "A Dream of Red Mansions" (about 800,000–1,000,000 characters), or a very long doctoral thesis or voluminous academic monograph.
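The back-of-the-envelope estimate above can be written out explicitly; a minimal sketch under the article's assumed ratio of 1–1.3 Chinese characters per token:

```python
def chars_from_tokens(n_tokens, lo=1.0, hi=1.3):
    """Rough range of Chinese characters covered by n_tokens (assumed ratio)."""
    return n_tokens * lo, n_tokens * hi

low, high = chars_from_tokens(128_000)
# ≈ 128,000 to ≈ 166,400 characters, which the article rounds to 100,000–160,000;
# against the novel's ~800,000–1,000,000 characters, that is on the order of 1/6–1/8
```

The exact numbers depend entirely on the assumed characters-per-token ratio, which varies by tokenizer and text.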

The hands-on test matched this estimate fairly well: DeepSeek told us it could read only about 9% of the novel, roughly one-tenth.

Since a full summary would be too long, we cut it down to the first three chapters. What do you think of this summary?

In the 128K context test, DeepSeek-V3.1's output speed has improved greatly over past versions, with some engineering optimizations evidently in place.

In this update, DeepSeek emphasized the support for context.

Let's put some pressure on DeepSeek-V3.1 and have it output as much content as possible based on the character "dream", to try to reach the context limit.

However, the model stopped after outputting only about 3,000 characters.

Now let's look at the inference ability.

For the classic problem of comparing 9.11 and 9.9, it answers correctly regardless of how the question is phrased.
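The underlying comparison the model must get right is trivial to verify in code; a tiny sanity check:

```python
from decimal import Decimal

# Align the scales: 9.9 is 9.90, which exceeds 9.11
# even though "11" looks bigger than "9" digit-by-digit
assert Decimal("9.9") > Decimal("9.11")
assert 9.9 > 9.11  # plain floats agree here
```

Models that fail this question typically compare the fractional parts as integers (11 vs 9) instead of as decimal fractions (.11 vs .90), which is exactly what the scale alignment above avoids.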

One of the most obvious feelings about this update is that the speed has become much faster.

Finally, let's take a look at the programming ability.

DeepSeek's previous model, R1-0528, focused on programming ability.

Let's see if V3.1 has made greater improvements this time.

We can only give it a score of 80. It meets