
Without any warning, DeepSeek suddenly updated the R1 paper to 86 pages. This is what "open" really means.

新智元, 2026-01-09 11:11
The DeepSeek R1 paper has been expanded to 86 pages: reinforcement learning enhances reasoning capabilities, and the open-source model rivals closed-source ones.

The R1 paper has ballooned to 86 pages! DeepSeek is proving to the world that open source can not only catch up with closed source, it can teach closed source how things are done!

The whole network is shocked!

Two days ago, DeepSeek quietly updated the R1 paper, "expanding" it from the original 22 pages to 86 pages.

The brand-new paper proves that reinforcement learning alone can enhance the reasoning ability of AI!

DeepSeek seems to be preparing a big move. Some netizens even speculate that a pure reinforcement learning method might appear in R2.

This update upgrades the original paper into a technical report that the open-source community can fully reproduce.

Paper address: https://arxiv.org/abs/2501.12948

The new DeepSeek-R1 content in the paper is packed with valuable information —

  • Precise data recipe: the exact data scale (26,000 math problems, 17,000 code problems) and the specific construction process
  • Infrastructure description: schematic diagrams of the vLLM/DualPipe setup
  • Training cost breakdown: approximately $294,000 in total (R1-Zero ran on H800 GPUs for 198 hours)
  • A replay of "failed attempts": an in-depth explanation of why PRM did not work out
  • Model comparisons: systematic comparisons with DS-V3, Claude, and GPT-4o (previously only o1 was included)
  • A 10-page safety report: detailed safety assessment and risk analysis

The results show that DeepSeek R1 is comparable to OpenAI o1 across multiple capabilities, and even surpasses o1-mini, GPT-4o, and Claude 3.5.

Moreover, the core-contributor list at the end of the paper spells out each author's specific contributions.

Some netizens said this update reads like a textbook! In particular, the details of DeepSeek-R1-Zero's self-evolution are the real highlight.

It's worth mentioning that the DeepSeek application also added a new feature a few days ago — voice input support. Some netizens speculate that they might be focusing on multimodality.

Next, let's break down the core highlights of the latest paper content.

DeepSeek R1 has a major update, matching the strength of o1

First, let's look at DeepSeek-R1's specific evaluation results.

The latest evaluation still covers a comprehensive comparison across tasks such as mathematical reasoning, coding, general knowledge and understanding, and factuality and instruction following.

On educational knowledge benchmarks, including MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 overall outperforms DS-V3.

In particular, accuracy on STEM-related questions has improved significantly — and RL deserves most of the credit.

In addition, on the long-context Q&A task (FRAMES), DeepSeek-R1 performs outstandingly, with excellent document understanding and analysis capabilities.

On math and coding tasks, DeepSeek-R1 is essentially on par with OpenAI-o1-1217 and significantly ahead of other models.

On more practical programming tasks, OpenAI-o1-1217 performs better than DeepSeek-R1 on Aider, but the two are at a comparable level on SWE-bench Verified.

In DeepSeek's view, this is mainly because there is not enough engineering-oriented RL training data, so DeepSeek-R1's capabilities in this area have not been fully realized.

In the next version, we may see a significant improvement in this area.

The following figure shows the performance of DeepSeek-R1 and DeepSeek-R1-Zero compared with human experts across multiple benchmark competitions.

  • AIME math competition: DeepSeek-R1's score exceeds the average human level.
  • Codeforces programming competition: DeepSeek-R1 outperforms 93.6% of participants, showing extremely strong problem-solving ability.
  • GPQA scientific Q&A: humans are stronger overall and perform better than DeepSeek-R1.

DeepSeek believes that if R1 could also access the Internet, it might catch up with and even surpass the current human level.

In the human evaluation stage, ChatbotArena was used, with ELO scores reflecting DeepSeek-R1's performance on human preferences.

Clearly, R1 has achieved remarkable results. In particular, under "style control" it tied for first place with OpenAI-o1 and Gemini-Exp-1206.

The design of "style control" directly addresses a key question: can a model "please" human reviewers through longer, more elaborate, or better-looking answers, even if the content itself is not necessarily stronger?

DeepSeek emphasizes that an open-source model under the MIT license, with overall performance comparable to multiple closed-source AIs, is undoubtedly an important milestone.

Especially when DeepSeek-R1 has a lower usage cost.

Figure 12 below further shows the rankings under different evaluation dimensions, demonstrating R1's strength in multiple fields such as math and programming.

This indicates that R1 not only has strong reasoning ability but also performs quite well in various practical application scenarios.

In terms of data, DeepSeek has released the specific scales of the RL data and the fine-tuning data.

In the reinforcement learning stage, the data ratio is allocated as follows: math (26k), code (17k), STEM (22k), logic (15k), general (66k).

In the fine-tuning stage, the data scale is about 800k, covering reasoning, general instruction tasks, and format/language-consistency samples.
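As a rough illustration, the RL data mix above implies the following sampling proportions. The per-domain counts come from the paper; the percentage arithmetic is my own sanity check, not a figure from the report:

```python
# RL-stage prompt counts reported in the paper (in thousands)
rl_data = {"math": 26, "code": 17, "stem": 22, "logic": 15, "general": 66}

total = sum(rl_data.values())  # 146 (i.e. 146k prompts in total)
shares = {k: round(v / total * 100, 1) for k, v in rl_data.items()}

print(total)   # 146
print(shares)  # math is about 17.8%, general about 45.2% of the mix
```

Note that general-domain data dominates the RL mix, with math the largest of the reasoning-specific slices.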

Distillation enables one-click migration of reasoning ability

In the distillation part, DeepSeek answers this question —

Can the "reasoning ability" learned by DeepSeek-R1 be effectively and stably migrated to smaller models?

Here, DeepSeek-R1 acts as the "teacher" model, generating high-quality, explicit reasoning-trajectory data, and its reasoning ability is "distilled" into smaller "student" models through SFT, rather than having the small models run RL again.

Through distillation, the small models directly learn the reasoning patterns that R1 has already verified to be effective, without having to re-explore the reward space.
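The pipeline described above can be sketched in a few lines. This is a minimal illustration only: `teacher_generate` is a stand-in for sampling from DeepSeek-R1, and the `<think>...</think>` formatting of the SFT target is a hypothetical template, not the paper's exact format:

```python
# Minimal sketch of distillation-by-SFT data construction.
# teacher_generate stands in for sampling from the teacher model
# (DeepSeek-R1); in practice it would call the real model.

def teacher_generate(prompt: str) -> tuple[str, str]:
    """Stand-in teacher: returns (reasoning_trace, final_answer)."""
    return (f"Let me reason step by step about: {prompt}", "42")

def build_sft_example(prompt: str) -> dict:
    reasoning, answer = teacher_generate(prompt)
    # The student is fine-tuned to imitate the full reasoning trace
    # plus the answer, so it learns the reasoning pattern by SFT
    # rather than by running RL itself.
    target = f"<think>{reasoning}</think>\n{answer}"
    return {"prompt": prompt, "completion": target}

sft_dataset = [build_sft_example(p) for p in ["What is 6 * 7?"]]
print(sft_dataset[0]["completion"])
```

The key design point is that the student's training signal is ordinary next-token supervision on the teacher's traces, which is far cheaper than a fresh RL run.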

In the paper, DeepSeek experimentally distilled models at multiple scales, including 1.5B, 7B, 8B, 14B, 32B, and 70B, systematically verifying this "cross-scale effectiveness".

Compared with models of the same size, the distilled models show across-the-board performance improvements.

An important phenomenon can be observed: the reasoning ability is not "locked" in large models but can be migrated to small models through data.

In terms of training costs, DeepSeek-R1-Zero used 64×8 H800 GPUs, with an overall training time of about 198 hours.

In the DeepSeek-R1 training stage, the same GPU configuration was used, and training was completed in about 4 days, approximately 80 hours.

In addition, building the supervised fine-tuning (SFT) dataset consumed about 5,000 GPU-hours in total.

A total of $294,000 was spent. For details, please refer to Table 7.
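The figures above roughly add up. Assuming a rental rate of about $2 per H800 GPU-hour (my assumption for this back-of-the-envelope check; the authoritative breakdown is the paper's Table 7):

```python
# Back-of-the-envelope check of the reported ~$294K training cost.
GPUS = 64 * 8   # 512 H800 GPUs
RATE = 2.0      # USD per GPU-hour (assumed rental rate)

r1_zero = GPUS * 198 * RATE   # R1-Zero: ~198 wall-clock hours
r1      = GPUS * 80 * RATE    # R1: ~80 wall-clock hours
sft     = 5_000 * RATE        # SFT data construction: ~5,000 GPU-hours

total = r1_zero + r1 + sft
print(total)  # 294672.0, close to the reported ~$294,000
```

Under this assumed rate, the three stages land within a few hundred dollars of the reported total, which suggests the headline number is simply GPU-hours times rental price.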

Some netizens said that it's time for Alex Wang to apologize. All the evidence is here.