Liang Wenfeng has just published a paper in Nature.
Last night, DeepSeek made history once again!
According to a September 18th report by Zhidx, a research paper on the DeepSeek-R1 reasoning model, completed by the DeepSeek team with Liang Wenfeng as corresponding author, made the cover of the authoritative international journal Nature on September 17th.
The DeepSeek-R1 paper revealed for the first time the important finding that the reasoning ability of large models can be elicited through reinforcement learning alone, inspiring AI researchers worldwide. The model has also become the most popular open-source reasoning model in the world, with over 10.9 million downloads on Hugging Face. The recognition from Nature is well deserved.
Meanwhile, DeepSeek-R1 is also the world's first mainstream large language model to undergo peer review. Nature noted approvingly in its editorial that almost no mainstream large model has undergone independent peer review, and that "this gap has finally been filled by DeepSeek."
Nature argues that unsubstantiated claims and hype have become "commonplace" in the AI industry, and that what DeepSeek has done is "a welcome step towards transparency and reproducibility."
Nature's cover title: "Self-help: reinforcement learning teaches large models to self-improve"
The new version of the DeepSeek-R1 paper published in Nature differs significantly from the non-peer-reviewed version released in January this year. It discloses more details about model training and directly addresses the doubts about distillation that arose when the model was first released.
The DeepSeek-R1 paper published in Nature magazine
In the 64-page peer review document, DeepSeek stated that all of the data used to train DeepSeek-V3 Base (the base model of DeepSeek-R1) comes from the Internet. Although it may contain results generated by GPT-4, this was not intentional, and there was no dedicated distillation step.
In the supplementary materials, DeepSeek also detailed its process for reducing data contamination during training, to demonstrate that benchmark tests were not deliberately included in the training data to inflate model performance.
In addition, DeepSeek conducted a comprehensive safety evaluation of DeepSeek-R1, showing that its safety is ahead of frontier models released in the same period.
Nature argues that as AI technology becomes more widespread, unverifiable claims from large-model vendors may pose real risks to society, and that peer review by independent researchers is an effective way to curb excessive hype in the AI industry.
Paper link:
https://www.nature.com/articles/s41586-025-09422-z#code-availability
Peer review report:
https://www.nature.com/articles/s41586-025-09422-z#MOESM2
Supplementary materials:
https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-025-09422-z/MediaObjects/41586_2025_9422_MOESM1_ESM.pdf
01.
The new version of the paper discloses several important pieces of information,
including a comprehensive safety evaluation of R1
Before understanding the changes in the new version of the paper, it is necessary to review the core content of the DeepSeek-R1 paper.
DeepSeek-R1's research starting point was a major problem plaguing the AI industry at the time. Reasoning is known to enhance the capabilities of large language models, but teaching models chain-of-thought trajectories from annotated data during post-training relies heavily on human annotation, which limits scalability.
DeepSeek attempted to let the model develop reasoning ability through self-evolution driven by reinforcement learning. Starting from DeepSeek-V3 Base, DeepSeek used GRPO (Group Relative Policy Optimization) as the reinforcement learning framework, with the correctness of the model's final answer against the ground truth as the only reward signal and no constraints imposed on the reasoning process. The result was DeepSeek-R1-Zero.
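To make this concrete, here is a minimal sketch of what such a correctness-only, rule-based reward could look like. The function names and the \boxed{...} answer convention are illustrative assumptions, not the paper's actual implementation:

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the model's final answer out of a completion.

    Assumes (for illustration) that the prompt template asks the model
    to wrap its final answer in \\boxed{...}, a common convention on
    math benchmarks.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return match.group(1).strip() if match else None

def correctness_reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the final answer matches the reference,
    0.0 otherwise. The reasoning trace itself is never scored, leaving
    the model free to evolve its own chain-of-thought strategies.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer == ground_truth else 0.0

# GRPO samples a group of completions per prompt and computes
# group-relative advantages from rewards like these.
completions = [
    "... step-by-step derivation ... \\boxed{42}",
    "... a flawed derivation ... \\boxed{41}",
]
print([correctness_reward(c, "42") for c in completions])  # [1.0, 0.0]
```

Because only the final answer is rewarded, longer or shorter reasoning earns no direct credit; any growth in answer length emerges from the training dynamics rather than from the reward itself.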
Through reinforcement learning, DeepSeek-R1-Zero mastered progressively better reasoning strategies and tended to generate longer answers, each containing verification, reflection, and the exploration of alternative solutions.
DeepSeek-R1-Zero's accuracy increases with the length of its reasoning, and overall answer length also grows over the course of training
Building on DeepSeek-R1-Zero, DeepSeek developed DeepSeek-R1 through multi-stage training that combines RL, rejection sampling, and supervised fine-tuning, giving the model strong reasoning ability while better aligning it with human preferences. The team also distilled and publicly released smaller models, providing the research community with usable resources and promoting the development and application of chain-of-thought reasoning models.
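The rejection-sampling stage of such a pipeline can be illustrated with a simplified sketch like the one below, which reuses the `correctness_reward` function from the earlier sketch. The sampling count and the `generate` callable interface are assumptions for illustration, not the paper's exact recipe:

```python
from typing import Callable

def rejection_sample_sft(prompt: str, ground_truth: str,
                         generate: Callable[[str], str],
                         n_samples: int = 16) -> list[dict]:
    """Draw several completions from the current policy and keep only
    those whose final answer is correct; the survivors become
    supervised fine-tuning examples for the next training stage.
    """
    kept = []
    for _ in range(n_samples):
        completion = generate(prompt)  # one sampled completion
        if correctness_reward(completion, ground_truth) == 1.0:
            kept.append({"prompt": prompt, "completion": completion})
    return kept
```

The design intuition is that the model filters its own outputs: only verified-correct reasoning traces are fed back as fine-tuning data, so quality compounds across stages.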
Beyond these core research results, the latest version of the paper and its accompanying materials add substantial supplementary information, giving the outside world a deeper understanding of how the model was trained and how it operates.
Benchmark data contamination is an extremely sensitive issue. If a vendor includes benchmark questions and their answers in training data, whether intentionally or not, the model is likely to score abnormally high on the corresponding tests, undermining the fairness of benchmark comparisons.
DeepSeek revealed that, to prevent benchmark data contamination, it applied comprehensive decontamination measures to both the pre-training and post-training data of DeepSeek-R1. Taking mathematics as an example, in the pre-training data alone the decontamination process identified and removed approximately six million potentially contaminated texts.
During post-training, all mathematics-related data came from competitions held before 2023, and the same filtering strategy as in pre-training was applied, ensuring that the training data did not overlap with the evaluation data at all. These measures help ensure that evaluation results reflect the model's problem-solving ability rather than its memory of test data.
However, DeepSeek also acknowledged that this decontamination method cannot catch rewritten versions of test questions, so some benchmarks released before 2024 may still be affected by contamination.
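A common way to implement this kind of decontamination is n-gram overlap filtering; the sketch below shows the general idea. The 10-gram window and whitespace tokenization are illustrative assumptions, not DeepSeek's disclosed parameters:

```python
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document: str, benchmark_index: set, n: int = 10) -> bool:
    """Flag a training document that shares any n-gram with a benchmark
    question. Real pipelines typically add normalization and fuzzy
    matching on top of exact overlap."""
    return not ngrams(document, n).isdisjoint(benchmark_index)

# Build the index from benchmark questions, then filter the corpus.
benchmark_questions = ["What is the sum of the first 100 positive integers?"]
index = set().union(*(ngrams(q) for q in benchmark_questions))
corpus = ["Recall: what is the sum of the first 100 positive integers? ..."]
clean_corpus = [doc for doc in corpus if not is_contaminated(doc, index)]
print(len(clean_corpus))  # 0: the document tripped the filter
```

Exact matching like this is precisely what cannot catch paraphrased or rewritten test questions, which is the residual risk DeepSeek acknowledges above.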
DeepSeek also added a comprehensive safety report for DeepSeek-R1. It notes that DeepSeek-R1's service deployment includes an external risk-control system, which not only identifies unsafe conversations through keyword matching but can also pass them to DeepSeek-V3 directly for risk review to decide whether to refuse a response. DeepSeek recommends that developers deploy a similar risk-control system when serving DeepSeek-R1.
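As a rough illustration of such a two-stage guardrail, the hypothetical sketch below screens by keywords first and falls back to a model-based review. The keyword list, the review prompt, and the `moderation_model` interface are all invented for illustration:

```python
BLOCKED_PHRASES = {"build a bomb", "synthesize a nerve agent"}

def keyword_screen(message: str) -> bool:
    """Stage 1: cheap verbatim phrase matching."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)

def model_review(message: str, moderation_model) -> bool:
    """Stage 2: ask a general LLM (the article says DeepSeek-V3 plays
    this role in the deployed system) for a safety verdict."""
    verdict = moderation_model(
        "Answer UNSAFE or SAFE only. Is the following request unsafe?\n"
        + message
    )
    return verdict.strip().upper().startswith("UNSAFE")

def should_refuse(message: str, moderation_model) -> bool:
    """Refuse if either stage flags the conversation."""
    return keyword_screen(message) or model_review(message, moderation_model)
```

The two stages trade off cost and coverage: keyword matching is nearly free but brittle, while the model-based review catches rephrasings at the price of an extra inference call.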
On public safety benchmarks and in internal safety studies, DeepSeek-R1 outperformed frontier models such as Claude-3.7-Sonnet and GPT-4o on most tests. Although the open-source deployment is less safe than the version protected by the external risk-control system, it still provides a moderate level of safety.
When DeepSeek-R1 was first released, rumors circulated that the model had been distilled from OpenAI's models, and the question also appeared among the reviewers' queries.
DeepSeek answered directly: the pre-training data of DeepSeek-V3 Base all comes from the web and reflects a natural data distribution. "It may contain content generated by advanced models (such as GPT-4)," but DeepSeek-V3 Base had no "cooling" stage of large-scale supervised distillation on synthetic datasets.
The data cutoff for DeepSeek-V3 Base was July 2024, before any advanced reasoning model had been publicly released, further reducing the possibility of unintentional distillation from existing reasoning models.
More importantly, the core contribution of the DeepSeek-R1 paper, R1-Zero, involves no distillation from advanced models: its reinforcement learning component was trained independently, without relying on the outputs or guidance of GPT-4 or models of comparable capability.
02.
The R1 paper creates a new paradigm for large-model research;
Nature praises it for filling a long-standing gap
In its editorial, Nature analyzed in detail the value of DeepSeek-R1's having gone through the full peer review process and made it into the journal.
Large models are rapidly changing the way humans acquire knowledge, yet the most mainstream large models have not undergone independent peer review in research journals, a serious gap.
Peer-reviewed publication helps clarify how large models work and helps the industry assess whether a model's performance matches its maker's claims.
DeepSeek changed this situation. It submitted the DeepSeek-R1 paper to Nature on February 14th this year; the paper was not accepted until July 17th and was officially published on September 17th.
During this process, eight external experts participated in the peer review, evaluating the originality, methods, and robustness of this work. In the final published version, both the review report and the authors' responses were disclosed.
Zhidx studied the review comments and the authors' responses in depth. The document runs to 64 pages, nearly three times the length of the paper itself.
Cover of DeepSeek's peer review materials
The eight reviewers raised more than a hundred specific comments, ranging from details such as the singular and plural forms of words, to warnings against "anthropomorphizing" AI in the paper, to concerns about data contamination and model safety.
For example, in the comments below, one reviewer keenly noticed the ambiguity of the phrase "open-sourcing DeepSeek-R1-Zero" and reminded DeepSeek that the definition of "open source" remains contested, so such wording demands extra care.
The same reviewer also asked DeepSeek to attach links to the SFT and RL data in the paper rather than providing only data samples.
Some modification comments from a reviewer
DeepSeek responded seriously to every question the reviewers raised; the sections and supplementary information described above were added at the reviewers' suggestion.
Although DeepSeek released a technical report on DeepSeek-R1 in January this year, Nature notes that such self-published technical documents can diverge considerably from the actual situation.
In peer review, by contrast, external experts do not passively receive information: under the coordination of an independent third party, the editor, they can collectively raise questions and require the authors to supply additional information.
Peer review improves the clarity of a paper and ensures that the authors argue their claims soundly. The process does not necessarily change the substance of the article, but it strengthens the credibility of the research; for AI developers, it makes their work more solid and persuasive.
03.
Conclusion: DeepSeek's open-source model
may become a template for the industry
As a representative of Chinese open-source AI models going global, DeepSeek-R1 enjoys an excellent reputation in the global open-source community. With this Nature cover, DeepSeek has disclosed more information about the model, offering the open-source community research references, ideas for reproduction, and application support.
Nature has called on more AI companies to submit their models for peer review so that their claims can be verified and clarified. Against this backdrop, DeepSeek's open-source model not only demonstrates the technical strength of Chinese AI but may also become a global reference for research transparency in the AI industry.
This article is from the WeChat official account "Zhidx" (ID: zhidxcom), written by Chen Junda and edited by Li Shuiqing. It is published by 36Kr with authorization.