Today, the research behind DeepSeek's large language model, DeepSeek-R1, was published as the cover article of the leading scientific journal Nature.
This Chinese AI model, trained for roughly $300,000 compared with OpenAI models costing tens of millions of dollars, not only once shook the US stock market but now also graces the latest cover of Nature.
Comments on the Nature cover
The cover article is R1's technical paper, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", which DeepSeek first posted on arXiv at the beginning of the year.
List of paper authors. Wenfeng Liang is the corresponding author
Although it is broadly similar to the version released at the beginning of the year, the Nature paper adds a substantial amount of detail.
The main text runs only 11 double-column pages, but the supplementary materials stretch to 83 pages, and the peer-review file, the record of the exchange between the reviewers and the DeepSeek team on specific issues in the paper (commonly called the rebuttal), runs to 64 pages.
These newly released materials give a detailed view of how DeepSeek R1 was trained. The team also disclosed, for the first time, that the core cost of training R1's reasoning ability was only $294,000.
In the peer-review documents, DeepSeek also answers questions such as whether R1's success relied on "distillation", that is, on "plagiarizing" the outputs of stronger models such as OpenAI's.
We did not deliberately include content generated by OpenAI. All training data was collected through web scraping.
Why did DeepSeek make it onto the Nature cover?
You may wonder why DeepSeek R1, which is not the most powerful large language model in the world, made the cover of Nature.
Nature is among the most influential journals in the world. The so-called CNS of science and engineering refers to Cell, Nature, and Science, as shown in the figure above, and the cover of Nature represents the top of the top.
In the AI industry, unlike CVPR, the top conference in computer vision and pattern recognition (ranked second in the figure above), a Nature cover carries a special symbolic weight: it is not only an endorsement of the research but something close to the highest recognition the scientific community can confer.
In the past few years, OpenAI, Anthropic, and Google have all released various technical reports, but none of them has submitted their large models for peer review. The reasons are simple:
On the one hand, peer review means disclosing more details, which may involve trade secrets.
On the other hand, many claims about large models are easily questioned, and peer review requires you to provide evidence and accept external inquiries.
This time, DeepSeek submitted the R1 model to the academic review system, had it examined point by point by eight independent experts, and made both the review comments and the authors' responses public.
This not only affirms the scientific value of R1 but also sets a new benchmark for the entire industry: large models need not remain corporate black boxes; they can stand up to professional scientific scrutiny.
It is a landmark moment in making AI more scientific, and an important reason DeepSeek made the Nature cover.
Lewis Tunstall, a machine-learning engineer at the open-source AI platform Hugging Face who served as one of the reviewers, said:
This is a very welcome precedent. Without a norm of publicly sharing most of this process, it would be very hard to assess whether these systems pose risks.
Nature has also published a special article calling on other companies to submit their large language models for peer review.
In this article, Nature's editors specifically highlighted the benefits of peer review.
Relying on peer review by independent researchers is a way to calm the hype in the artificial intelligence industry.
Unlike the technical reports and blog posts we usually see (known in the industry as model cards or system cards), peer review does not passively accept information; it requires authors to substantiate their claims. Think of the launch events for various large language models, where every vendor claims its model ranks first on some benchmark.
Peer review, by contrast, restrains AI developers from cherry-picking the benchmarks that best showcase their models and "grading their own homework", because benchmarks can be gamed to overstate a model's performance.
Below are some key Q&As we have selected from the peer-review documents.
Q: The base model (DeepSeek-V3-Base) may have been exposed to a large amount of reasoning data generated by other models (such as OpenAI's models) during the pre-training phase, leading to an overestimation of the effectiveness of RL.
A: We selected the Qwen2-7B model, which was released before any advanced reasoning models were publicly available, as the base model. The experimental results show that after being trained with our pure reinforcement learning method, the reasoning ability of Qwen2-7B-Zero far exceeds that of its original version and the contemporary GPT-4o model.
This experiment strongly demonstrates that our RL framework can independently stimulate advanced reasoning abilities in a clean base model, rather than simply replicating patterns in the pre-training data.
Q: Related to, but distinct in nature from, the contamination assessment: we want to know whether there are any cases where the data may have been generated by other companies' models, as media reports have suggested.
Data obtained directly or indirectly from benchmark data or the Internet and used in the training or reinforcement-learning datasets could contain content generated by OpenAI's models or those of other providers.
This would make DeepSeek's model a form of "distillation" of OpenAI's models.
A: We are aware that model distillation is a widely discussed topic in the development of DeepSeek models.
We acknowledge that the web data collected during pre-training may contain content generated by advanced models (such as GPT-4). Given how widespread synthetic content now is on the Internet, however, this is unavoidable in current large-scale language-model training.
However, the core contribution of this paper, R1-Zero, does not involve any distillation from advanced models. The reinforcement-learning component is trained independently and does not rely on the outputs or guidance of models such as GPT-4.
Link to the full peer-review document 🔗: https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-025-09422-z/MediaObjects/41586_2025_9422_MOESM2_ESM.pdf
It withstood review because the technology is strong enough
Beyond being the first large language model to undergo independent peer review, DeepSeek R1 also brings notable technical breakthroughs of its own.
The core contribution of DeepSeek-R1 is its demonstration that pure reinforcement learning (RL) can effectively elicit reasoning ability in LLMs, allowing them to learn to reason without relying on human-annotated reasoning trajectories.
Reinforcement-learning framework
Traditionally, improving a large model's reasoning has meant manually supplying large numbers of chains of thought for the model to imitate. This approach has two problems: manual annotation is costly and hard to scale, and it is bounded by human thinking, so the model only learns human routines and finds it hard to explore new reasoning paths.
R1's approach is completely different. The model is given only a reward signal, essentially "gain points for a correct answer, lose points for a wrong one", and is left to explore on its own, with no intermediate reasoning steps prescribed.
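To make the idea concrete, here is a minimal sketch of an outcome-only, rule-based reward in Python. It is only an illustration, not DeepSeek's actual implementation: the <answer>-tag extraction, the +1/-1 scoring, and the exact-match comparison against a reference answer are assumptions made for the example.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final answer out of <answer>...</answer> tags, if present."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Score only the outcome: +1 for a correct final answer, -1 otherwise.
    No intermediate reasoning step is inspected or rewarded."""
    answer = extract_final_answer(completion)
    if answer is None:
        return -1.0  # missing or malformed answer counts as wrong
    return 1.0 if answer == reference_answer.strip() else -1.0

# The reward sees only the outcome, not how the model got there.
sample = "<think>3 * 7 = 21, plus 2 gives 23</think><answer>23</answer>"
print(outcome_reward(sample, "23"))  # 1.0
```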
As a result, during training R1 began to exhibit behaviors resembling self-reflection, verification, and dynamic adjustment; for example, it would say "Wait, I need to recheck this step" while answering. These reflective passages are the so-called emergent reasoning ability.
Benchmark performance of DeepSeek-R1 and DeepSeek-R1-Zero compared with human scores on different datasets
On public benchmarks, R1 achieved 77.9% accuracy on the 2024 American Invitational Mathematics Examination (AIME), far above the average human level, and it even outperformed GPT-4 on some code and science-reasoning tasks.
In the far more detailed supplementary materials, DeepSeek disclosed R1's training details, the specific path by which R1 evolved from R1-Zero, and a comprehensive evaluation covering multilingual ability, safety and risk control, stability, and more.
Link to the supplementary materials 🔗 (the corresponding author is also Wenfeng Liang): https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-025-09422-z/MediaObjects/41586_2025_9422_MOESM1_ESM.pdf
Since R1 was developed back in January, the contents of the report may not reflect DeepSeek's, or the industry's, latest methods.
Still, this detailed report shows how R1 was built and how it acquired the "Well, let me think about this first" style of reasoning that everyone enjoys.
R1-Zero: a model that pushes reasoning to the extreme
DeepSeek R1's predecessor is DeepSeek R1-Zero, a model that pushes reasoning to the extreme and was, in effect, allowed to grow wild.
Training of R1-Zero starts from the DeepSeek-V3 Base model, a Mixture-of-Experts (MoE) model with 671 billion total parameters (37 billion activated per token), pre-trained on large volumes of Chinese and English web pages and e-book data.
Traditional supervised fine-tuning requires manually providing specific reasoning trajectories. The figure shows an example of an SFT trajectory from code-related reasoning data.
Unlike the conventional first step of large-model fine-tuning, supervised fine-tuning (SFT), DeepSeek skipped this stage entirely. Their hypothesis was that training the model from the start on standardized, human-written solution steps would narrow its exploration space and cap its performance at the ceiling of human cognition.
Pure Reinforcement Learning (Pure RL)
The research team designed an extremely simple reinforcement-learning framework for the model, telling it only the most essential rules.
Task format: The model is required to output in a fixed format, that