The DeepSeek-R1 paper has made it onto the cover of Nature, with Liang Wenfeng as the corresponding author.
What a surprise!
But it's well-deserved!
The cover of the latest issue of Nature features the research on DeepSeek-R1.
The cover study is the paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", which DeepSeek first posted on arXiv in January this year. The corresponding author of the Nature paper is Liang Wenfeng.
Paper link: https://www.nature.com/articles/s41586-025-09422-z
In the cover blurb, Nature writes:
If large models can be trained to plan out the steps needed to solve a problem, they tend to solve it better. This kind of "reasoning" resembles the way humans work through more complex problems, but it is a major challenge for artificial intelligence and has typically required human intervention to supply labels and annotations. In this week's issue, researchers at DeepSeek reveal how they trained a model to reason with minimal human input.
The DeepSeek-R1 model is trained with reinforcement learning: the model receives a high reward when it answers a mathematical problem correctly and is penalized when it answers incorrectly. As a result, it learns to reason, working through problems step by step and exposing those steps, which makes it more likely to arrive at the correct answer. This lets DeepSeek-R1 self-verify and self-reflect, checking its own work before answering new problems, and thereby improves its performance on programming and graduate-level science problems.
In the same issue, Nature also speaks highly of DeepSeek-R1 being released as an open model.
It is worth noting that R1 is considered the first large language model to pass peer review in an authoritative academic journal.
Lewis Tunstall, a machine learning engineer at Hugging Face and one of the paper's reviewers, said: "This is a very welcome precedent. Without industry norms for publicly sharing most of the R&D process, it will be difficult for us to assess the potential risks of these systems."
In response to the review comments, the DeepSeek team not only toned down anthropomorphic descriptions of the model in the paper but also added technical details about the types of training data and about safety. Huan Sun, an AI researcher at Ohio State University, commented: "Going through rigorous peer review is an effective way to verify the soundness and practical value of a model. Other companies should follow this example."
The AI industry today is awash in dazzling press-conference demos and constantly refreshed leaderboard scores.
But as the article points out, benchmark tests can be "gamed." Submitting a model's design, methodology, and limitations to independent external experts for review is an effective way to strip out the hype.
Peer review acts as an impartial gatekeeper. It pushes AI companies away from simply promoting their own products and toward backing their claims with solid evidence and reproducible processes.
So while the DeepSeek-R1 paper has scientific value of its own, its procedural value, as the first LLM to undergo and pass peer review at a mainstream journal, may prove even more significant.
It is foreseeable that bringing LLMs into independent peer review will be a key step in turning a technology race into a scientific discipline, one that matters for curbing industry hype and building public trust.
Next, let's revisit this groundbreaking research. For additional details, the Nature paper itself is well worth a careful read:
The multi-stage pipeline of DeepSeek-R1
Previous research relied mainly on large amounts of supervised data to improve model performance. The DeepSeek team opened up a different line of thinking: even without supervised fine-tuning (SFT) as a cold start, large-scale reinforcement learning can significantly improve a model's reasoning ability, and the effect is even better when a small amount of cold-start data is added.
To achieve this, they developed DeepSeek-R1-Zero. Specifically, it rests on three distinctive design choices:
First is the use of Group Relative Policy Optimization (GRPO) to reduce training costs. GRPO forgoes a critic model of the same size as the policy model and instead estimates the baseline directly from the scores of a group of sampled outputs.
Second is the reward design, which determines the direction of RL optimization. DeepSeek's solution is two complementary rule-based rewards: an accuracy reward and a format reward.
Third is the training template. Building on GRPO and the reward design, the team used a simple template (Table 1 in the paper) to guide the base model: DeepSeek-R1-Zero must first give its reasoning process and then its final answer. The template standardizes only this basic structure and imposes no restrictions or biases on content, such as forcing reflective reasoning or a specific problem-solving method, so the model's progress under RL can be observed cleanly. A minimal sketch of how these pieces fit together follows.
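To make this concrete, here is a minimal Python sketch of how a rule-based accuracy-plus-format reward and GRPO's group-relative baseline might fit together. The template wording, the <think>/<answer> tags, and the helper functions are illustrative assumptions, not DeepSeek's exact implementation.

```python
import re
import statistics

# Illustrative prompt template in the spirit of Table 1: reasoning first, answer second.
# The wording and the <think>/<answer> tags are assumptions, not the paper's verbatim template.
TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks through "
    "the reasoning process and then provides the final answer.\nUser: {question}\nAssistant:"
)

def format_reward(completion: str) -> float:
    """Reward the required structure: a <think> block followed by an <answer> block."""
    ok = re.fullmatch(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
                      completion, flags=re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Rule-based check: does the content of <answer>...</answer> match the reference answer?"""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core idea: the baseline is the mean reward of the sampled group,
    so no separate critic model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a uniform group
    return [(r - mean) / std for r in rewards]

# Toy usage: score a group of sampled completions for one prompt.
prompt = TEMPLATE.format(question="What is 2 + 2?")  # what the policy would actually be asked
completions = [
    "<think>2 + 2 equals 4.</think><answer>4</answer>",
    "<think>Just guessing.</think><answer>5</answer>",
    "4",  # violates the template, so it earns no format reward
]
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in completions]
print(group_relative_advantages(rewards))  # the correct, well-formatted output gets the largest advantage
```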
During the training process, DeepSeek-R1-Zero demonstrated significant self-evolution ability. It learned to generate hundreds to thousands of reasoning tokens, enabling it to explore and refine the thinking process more deeply.
As the training progressed, the model also developed some advanced behaviors, such as the ability to reflect and explore different problem-solving methods. These were not pre-set but emerged naturally in the reinforcement learning environment.
Notably, the team observed an interesting "aha moment": midway through training, DeepSeek-R1-Zero learned to allocate its thinking time more sensibly by re-evaluating its initial approach. This is part of the appeal of reinforcement learning: given the right reward signal, the model can develop advanced problem-solving strategies on its own.
However, DeepSeek-R1-Zero still has some limitations, such as poor readability of answers and mixed languages.
Reinforcement Learning with Cold Start
Unlike DeepSeek-R1-Zero, R1 does not start RL from the raw base model. To keep the base model from going through an unstable cold-start phase early in RL training, the team constructed and collected a small amount of long chain-of-thought (CoT) data and used it to fine-tune the model that serves as the initial RL actor. To collect this data, they explored several approaches: few-shot prompting with long CoT exemplars, directly prompting the model to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
DeepSeek collected thousands of cold-start samples to fine-tune DeepSeek-V3-Base as the starting point for RL. Compared with DeepSeek-R1-Zero, the cold-start data offers two advantages:
Readability: a major limitation of DeepSeek-R1-Zero is that its output is often hard to read; responses may mix multiple languages or lack the markdown formatting that highlights answers for users. When creating the cold-start data for R1, the team instead designed a readable output pattern that includes a summary at the end of each response and filtered out reader-unfriendly responses.
Potential: by carefully designing the cold-start data pattern with human priors, the team observed better performance than DeepSeek-R1-Zero, and they see this kind of iterative training as the better approach for reasoning models.
Reasoning-Oriented Reinforcement Learning
After fine-tuning DeepSeek-V3-Base with cold start data, the development team adopted the same large-scale reinforcement learning training process as DeepSeek-R1-Zero. This stage focuses on enhancing the model's reasoning ability, especially in reasoning-intensive tasks such as coding, mathematics, science, and logical reasoning.
To alleviate language mixing, the team introduced a language consistency reward during RL training, computed as the proportion of target-language words in the CoT. Although ablation experiments show this alignment causes a slight drop in model performance, the reward matches human preferences and makes outputs more readable.
Finally, the team summed the accuracy reward on reasoning tasks and the language consistency reward to form the final reward, then ran RL training on the fine-tuned model until it converged on reasoning tasks.
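As a concrete illustration of that reward, here is a small sketch computing the proportion of target-language words in a CoT and summing it with the accuracy reward. The word-level language check is a crude Unicode-range heuristic assumed for illustration; the paper does not spell out the implementation.

```python
import re

def language_consistency_reward(cot: str, target: str = "en") -> float:
    """Proportion of target-language words in the chain of thought.
    Crude illustrative heuristic: ASCII-alphabetic tokens count as English,
    CJK characters count as Chinese; a real system would use proper language ID."""
    tokens = re.findall(r"[A-Za-z]+|[\u4e00-\u9fff]", cot)
    if not tokens:
        return 0.0
    if target == "en":
        hits = sum(bool(re.fullmatch(r"[A-Za-z]+", t)) for t in tokens)
    else:  # treat any other target as Chinese for this toy example
        hits = sum(bool(re.fullmatch(r"[\u4e00-\u9fff]", t)) for t in tokens)
    return hits / len(tokens)

def final_reward(accuracy: float, cot: str, target: str = "en") -> float:
    """This stage's final reward: task accuracy plus language consistency, summed directly."""
    return accuracy + language_consistency_reward(cot, target)

# A mostly English CoT with one stray Chinese word scores just below the maximum of 2.0.
print(final_reward(1.0, "First compute the 导数 of f, then set it to zero."))
```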
Rejection Sampling and Supervised Fine-Tuning
Once the reasoning-oriented reinforcement learning converges, the team uses the resulting checkpoint to collect SFT (supervised fine-tuning) data for the next round. This stage also incorporates data from other domains to strengthen the model's abilities in writing, role-playing, and other general tasks.
The team curated reasoning prompts and generated reasoning trajectories by rejection sampling from the checkpoint of the reinforcement learning run above. At this stage the dataset was expanded by merging in additional data, some of which used a generative reward model: the ground truth and the model's prediction were fed into DeepSeek-V3 for judgment.
In addition, the team filtered out chains of thought containing mixed languages, long paragraphs, or code blocks. For each prompt they sampled multiple responses and kept only the correct ones, ultimately collecting roughly 600,000 reasoning-related training samples.
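The filtering step can be pictured with a short sketch. Here `generate` stands in for sampling from the RL checkpoint and `is_correct` for the correctness check (rule-based or DeepSeek-V3 as judge); both of these, along with the specific filter thresholds, are assumptions for illustration.

```python
import re
from typing import Callable

def rejection_sample_sft_data(
    prompts: list[str],
    ground_truths: dict[str, str],
    generate: Callable[[str, int], list[str]],   # stand-in: sample N responses from the RL checkpoint
    is_correct: Callable[[str, str], bool],      # stand-in: rule-based check or DeepSeek-V3 as judge
    samples_per_prompt: int = 16,
) -> list[dict]:
    """Keep only correct, reader-friendly responses for the next round of SFT."""
    kept = []
    for prompt in prompts:
        for response in generate(prompt, samples_per_prompt):
            if not is_correct(response, ground_truths[prompt]):
                continue  # discard incorrect answers
            has_cjk = re.search(r"[\u4e00-\u9fff]", response)
            has_latin = re.search(r"[A-Za-z]{3,}", response)
            if has_cjk and has_latin:
                continue  # crude mixed-language filter (illustrative)
            if "`" * 3 in response:
                continue  # drop responses containing fenced code blocks
            if max(len(p) for p in response.split("\n\n")) > 2000:
                continue  # drop overly long paragraphs (threshold is illustrative)
            kept.append({"prompt": prompt, "response": response})
    return kept
```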
Reinforcement Learning for All Scenarios
To further align the model with human preferences, a second reinforcement learning stage is applied, aimed at improving the model's helpfulness and harmlessness while continuing to refine its reasoning ability.
Specifically, the researchers trained the model with a combination of reward signals and diverse prompt distributions. For reasoning data, they followed the DeepSeek-R1-Zero recipe, using rule-based rewards to guide learning in mathematics, code, and logical reasoning; for general data, they used a reward model to capture human preferences in complex and nuanced scenarios.
Combining these reward signals with diverse data distributions yields a model that excels at reasoning while prioritizing helpfulness and harmlessness.
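Conceptually, this stage can be thought of as routing each training prompt to the appropriate reward source. The sketch below is only an outline under that assumption; `rule_based_reward` and `preference_model_score` are hypothetical helpers, not DeepSeek's actual components.

```python
from typing import Callable, Optional

def combined_reward(
    prompt_type: str,                  # "reasoning" or "general", tagged in the training data
    prompt: str,
    response: str,
    ground_truth: Optional[str],
    rule_based_reward: Callable[[str, str], float],       # accuracy/format checks for math, code, logic
    preference_model_score: Callable[[str, str], float],  # learned reward model for helpfulness/harmlessness
) -> float:
    """Route reasoning prompts to rule-based rewards and general prompts to a reward model."""
    if prompt_type == "reasoning" and ground_truth is not None:
        return rule_based_reward(response, ground_truth)
    return preference_model_score(prompt, response)
```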
Distillation: Enabling Small Models with Reasoning Ability
To give smaller, more efficient models the reasoning ability of DeepSeek-R1, the team directly fine-tuned open-source models such as Qwen and Llama on the 800,000 samples curated with DeepSeek-R1. The results show that this straightforward distillation method significantly improves the reasoning ability of small models.
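In code, this distillation amounts to standard supervised fine-tuning of a smaller student on R1-curated prompt/response pairs. The sketch below shows the core loss computation with the prompt tokens masked out; the model checkpoint name and data format are placeholder assumptions, not the exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder student; the paper distills into open Qwen and Llama models of various sizes.
model_name = "Qwen/Qwen2.5-1.5B"  # illustrative choice, not necessarily the exact checkpoint used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(prompt: str, response: str) -> float:
    """One supervised step: learn to reproduce the R1-generated response given the prompt,
    with the loss masked over the prompt tokens so only the reasoning and answer are trained."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # positions set to -100 are ignored by the loss
    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Each of the ~800,000 samples pairs a prompt with an R1-generated reasoning trace and answer, e.g.:
# sft_step("Solve: what is 2 + 2?", " <think>2 plus 2 equals 4.</think> The answer is 4.")
```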
Thanks to these combined innovations, extensive benchmark testing by the team shows that DeepSeek-R1 delivers performance on par with the industry's state-of-the-art reasoning models; see the results reported in the paper for specifics.
For more technical details, please refer to the original paper.
This article is from the WeChat official account "Machine Intelligence" (ID: almosthuman2014), an account focused on AI, and is republished by 36Kr with authorization.