
DeepSeek appeared on the cover of Nature: Liang Wenfeng led the team in responding to the doubts, and training R1 actually cost just $294,000.

新智元 | 2025-09-18 09:27
DeepSeek's appearance on the cover of Nature is truly well-deserved! In January this year, Liang Wenfeng led the R1 project and pioneered a new paradigm for AI reasoning: pure RL alone can unlock the reasoning capabilities of LLMs. Nature also published a special commentary praising the work.

Just now, DeepSeek-R1 made it onto the cover of Nature!

In January this year, the paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" was published, and now it has successfully made it onto the cover of the world's top journal.

Led by corresponding author Liang Wenfeng, the team used RL to open up a brand-new path for the reasoning ability of large models.

Paper link: https://www.nature.com/articles/s41586-025-09422-z

In the cover recommendation, Nature spared no praise for the achievements of DeepSeek-R1.

After being open-sourced, R1 became the most popular model on Hugging Face, with over 10.9 million downloads. Crucially, it is also the world's first mainstream large model to undergo peer review.

Notably, the supplementary materials disclose R1's training cost for the first time: $294,000, an astonishingly low figure.

Even after adding the roughly $6 million cost of the base model, the total is still far lower than what OpenAI and Google spend to train their AI models.

From an arXiv paper to the cover of Nature, the DeepSeek team has once again paved the way for the future of AI reasoning with their strength.

R1 is considered the first mainstream LLM to go through the peer-review process.

Reviewer Lewis Tunstall said:

This is a very welcome precedent. If most of the R&D process is not shared publicly, it will be difficult for us to assess whether these systems pose risks.

In response to peer-review comments, DeepSeek reduced anthropomorphic descriptions and added more technical details, including the types of model training data and its safety performance.

Reviewer Huan Sun said:

Going through a rigorous peer - review process helps verify the effectiveness and practicality of the model, and other companies should follow suit.

The Birth of DeepSeek-R1-Zero

The research team's starting point was bold and pure: to completely break away from the dependence on human reasoning trajectories.

The reasoning patterns defined by humans may actually be a constraint.

They chose a powerful base model, DeepSeek-V3 Base, and skipped the traditional SFT stage.

Instead, they used an extremely simple reinforcement-learning framework, telling the model only two things:

1. Task format: The answer must consist of two parts: the "thinking process" wrapped in <think> tags and the "final answer" wrapped in <answer> tags (see the short sketch just after this list).

2. Reward signal: Rewards are given based on whether the final answer is correct, regardless of the thinking method used.
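As a concrete illustration, here is a minimal Python sketch of what such a completion looks like. The tag names come from the paper; the template string and the worked example are purely illustrative assumptions, not DeepSeek's actual training code.

```python
# Illustrative only: the completion format the model is asked to produce.
# The tag names come from the paper; the template string itself is an assumption.
COMPLETION_TEMPLATE = "<think>{reasoning}</think>\n<answer>{final_answer}</answer>"

example = COMPLETION_TEMPLATE.format(
    reasoning="7 * 6 = 42, so the product is 42.",
    final_answer="42",
)
print(example)
```

Only the content of the <answer> block is scored for correctness; whatever appears inside <think> is left entirely to the model.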

With no judgment of the correctness of intermediate problem-solving steps and no guidance on how to think, DeepSeek-R1-Zero began its "wild growth".

During the entire training process, the reasoning ability of R1-Zero underwent a qualitative leap.

Taking AIME 2024 as an example, its average problem-solving accuracy (pass@1) soared from the initial 15.6% to 77.9%.

Combined with self-consistency decoding, accuracy reaches as high as 86.7%, a result that far exceeds the average level of human contestants in the AIME competition.
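To make the two numbers concrete: pass@1 averages correctness over individual samples, while self-consistency decoding takes a majority vote over several samples per problem. The sketch below uses made-up data to illustrate the difference; it is not the paper's evaluation code.

```python
from collections import Counter

# Hedged sketch of the two evaluation modes, assuming we already have, for each
# problem, several sampled answers plus the ground truth. The data is invented.
samples = {
    "problem_1": {"gold": "42", "answers": ["42", "41", "42", "42"]},
    "problem_2": {"gold": "7",  "answers": ["7", "7", "3", "7"]},
}

def pass_at_1(samples):
    """Average per-sample accuracy: the fraction of individual answers that are correct."""
    correct = total = 0
    for item in samples.values():
        correct += sum(a == item["gold"] for a in item["answers"])
        total += len(item["answers"])
    return correct / total

def self_consistency_accuracy(samples):
    """Majority-vote accuracy: pick the most frequent answer per problem, then score it."""
    hits = 0
    for item in samples.values():
        majority, _ = Counter(item["answers"]).most_common(1)[0]
        hits += (majority == item["gold"])
    return hits / len(samples)

print(pass_at_1(samples), self_consistency_accuracy(samples))
```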

The "Aha Moment" of AI

What's even more fascinating is its self - evolution behavior during the process of ability improvement.

Autonomous increase in "thinking time"

As the training progressed, the length of the text generated by the model within the <think> tags increased steadily.

It spontaneously learned to use longer "chains of thought" to explore and optimize problem - solving strategies. Sometimes, it would generate hundreds or thousands of tokens to deliberate on a single problem.

The emergence of advanced reasoning strategies

The model no longer solved problems linearly, step by step. Instead, it began to exhibit advanced strategies such as "self-reflection" and "systematic exploration of alternative solutions".

It would verify its intermediate steps and even actively explore "what if I use another method?"

An interesting "Aha Moment"

At a certain stage of training, the researchers observed a clear "Aha Moment".

That is, during the reflection process, the frequency of the model using the word "wait" suddenly increased sharply.

This moment marked a significant shift in the reasoning mode of DeepSeek-R1-Zero, clearly revealing its self-evolution process.

This evolution perfectly interprets the charm of reinforcement learning:

There's no need to teach it how to solve problems. Just provide the right incentives, and it will autonomously develop strategies more advanced than those taught by humans.

The Road of DeepSeek-R1

Although DeepSeek-R1-Zero demonstrated god-level reasoning ability, because its training was oriented entirely toward reasoning, it suffered from poor readability, occasional chaotic switching between Chinese and English, and mediocre performance on general tasks such as writing and open-domain question answering.

To solve R1-Zero's problems and make its powerful reasoning ability more broadly applicable, the research team designed a sophisticated multi-stage training process and launched a second-stage "refinement" plan:

1. Cold Start: The model was first fine-tuned on thousands of high-quality examples that follow human conversational habits, to teach it to "speak properly".

2. First-round Reinforcement Learning (RL): Reinforcement learning was applied again, but this time the goal was not only to improve reasoning but also to maintain language consistency and conversational fluency.

3. Large-scale Supervised Fine-Tuning (SFT): The team mixed reasoning data with a large amount of non-reasoning data (such as writing, general question answering, and code engineering) for large-scale supervised fine-tuning. This greatly expanded the model's knowledge and general abilities.

4. Second-round Reinforcement Learning (RL): Finally, a comprehensive round of reinforcement learning was carried out with a more complex reward model to further enhance the model's usefulness and harmlessness and to align its behavior with human preferences. (A rough sketch of the whole recipe follows below.)
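Read as a recipe, the four stages chain together as shown in this purely schematic sketch. Every function here is a trivial stand-in invented for illustration, so the outline runs end to end; it is not DeepSeek's training code.

```python
# Schematic only: stage names follow the list above; the functions are placeholders.
def supervised_finetune(model, dataset):
    return model + [f"SFT on {dataset}"]

def reinforcement_learning(model, rewards):
    return model + [f"RL with rewards {rewards}"]

def r1_recipe(base_model):
    m = supervised_finetune(base_model, "cold-start conversational reasoning data")  # stage 1
    m = reinforcement_learning(m, ["accuracy", "format", "language consistency"])    # stage 2
    m = supervised_finetune(m, "reasoning + writing/QA/code-engineering data")       # stage 3
    m = reinforcement_learning(m, ["usefulness", "harmlessness"])                    # stage 4
    return m

print(*r1_recipe(["DeepSeek-V3 Base"]), sep="\n")
```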

After multiple rounds of training, DeepSeek-R1 not only improved its performance by 17%-25% on benchmarks such as AlpacaEval 2.0 and Arena-Hard, which measure general instruction following and user preference, but also maintained top-level performance on high-difficulty reasoning tasks such as mathematics and programming.

Unveiling the "Alchemy Furnace" of DeepSeek-R1

Next, let's delve into the interior of this "alchemy furnace" to find out.

The GRPO Algorithm

In the field of AI training, the reinforcement-learning algorithm PPO (Proximal Policy Optimization) has long been the "standard vehicle" for training large language models. Although powerful, it is also known for its high resource consumption and complex implementation.

The DeepSeek team chose a smarter approach and used the GRPO (Group Relative Policy Optimization) algorithm as the core driving engine.

PPO is like an extremely cautious coach. In each training update, it strictly limits the degree of deviation of the new policy from the old one to prevent the model from "going astray" and causing the training to collapse.

This caution comes at a cost. It requires a large amount of computation to maintain stability.

On the other hand, GRPO is like a more efficient coach who believes in the "wisdom of the crowd". Its core idea is:

In each training session, the model is made to generate a group (e.g., 16) of different answers for the same question.

Then, instead of simply rewarding the best one, it optimizes the model as a whole based on the "relative quality" of this group of answers.

Specifically, it calculates the "advantage" of each answer relative to the average level of this group of answers. Answers with a greater advantage (i.e., better performance) will receive a greater incentive weight, while those with poor performance will be suppressed.

This "intra - group competition and learning from the best" mechanism simplifies the complex constraint process of PPO, significantly reducing resource consumption and proving to be equally stable and efficient in practice.

Reward Design

The essence of reinforcement learning is to shape the model's behavior through rewards. It determines the direction in which the model will evolve.

For this reason, the DeepSeek team designed a dual-track reward system.

1. Rule-based reward

For reasoning tasks (mathematics, programming, logic), the team adopted an extremely strict rule-based reward system.

Accuracy reward: Is the final answer correct? For math problems, the answer must be exactly the same as the standard answer; for programming problems, the code must pass all preset test cases.

Format reward: Does the thinking process comply with the specifications? All thinking processes must be encapsulated within <think> and </think> tags.
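Putting the two signals together, a rule-based reward can be as simple as the following sketch. The exact-match check and the scoring weights are illustrative assumptions; the paper does not publish this code, and real graders for math answers and code test cases are more involved.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: format compliance plus exact-match accuracy.

    Weights and string normalization are illustrative choices, not DeepSeek's.
    """
    reward = 0.0
    # Format reward: the whole completion must be <think>...</think><answer>...</answer>.
    match = re.fullmatch(r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*",
                         completion, re.DOTALL)
    if match is None:
        return reward  # malformed output: no reward at all
    reward += 0.1      # small bonus for following the required format

    # Accuracy reward: the final answer must exactly match the reference
    # (for code, this would instead mean passing every predefined test case).
    if match.group(2).strip() == reference_answer.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>3 * 4 = 12</think><answer>12</answer>", "12"))  # 1.1
```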

Here, there was a crucial decision: no neural-network-based reward model was used for reasoning tasks.

This was because the team found that during long-term, large-scale reinforcement learning, the model would find and exploit loopholes in the reward model itself, a phenomenon known as "reward hacking".

2. Model-based reward

However, the world is not black and white. For general tasks such as writing and dialogue, there are mostly just differences in quality.

So, the DeepSeek team introduced model-based rewards to make the model more in line with human preferences.

Usefulness reward model: It is specifically responsible for judging whether the model's answer is useful and relevant to the user. It learns human preferences by comparing a large number of pairs of "good answers" and "bad answers" (generated and screened by DeepSeek-V3). Interestingly, it only evaluates the final summary part and does not interfere with the underlying reasoning process, giving the model full freedom to think.

Security reward model: It is responsible for checking the entire output of the model