
A training cost of $294,000: DeepSeek-R1 makes the cover of Nature, and the first mainstream large model to pass peer review at a leading academic journal draws positive feedback.

HyperAI · 2025-09-18 15:49
We look forward to more cutting-edge models following in DeepSeek's footsteps and putting the technical details of their AI models through peer review.

The research findings on DeepSeek-R1 are featured on the cover of Nature. Beyond the technical details, the paper from the first mainstream large model to pass peer review at an authoritative journal also discloses its training costs for the first time.

On September 17, the research findings on DeepSeek-R1 were featured on the cover of Nature, and the news quickly sparked heated discussion across the global academic community. The underlying research had in fact already been posted on arXiv as a preprint in January this year. What makes the publication in Nature significant is that the work has now passed peer review at an authoritative journal: rather than simply receiving one-way information, external experts could, under the supervision and management of an independent third party (the editors), question the author team and request additional information in a collaborative process, a first for the industry.

More importantly, unlike the preprint published in January, which outlined the research methods and DeepSeek-R1's performance on a series of evaluation benchmarks, the formally published paper additionally discloses the model's training costs. According to a report from Nature News, training DeepSeek-R1 cost the equivalent of only $294,000. Although DeepSeek invested roughly $6 million in the underlying LLM on which R1 is built, the total is still far below the tens of millions of dollars generally considered necessary to train a leading model.

* Preprint paper link: https://hyper.ai/cn/papers/2504.07128

Training Costs of DeepSeek-R1

DeepSeek stated that training DeepSeek-R1-Zero used a total of 648 H800 GPUs and took approximately 198 hours, while training DeepSeek-R1 also used 648 H800 GPUs and ran for roughly 80 hours (a little over three days). Constructing the SFT dataset consumed another roughly 5,000 GPU hours. The specific cost breakdown is shown in the accompanying figure.
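As a back-of-the-envelope check, the short sketch below simply totals the GPU hours quoted above; the per-GPU-hour rental rate is a placeholder assumption for illustration and is not taken from the paper, so the result is not expected to reproduce the reported $294,000 exactly.

```python
# Back-of-the-envelope tally of the GPU hours quoted above.
r1_zero_gpu_hours = 648 * 198   # DeepSeek-R1-Zero: 648 H800s for ~198 hours
r1_gpu_hours = 648 * 80         # DeepSeek-R1: 648 H800s for ~80 hours
sft_data_gpu_hours = 5_000      # building the SFT dataset

total_gpu_hours = r1_zero_gpu_hours + r1_gpu_hours + sft_data_gpu_hours

# Hypothetical H800 rental price in USD per GPU-hour; DeepSeek's actual rates
# and accounting differ, so this will not match the reported $294,000 exactly.
assumed_rate = 2.0

print(f"Total GPU hours: {total_gpu_hours:,}")   # ~185,000
print(f"Estimated cost at ${assumed_rate}/GPU-hour: ${total_gpu_hours * assumed_rate:,.0f}")
```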

Large-Scale Reinforcement Learning to Enhance Reasoning Ability

There is no need to belabor the importance of reasoning ability in large models; it has become a key research direction across the industry. However, acquiring reasoning ability during pre-training often demands enormous computing resources. Some studies have shown that the reasoning of LLMs can be effectively strengthened through Chain-of-Thought (CoT) prompting, or further improved by learning from high-quality multi-step reasoning trajectories during post-training. Although effective, these methods still have clear limitations: they rely on manually annotated reasoning traces, which limits scalability and introduces cognitive biases. Moreover, because the model is constrained to imitate human thinking patterns, its performance is essentially capped by the examples humans provide, and it cannot explore better reasoning paths beyond those patterns.
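For readers less familiar with CoT prompting, the minimal sketch below illustrates the idea: the prompt explicitly asks the model to lay out intermediate steps before the final answer. The question, the prompt wording, and the `call_llm` helper are illustrative placeholders, not part of DeepSeek's pipeline.

```python
# Minimal illustration of Chain-of-Thought (CoT) prompting.
# `call_llm` is a hypothetical stand-in for any chat-completion API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect this to an LLM provider of your choice")

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct prompt: the model may answer immediately, with no visible reasoning.
direct_prompt = f"Question: {question}\nAnswer:"

# CoT prompt: the model is asked to spell out intermediate steps first.
cot_prompt = (
    f"Question: {question}\n"
    "Let's think step by step, then give the final answer on its own line."
)
# answer = call_llm(cot_prompt)
```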

In response, DeepSeek started from DeepSeek-V3 Base, adopted Group Relative Policy Optimization (GRPO) as its RL framework, and skipped the conventional supervised fine-tuning (SFT) phase that usually precedes RL training. This design choice stems from the team's hypothesis: human-defined reasoning patterns may limit the model's exploration, whereas unconstrained RL training can better encourage new reasoning abilities to emerge in LLMs.

On this basis, the team developed DeepSeek-R1-Zero, which exhibits diverse and sophisticated reasoning behaviors. To solve reasoning problems, the model tends to generate longer responses, embedding verification, reflection, and the exploration of alternative solutions in each one. Although the team never explicitly taught the model how to reason, it nonetheless learned better reasoning strategies through RL. GRPO, the algorithm the team adopted, was originally proposed to simplify training and reduce the resource consumption of Proximal Policy Optimization (PPO): it requires no critic model of the same size as the policy model and instead estimates the baseline directly from group scores.
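A minimal sketch of that group-relative idea, following the standard GRPO formulation: sample a group of responses to the same question, score them, and use the scores normalized within the group as advantages in place of a learned critic. The function below is illustrative, not DeepSeek's implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages for one group of responses sampled for the same question.

    Instead of a learned critic (as in PPO), the baseline is the group's mean
    reward, and advantages are normalized by the group's standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 responses to one question, scored by a rule-based reward.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
advantages = group_relative_advantages(rewards)
# Correct responses receive positive advantages and incorrect ones negative,
# so the policy update shifts probability mass toward the better samples.
```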

In addition, the team used a rule-based reward system that computes accuracy and format rewards. On top of GRPO and this reward design, the team defined a training template that requires DeepSeek-R1-Zero to first produce its reasoning process and only then the final answer; during training, the prompt placeholder in the template is replaced with specific reasoning questions.
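A sketch of what such a template can look like is given below; the exact wording DeepSeek used may differ, so treat the string as a placeholder that captures the structure (reasoning first, answer second, a slot for the question).

```python
# Illustrative training template; the exact wording DeepSeek used may differ.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and "
    "the Assistant solves it. The Assistant first thinks through the reasoning "
    "process and then provides the answer. The reasoning process and answer are "
    "enclosed within <think> </think> and <answer> </answer> tags, respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

def build_prompt(question: str) -> str:
    # At training time, the placeholder is replaced with a concrete reasoning question.
    return R1_ZERO_TEMPLATE.format(question=question)
```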

Learning to Rethink in an Anthropomorphic Tone

Specifically, after receiving a user's question, the model first writes out its reasoning process inside a "think" tag and then gives the final answer inside an "answer" tag, which lets it explore effective reasoning paths on its own during reinforcement learning. Meanwhile, the research team used a rule-based reward system to evaluate the answers DeepSeek-R1-Zero produced in the experiments, ensuring that training remained stable and scalable.
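That rule-based reward can be sketched as two simple checks, as below: a format reward that verifies the think/answer tags are present and well formed, and an accuracy reward that compares the extracted answer with the reference. The regular expression, the string comparison, and the equal weighting are illustrative assumptions rather than the paper's exact rules.

```python
import re

# Matches a completion of the form <think>...</think><answer>...</answer>.
TAG_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected tag format, else 0.0."""
    return 1.0 if TAG_PATTERN.search(completion) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the reference.

    A plain string comparison is used here for illustration; a real verifier
    would parse math expressions or run unit tests for code tasks.
    """
    match = TAG_PATTERN.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group("answer").strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is an illustrative choice, not the paper's exact scheme.
    return accuracy_reward(completion, reference) + format_reward(completion)
```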

The evaluation results show that DeepSeek-R1-Zero's pass@1 score on the AIME 2024 math competition rose sharply from an initial 15.6% to 77.9%. With a self-consistency decoding strategy, accuracy climbs further to 86.7%, exceeding the average level of human contestants.
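For context on the two metrics, the generic sketch below shows how pass@1 (average single-sample accuracy over several sampled answers) and a self-consistency, i.e. majority-vote, score are typically computed; it is not DeepSeek's evaluation code.

```python
from collections import Counter
from typing import Callable, Sequence

def pass_at_1(answers: Sequence[str], is_correct: Callable[[str], bool]) -> float:
    """Average single-sample accuracy over k sampled answers to one problem."""
    return sum(is_correct(a) for a in answers) / len(answers)

def self_consistency(answers: Sequence[str], is_correct: Callable[[str], bool]) -> float:
    """Majority vote over sampled final answers: 1.0 if the most common one is correct."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return 1.0 if is_correct(majority_answer) else 0.0

# Toy example: three samples, two of which agree on the right answer.
def is_correct(answer: str) -> bool:
    return answer == "42"

print(pass_at_1(["42", "41", "42"], is_correct))         # ~0.67
print(self_consistency(["42", "41", "42"], is_correct))  # 1.0
```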

Beyond math tasks, the model also performs strongly on programming-competition problems and graduate-level biology, physics, and chemistry questions, further confirming the effectiveness of reinforcement learning in enhancing the reasoning ability of large language models.

Comparison of AIME accuracy of DeepSeek-R1-Zero during training with the average performance of human contestants (green baseline)

Moreover, during reinforcement learning, DeepSeek-R1-Zero not only shows steadily improving reasoning ability as training progresses but also displays clear self-evolution. Experimental data show that, driven by internal incentives, the model's average reasoning length keeps growing during training and it continually corrects its reasoning paths: it can actively pause, review, and revise earlier reasoning steps, achieving reflective reasoning and a systematic exploration of alternative solutions.

Average response length of DeepSeek-R1-Zero on the training set during the reinforcement learning process

Furthermore, to address DeepSeek-R1-Zero's problems of poor readability and language mixing, the research team developed DeepSeek-R1. Its workflow is as follows (a sketch of the pipeline follows the figure below):

* Collect conversational cold-start data consistent with human thinking, built on DeepSeek-V3, and feed it into DeepSeek-R1 Dev1;

* DeepSeek-R1 Dev1 undergoes reinforcement learning and sampling on this data, and DeepSeek-R1 Dev2 incorporates both reasoning and non-reasoning datasets into the SFT process;

* DeepSeek-R1 Dev3 moves into a second reinforcement-learning phase to improve the model's helpfulness and harmlessness, ultimately yielding DeepSeek-R1.

Multi-stage pipeline of DeepSeek-R1
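A schematic sketch of that multi-stage pipeline is below; every helper function is a placeholder stub named by us for illustration, not DeepSeek's code, and the stage boundaries follow the description above.

```python
# Schematic outline of the multi-stage pipeline described above.
# All helpers are placeholder stubs so the sketch runs end to end.
def collect_cold_start_data(base):          return ["cold-start dialogue A", "cold-start dialogue B"]
def supervised_finetune(model, data):       return f"{model} + SFT({len(data)} datasets)"
def reinforcement_learning(model, reward):  return f"{model} + RL({reward})"
def sample_reasoning_outputs(model):        return ["sampled reasoning traces"]
def non_reasoning_datasets():               return ["writing", "QA", "translation"]

def train_deepseek_r1(v3_base="DeepSeek-V3-Base"):
    # Dev1: cold-start SFT, then reasoning-oriented RL with rule-based rewards.
    dev1 = supervised_finetune(v3_base, collect_cold_start_data(v3_base))
    dev1 = reinforcement_learning(dev1, reward="rule-based")

    # Dev2: a fresh SFT round on sampled reasoning outputs plus non-reasoning data.
    sft_corpus = sample_reasoning_outputs(dev1) + non_reasoning_datasets()
    dev2 = supervised_finetune(v3_base, sft_corpus)

    # Dev3: a second RL phase targeting helpfulness and harmlessness.
    dev3 = reinforcement_learning(dev2, reward="preference + rules")
    return dev3  # released as DeepSeek-R1

print(train_deepseek_r1())
```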

The experimental results show that, compared with DeepSeek-R1-Zero and DeepSeek-R1 Dev1, DeepSeek-R1 improves markedly in instruction-following at each development stage and scores higher on the IF-Eval and Arena-Hard benchmarks.

Experimental results of DeepSeek-R1 at each stage

The First Large Model to Pass Peer Review at an Authoritative Journal

As the first LLM to undergo peer review, DeepSeek-R1's research paper was featured on the cover of Nature upon publication. In the article "Bring us your LLMs: why peer review is good for AI models", Nature argued that peer review is an effective way to counter marketing hype in the AI industry: almost no mainstream large AI model had undergone independent peer review, and this gap "has finally been filled by DeepSeek."

Commenting on this, Subbarao Kambhampati, a researcher at Arizona State University and former president of AAAI, said he took part in the peer review and sees it as a welcome trend; he hopes more developers of cutting-edge models will follow suit and submit the technical details of their AI models to peer review.

Technology media outlet Wind Info reported that, compared with the initial version released in January, this paper reveals more details about the model's training process and directly addresses the earlier questions about distillation. It is fair to say that DeepSeek-R1 sets an example for more transparent and better-standardized AI research practices in the future.

References:

1. https://www.nature.com/articles/d41586-025-03015-6

2. https://www.nature.com/articles/d41586-025-02979-9

3. https://www.nature.com/articles/s41586-025-09422

This article is from the WeChat official account "HyperAI Superneural". Authors: Zihan, Libao Zhu. Republished by 36Kr with permission.