DeepSeek has added over 60 pages to the R1 paper. Is V4 about to be released?
On January 4th, DeepSeek updated its R1 paper on arXiv.
There was no official announcement and no tweet; only the version number changed from v1 to v2. But anyone who opened the PDF would notice the change: the page count swelled from 22 to 86, and the file size grew from 928 KB to 1,562 KB.
The new material includes a complete breakdown of the training pipeline, detailed results on more than 20 evaluation benchmarks, and a technical appendix running to dozens of pages. It reads almost like a full rewrite.
The timing is also suggestive. January 20th marks the first anniversary of R1's release, and looking further ahead, February 17th is the Lunar New Year. DeepSeek has a tradition of making big announcements before the Spring Festival: last year, both V3 and R1 arrived in that window.
Is this major update to an "old" paper a prelude to something new? To answer that, let's first look at what these 86 pages actually contain.
The "Three Lifetimes" of a Paper
To understand the significance of this update, we need to first review the complete journey of the R1 paper.
On January 20th, 2025, a 22-page preprint appeared: DeepSeek published the R1 paper on arXiv. Its core claim was that pure reinforcement learning could teach large models to reason on their own, without human-annotated chain-of-thought data. The paper, the model, and the method were all open-sourced, which set the global AI community ablaze.
On September 17th, 2025, the R1 paper made the cover of Nature, with Liang Wenfeng as corresponding author. It was the first mainstream large model in the world to pass peer review at a top-tier academic journal. Eight reviewers examined the paper, raising questions point by point, and DeepSeek answered each one. Their concerns included whether R1 had been trained on the outputs of OpenAI models (the so-called "distillation" suspicion), the specific sources of the training data, and safety details. In its response, DeepSeek flatly denied the distillation accusation and disclosed the training cost for the first time: going from V3-Base to R1 cost only $294,000.
Nature ran an editorial alongside it, noting that mainstream large-model companies usually do not subject their model releases to independent review. "This gap has changed with Nature's publication of the details of DeepSeek-R1."
On January 4th, 2026, the 86-page full version went online. This latest revision syncs the technical details from the Nature version back to arXiv: the complete breakdown of the Dev1, Dev2, and Dev3 training stages, the expanded evaluation data, and the technical material in Appendices A-F are now freely available to everyone.
Updating a preprint after journal publication is common academic practice. But expanding from 22 pages to 86, nearly quadrupling the content, is rare. To some extent, DeepSeek has turned a paper into a technical encyclopedia: it wants everyone to be able to reproduce R1, not just understand it.
What's New? Breaking Down the 64-Page Addition
The Training "Black Box" Opens: Dev1, Dev2, and Dev3 Disclosed for the First Time
The original paper was terse about the training process: cold-start SFT → reinforcement learning → final SFT, three steps listed without going into detail. The new version breaks this pipeline down completely and introduces three intermediate checkpoints: Dev1, Dev2, and Dev3.
[Figure: the complete training pipeline of R1]
Dev1 is the product of the cold-start stage. Here the model learns to follow instructions (its instruction-following ability improves markedly), but at the cost of a drop in reasoning ability: the data disclosed in the paper show that Dev1 scores worse than the base model on the AIME math competition.
Dev2 exists specifically to rescue reasoning. This stage runs reasoning-oriented RL only, bringing math and coding ability back up while holding instruction-following at its new level.
Dev3 is the final polish. High-quality data is generated through rejection sampling, and another round of SFT is performed so that the model outputs stably on both reasoning tasks and general tasks.
This three-stage process (first teach the rules, then build the core reasoning strength, finally polish the output format) answers a question many people have asked: why R1 can sustain long chains of reasoning without lapsing into the chaotic, Chinese-English-mixed output of R1-Zero.
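To make the Dev3 step more concrete, here is a minimal sketch of rejection-sampling data construction, assuming hypothetical generate and verify helpers; it illustrates the general technique rather than DeepSeek's actual pipeline.

```python
import random

# Hypothetical stand-ins for a model sampler and an answer checker;
# neither comes from the paper, they only make the sketch runnable.
def generate(prompt: str, n: int) -> list[str]:
    return [f"candidate answer {i} for: {prompt}" for i in range(n)]

def verify(prompt: str, answer: str) -> bool:
    return random.random() < 0.3  # pretend roughly 30% of samples pass the check

def build_sft_data(prompts: list[str], samples_per_prompt: int = 8) -> list[dict]:
    """Rejection sampling: keep only candidates that pass verification,
    then reuse them as supervised fine-tuning pairs."""
    sft_pairs = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)
        accepted = [c for c in candidates if verify(prompt, c)]
        sft_pairs.extend({"prompt": prompt, "response": c} for c in accepted)
    return sft_pairs

if __name__ == "__main__":
    data = build_sft_data(["Prove that the square root of 2 is irrational."])
    print(f"kept {len(data)} of 8 sampled responses")
```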
From 5 Benchmarks to 20+: A Fully Expanded Evaluation Suite
The original paper's evaluation centered on a handful of core benchmarks: the AIME math competition, Codeforces programming, and the MATH dataset. The new version broadens coverage to more than 20 benchmarks, including MMLU, MMLU-Pro, DROP, GPQA Diamond, IFEval, Arena-Hard, SWE-bench Verified, and LiveCodeBench.
[Figure: R1-Zero's training curve, with accuracy rising from 15.6% to 77.9% and surpassing the human level marked by a green dotted line]
Even more noteworthy is the introduction of a human baseline: the new version compares R1's AIME scores directly with the average scores of human contestants. Over the course of R1-Zero's training, pass@1 rose from 15.6% to 71.0%, and reached 86.7% with majority voting, surpassing the average human level.
Measuring against humans like this says more than simply climbing leaderboards.
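For readers unfamiliar with the two metrics, here is a minimal sketch of how pass@1 and majority voting are commonly computed from multiple sampled answers; the toy data and helper names are illustrative and not taken from the paper.

```python
from collections import Counter

def pass_at_1(samples: list[str], reference: str) -> float:
    """pass@1 estimated as the fraction of sampled answers that are correct."""
    return sum(answer == reference for answer in samples) / len(samples)

def majority_vote_correct(samples: list[str], reference: str) -> bool:
    """Majority voting (self-consistency): take the most frequent final answer
    among the samples and check it against the reference."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == reference

# Toy example: 16 sampled final answers to one AIME-style problem.
samples = ["42"] * 7 + ["41"] * 5 + ["43"] * 4
print(pass_at_1(samples, "42"))             # 0.4375
print(majority_vote_correct(samples, "42")) # True
```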
The RL Alchemy Manual: The "Secrets" in Appendices A-F
For researchers who want to reproduce R1, the newly added appendices may be the most valuable part.
Appendix A details the implementation of GRPO (Group Relative Policy Optimization), including key hyperparameters such as the learning rate, KL coefficient, and sampling temperature. Appendices B-F cover reward-function design, data-construction strategies, and evaluation details. Where the original paper read like a statement of methodology, the new version reads like an operating manual: parameters pinned down, procedures spelled out, pitfalls flagged.
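As a rough illustration of the core idea behind GRPO, not DeepSeek's actual implementation, the sketch below computes group-relative advantages: the rewards of a group of responses to the same prompt are normalized by that group's own mean and standard deviation, which replaces a learned value model as the baseline.

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and standard deviation of its own
    group, the group-relative baseline GRPO uses instead of a critic network."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Toy example: rewards for 4 sampled responses to one prompt (1 = correct, 0 = wrong).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# Correct answers get a positive advantage, wrong ones a negative one.
```

In the full algorithm these advantages feed into a PPO-style clipped objective with a KL penalty toward a reference policy, which is where hyperparameters like the KL coefficient come in.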
As one technical commentary put it, where the original version focused on high-level methodology and results, the new appendices provide a complete, transparent guide for anyone who wants to understand how the model works.
The Failed Attempts Written into the Paper
The new version also contains an easily overlooked section: Unsuccessful Attempts.
DeepSeek admits that it tried MCTS (Monte Carlo Tree Search) and PRM (Process Reward Models). These were among the hottest research directions in the industry over the past year, and many top-tier labs had bet heavily on them. The verdict: they didn't work, at least not for general reasoning tasks.
The paper's explanation is that such methods demand very fine step granularity; they suit settings like mathematical proofs, where each step can be clearly verified, but are hard to generalize to more open-ended reasoning tasks. That matches discussion in the developer community that PRM and MCTS may constrain reinforcement learning's exploration space and fit only problems with clear boundaries.
Writing up failures is not unusual in academia, but it is rare in industry-led large-model research. To some extent, DeepSeek has demystified the field: the directions the giants are struggling with may not be the right ones.
Going from 22 pages to 86, what DeepSeek has added is, above all, reproducibility. Which raises a question: why choose this moment to do it?
Why Now?
Syncing content back to the preprint after journal publication is routine in academia. Even so, a few things about this update of the R1 paper stand out.
First, the timing. The paper was updated on January 4th; January 20th is the first anniversary of R1's release; February 17th is the Lunar New Year. String those three dates together and it is hard not to speculate. Last year, both V3 and R1 landed in the Spring Festival window, and DeepSeek seems to have developed something of a "New Year's goods" tradition. Many people on X are already asking: "Will we hear news from the 'whale' soon?"
Second, the update itself is unusual. Most papers never change after publication, beyond fixing typos. Adding more than 60 pages in one go, and making internal implementation details, ablation experiments, and even failed attempts public, is rare in an AI industry that prizes competitive moats.
How should we read this unusual move? One interpretation is that these techniques no longer give DeepSeek's current research a competitive edge and the team has moved on to newer directions. Taken together with the mHC architecture paper published on January 1st, the outline of the next-generation model seems to be emerging.
Another interpretation is a defensive open-source strategy. Fully disclosing year-old technical details turns them into public knowledge, making it harder for competitors to patent similar techniques or build barriers around them. Rather than let R1's technology be gradually diluted in closed-source competition, better to release it proactively and raise the level of the whole open-source community.
There is also an easily overlooked detail: the author list. The paper marks departed team members with asterisks, yet among more than 100 contributors only 5 carry one, and all 18 core authors are still there a year on. More intriguingly, one researcher who previously had an asterisk no longer does, suggesting they have rejoined the team. Near-zero attrition of the core team is itself rare in today's fiercely competitive AI talent market.
Looking back over the past year, DeepSeek's rhythm has consistently been paper first, model later. The V3 paper detailed the MoE architecture and the MLA attention mechanism; the R1 paper dissected the pure-RL training framework; the mHC paper tackled training stability. None of these was a retrospective; each laid groundwork for what came next. In that sense, this 86-page update follows the same logic: before the next big move, the technical debts of the previous stage are cleared completely.
As for what that "big move" is and when it will come, the answer may be revealed soon.
This article is from the WeChat official account "Silicon Star People Pro", written by Zhou Yixiao and published by 36Kr with authorization.