DeepSeek makes a strong comeback and open-sources an IMO gold medal-level math model
A breakthrough-level reasoning model has arrived. DeepSeek has opened up the direction of self-verifiable mathematical reasoning.
The whale is back!
Just now, DeepSeek quietly uploaded a new model on Hugging Face: DeepSeek-Math-V2.
As the name suggests, this is a mathematics-focused model. Its predecessor, DeepSeekMath-7B, was released more than a year ago; with only 7B parameters, it achieved performance comparable to GPT-4 and Gemini Ultra on math benchmarks. Its accompanying paper also introduced GRPO (Group Relative Policy Optimization) for the first time, significantly improving mathematical reasoning ability.
After a year and a half, what surprises does DeepSeek-Math-V2, developed based on DeepSeek-V3.2-Exp-Base, bring?
DeepSeek says its performance surpasses Gemini DeepThink, reaching the level of an IMO gold medalist.
- Paper title: DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
- Model address: https://huggingface.co/deepseek-ai/DeepSeek-Math-V2
- Paper address: https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf
- Core authors: Zhihong Shao, Yuxiang Luo, Chengda Lu, Z.Z. Ren
At the beginning of the paper, DeepSeek points out a limitation of current AI research on mathematical reasoning: using the correctness of the final answer as the reward signal, and thereby over-optimizing final-answer accuracy.
Although this approach can push reasoning models to high, even saturated, scores on benchmarks such as AIME and HMMT, DeepSeek argues it does not solve the core problem: a correct answer does not guarantee a correct reasoning process. Moreover, many mathematical tasks (such as theorem proving) require rigorous step-by-step derivation rather than a single numerical answer, which makes final-answer rewards inapplicable.
To push the limits of deep reasoning, DeepSeek believes it is necessary to verify the comprehensiveness and rigor of mathematical reasoning.
They note: "Self-verification is particularly important when scaling test-time compute, especially for open-ended questions without known solutions."
To achieve self-verifiable mathematical reasoning, DeepSeek first studied how to train an accurate and reliable LLM-based theorem-proof verifier. They then used this verifier as a reward model to train a proof generator, encouraging the generator to find and fix as many issues in its own proofs as possible before finalizing them.
To maintain the generation-verification gap as the generator grows stronger, DeepSeek proposes scaling verification compute to automatically label new, hard-to-verify proofs, producing training data that further improves the verifier.
In simple terms, the core goal of DeepSeek's paper is not just to make AI get the questions right, but to make AI "not only be able to solve problems, but also check itself, and even honestly admit where it went wrong."
To achieve this, they designed a system consisting of three key roles, which we can understand through an analogy of "student - teacher - supervisor":
First, train a qualified "examiner" (Proof Verification).
In the past, when training AI math models, people usually looked only at whether the final answer was correct. In advanced proof problems (such as those in the Olympiads), however, the rigor of the process matters more than the answer. The DeepSeek team therefore first trained a dedicated verifier, which acts like an "examiner." This examiner does not just mark right or wrong; it learns to grade a proof on a three-level scale, like a human expert:
- 1 point: perfect, rigorously argued.
- 0.5 points: broadly correct, but with minor flaws or missing details.
- 0 points: fundamental logical errors or serious omissions.
It must not only score but also comment: the model is required to write an analysis before assigning a score, noting what is sound and what is problematic.
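The "analysis first, score last" contract can be sketched as a small parser for the verifier's output. The response format (`Score: 0.5` on the final line) is a hypothetical convention for illustration; DeepSeek's actual prompt and output format are not described at this level of detail.

```python
import re
from dataclasses import dataclass

# Rubric from the paper's description: 1 = rigorous, 0.5 = minor flaws, 0 = fundamental errors.
VALID_SCORES = {0.0, 0.5, 1.0}

@dataclass
class Verdict:
    analysis: str   # free-text critique the verifier must write first
    score: float    # one of 0, 0.5, 1

def parse_verdict(raw: str) -> Verdict:
    """Parse a verifier response of the form '<analysis> ... Score: 0.5'.

    Enforces the requirement that the model justify its grade (non-empty
    analysis) before committing to a score on the three-level rubric.
    """
    text = raw.strip()
    m = re.search(r"Score:\s*(0(?:\.5)?|1(?:\.0)?)\s*$", text)
    if m is None:
        raise ValueError("no final score line found")
    score = float(m.group(1))
    if score not in VALID_SCORES:
        raise ValueError(f"score {score} not in rubric")
    analysis = text[: m.start()].strip()
    if not analysis:
        raise ValueError("verifier must write an analysis before scoring")
    return Verdict(analysis=analysis, score=score)
```

Rejecting responses that skip the analysis is what makes the grade auditable: the written critique is exactly what the meta-verification stage below inspects.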
Next, assign a "supervisor" to the teacher (Meta-Verification).
DeepSeek found a problem: the examiner sometimes deducts points arbitrarily. It may assign a low score while citing errors that do not actually exist (i.e., it hallucinates).
To solve this, they introduced a meta-verification mechanism, which is like assigning a "supervisor" to the teacher. The supervisor's job is not to grade the papers but to check whether the teacher's comments are reasonable. This yields a double confirmation: do the errors the teacher points out actually exist, and is each deduction justified? By training the model to act as both teacher and supervisor, the accuracy and credibility of the AI's proof evaluations improve substantially.
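A toy consistency rule captures the spirit of this check (this is an illustrative simplification, not DeepSeek's actual meta-verifier, which is itself a trained model): a deduction must be backed by at least one cited issue that actually refers to content in the proof, and a perfect score must cite no issues.

```python
def meta_check(score: float, cited_issues: list[str], proof_steps: list[str]) -> bool:
    """Toy meta-verification rule: is this (score, critique) pair internally consistent?

    - A perfect score (1.0) with complaints attached is inconsistent.
    - A deduction with no cited issue is an unjustified penalty.
    - Every cited issue must quote text that actually appears in the proof,
      otherwise the examiner likely hallucinated the error.
    """
    if score == 1.0:
        return not cited_issues
    if not cited_issues:
        return False
    return all(any(snippet in step for step in proof_steps) for snippet in cited_issues)
```

In the real system this substring test would be replaced by a second model judging whether each claimed flaw genuinely invalidates the cited step.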
Then, train a "self-reflective" student (Proof Generation with Self-Verification).
With a sound examination system in place, the next step is to train the "student" (the generator) that solves problems. Here lies a key innovation: the honest reward mechanism. The model does not just solve the problem; after outputting its solution, it must immediately append a self-evaluation and grade itself (0, 0.5, or 1).
It rewards honesty:
- If the model gets the problem wrong but honestly points out its own mistakes in the self-evaluation, it will be rewarded.
- Conversely, if it gets the problem wrong but insists it is right (overconfident) or tries to bluff its way through, it is penalized (receives a low reward).
This forces the AI to think deeply before committing to an answer, trying to discover and correct its own mistakes until it genuinely believes the proof is correct.
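One way to see why this works is a toy reward-shaping function (the weighting and exact form are assumptions for illustration; the paper defines its own reward): the generator is paid both for proof quality and for an accurate self-grade, so admitting a flaw beats bluffing.

```python
def honesty_reward(true_score: float, self_score: float) -> float:
    """Toy reward for self-verifying generation (illustrative only).

    true_score: the verifier's grade of the proof (0, 0.5, or 1).
    self_score: the grade the generator assigned to itself.

    Half the reward comes from proof quality, half from how closely the
    self-assessment matches the verifier's judgment.
    """
    quality = true_score
    honesty = 1.0 - abs(true_score - self_score)  # 1.0 when the self-grade is exact
    return 0.5 * quality + 0.5 * honesty
```

Under this shaping, a wrong proof honestly flagged as wrong earns 0.5, while the same wrong proof confidently claimed correct earns 0, which is precisely the incentive the article describes.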
Finally, form an automated closed-loop (Synergy).
Human experts cannot produce detailed step-by-step grades for thousands of Olympiad problems, so DeepSeek designed an automated pipeline in which the system plays against itself and self-evolves:
- Massive generation: Let the "student" generate many solutions to the same question.
- Collective voting: have the "teacher" evaluate each solution multiple times. If most evaluations find a flaw in a solution, it is judged flawed; if no flaw is found, it is judged correct.
- Improve through practice: the system automatically filters out problems that are hard to grade or hard to solve correctly and turns them into new training material for both the "teacher" and the "student." As the student's problem-solving ability grows, the teacher's judgment becomes correspondingly sharper.
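The voting and filtering steps above can be sketched as follows. The vote count, consensus thresholds, and `verify` interface are assumptions for illustration; the paper's pipeline is more involved.

```python
def majority_verdict(scores: list[float], threshold: float = 0.5) -> bool:
    """Accept a proof only if the fraction of evaluations finding no flaw
    (score == 1.0) exceeds the threshold — the 'collective voting' step."""
    flawless = sum(1 for s in scores if s == 1.0)
    return flawless / len(scores) > threshold

def select_hard_cases(proofs, verify, n_votes: int = 8, consensus: float = 0.75):
    """Keep proofs on which repeated verifier votes disagree.

    Low consensus means the proof is hard to grade; these cases become new
    training data for the verifier (the 'improve through practice' step).
    """
    hard = []
    for proof in proofs:
        votes = [verify(proof) for _ in range(n_votes)]
        ones = votes.count(1.0)
        agreement = max(ones, n_votes - ones) / n_votes
        if agreement < consensus:
            hard.append(proof)
    return hard
```

The key design point is that no human grading enters the loop: disagreement among the verifier's own samples is used as the signal for which examples are worth training on next.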
In short, the method of DeepSeekMath-V2 essentially shifts from "result-oriented" to "process-oriented." Rather than relying on large volumes of final-answer data, it teaches the AI to rigorously review proof processes (including its own) the way a mathematician would, so that it can keep improving on high-difficulty proof problems without human intervention.
Finally, they obtained the DeepSeekMath-V2 model, which shows strong theorem-proving ability: it achieved gold-medal-level results on IMO 2025 and CMO 2024, and with scaled test-time compute scored a near-perfect 118/120 on Putnam 2024.
The figure below shows DeepSeekMath-V2's performance on the IMO-ProofBench benchmark (a subset of IMO-Bench containing 60 proof problems). On the Basic split, DeepSeekMath-V2 far outperforms other models, reaching a striking score of nearly 99%. On the harder Advanced split, it trails Gemini Deep Think (IMO Gold) slightly.
DeepSeek says: "Although there is still a lot of work to be done, these results show that self-verifiable mathematical reasoning is a feasible research direction, which is expected to promote the development of more powerful mathematical AI systems."
This self-verifiable mathematical reasoning framework arguably breaks through a limitation of traditional reinforcement learning: the model no longer relies on final-answer correctness as the sole reward, focusing instead on the rigor of the reasoning process. Moreover, the two-way improvement cycle of verifier-generator collaboration in DeepSeekMath-V2 yields comprehensive, rigorous mathematical reasoning and significantly reduces hallucination in the large model.
The paper presents further technical details; interested readers are encouraged to read it in full.
This article originally appeared on the WeChat official account "Almost Human" (ID: almosthuman2014), authored by Almost Human, and is republished by 36Kr with authorization.