GPT-5 is in danger. DeepSeek has open-sourced the world's first AI with an International Mathematical Olympiad gold medal, directly challenging Google.
DeepSeek, silent for a long stretch, is back. Today DeepSeekMath-V2 made its debut and reached gold-medal level on IMO 2025, matching, and by some measures surpassing, Google's IMO gold-medal model. Open-source AI has regained the upper hand.
DeepSeek is back again!
Just now, DeepSeek officially released its new model, DeepSeekMath-V2, which reaches gold-medal level on IMO 2025.
Most importantly, this is the first open-source IMO gold-medal model.
It is built on the DeepSeek-V3.2-Exp base model.
Until now, the only models officially announced to have reached gold-medal level were Google's Gemini Deep Think and an internal model from OpenAI.
Across top competitions, DeepSeekMath-V2 demonstrated powerful theorem-proving capability:
IMO 2025: Solved 5 out of 6 questions, reaching the gold medal level;
CMO 2024 (China Mathematical Olympiad): Reached the gold medal level;
Putnam 2024: Scored 118 out of 120 points, surpassing the top human contestant's score of 90.
Moreover, on ProofBench-Basic, DeepSeekMath-V2 outperformed Google's gold-medal model, Gemini Deep Think, by a wide margin; on ProofBench-Advanced, it trailed Google only narrowly.
In the paper, the team trained an LLM-based verifier as a reward function and used it to train the model to solve problems autonomously.
Moreover, they also scaled up the computing power of the verifier to annotate more complex proofs and further optimized the verifier itself.
This method is very ingenious and can effectively bridge the gap between generation and verification.
The results are empirical evidence that verifiable mathematical reasoning is a promising direction for future research.
DeepSeekMath-V2 Makes "Self-Verification" the Most Powerful Weapon
The DeepSeekMath-V2 paper was released on GitHub at the same time.
The core breakthrough brought by DeepSeekMath-V2, the latest release from DeepSeek, is Self-Verification.
This not only lets it rival top human contestants in the hardest math competitions; more importantly, it points to a path toward more advanced AI: learning to self-reflect.
Why Looking Only at the Results Is Not Enough
In the past, the method of training AI to solve math problems was very simple: give it a problem, and if the answer it calculates is consistent with the standard answer, reward it.
This is very effective in simple calculation problems (such as the AIME competition).
However, at the level of the International Mathematical Olympiad (IMO), the jewel in the crown of competition mathematics, this method fails completely.
IMO problems usually have no simple numerical answer; they demand that you write a logically watertight proof.
Previously, AI was often a "big bluffer" here: it could string together professional-sounding mathematical jargon and force its way to a conclusion. Even when it guessed the result correctly, the process was riddled with holes.
DeepSeekMath-V2 decided to fundamentally change the rules. It not only rewards correct answers but also rewards a rigorous "self-checking" process.
Secret Weapon: Three Roles That Challenge One Another
To achieve this "self-reflection", DeepSeek designed a sophisticated "mutual challenge" system, like having three people living in the AI's brain:
1. "Problem Solver" (proof generator):
Responsible for solving problems and writing proofs.
Unlike before, it is trained to write not only the answer but also a "self-evaluation". It must honestly say, "I'm a bit unsure about this step; it might be wrong."
The research team ingeniously designed the rewards, resulting in the following incentive effects:
- Honestly facing mistakes is more beneficial than "insisting that you are right".
- Writing a genuinely correct proof and accurately assessing its own rigor earns the highest reward.
- For the generator, the optimal strategy is to find and correct as many problems as possible before the final answer.
2. "Iron-Fisted Judge" (proof verifier):
This is a scoring model specially trained by DeepSeek. It does not check whether the final answer is right; it focuses entirely on finding faults in the proof process. Like a competition grader, it scores each proof (0, 0.5, or 1 point) and points out the specific logical gaps.
- 1 point: The proof is complete and rigorous, and all key reasoning steps are clearly and fully justified;
- 0.5 points: The overall approach is correct, but there are minor errors or some justifications are omitted;
- 0 points: There are fatal logical errors or key gaps, making the proof essentially invalid.
3. "Auditor of the Judge" (meta-verifier):
This is the most ingenious step. Because the "judge" may also make mistakes or be lazy and make random judgments.
So DeepSeek introduced a "meta-verification" mechanism to specifically check whether the "judge" is randomly picking faults. If the "judge" points out a non-existent error, it will be "punished" by the "auditor".
The "meta-verifier" checks the analysis given by the verifier, including:
1. Whether the problems pointed out by the verifier actually exist in the original proof;
2. Whether these problems are sufficient to reasonably support the score it gives and comply with the original scoring rules.
Evaluated by the meta-verifier, the average quality score of the verifier's analyses rose from 0.85 to 0.96, while the original scoring accuracy was maintained.
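The interplay of the three roles can be sketched as a minimal control loop. This is an illustrative toy, not the paper's implementation: all three roles are LLM calls in the real system, and the rule-based stand-in functions here are hypothetical.

```python
# Toy sketch of the generator / verifier / meta-verifier loop.
# In the real system each role is an LLM; these stand-ins only
# demonstrate the control flow and the 0 / 0.5 / 1 rubric.

def generator(problem):
    """Writes a proof plus an honest self-evaluation of its rigor."""
    proof = f"Proof of {problem}: ... step 3 handwaves the boundary case."
    self_eval = "Unsure about step 3."  # honesty is rewarded over bluffing
    return proof, self_eval

def verifier(proof):
    """Grades the proof (0, 0.5, or 1) and cites concrete issues."""
    issues = [s for s in ["step 3 handwaves the boundary case"] if s in proof]
    score = 1.0 if not issues else 0.5  # 1 rigorous, 0.5 minor gaps, 0 fatal
    return score, issues

def meta_verifier(proof, score, issues):
    """Accepts the verifier's review only if every cited issue really
    appears in the proof and the issue list justifies the score."""
    cited_exist = all(i in proof for i in issues)
    score_matches = (score == 1.0) == (len(issues) == 0)
    return cited_exist and score_matches

proof, self_eval = generator("an IMO-style statement")
score, issues = verifier(proof)
ok = meta_verifier(proof, score, issues)  # a bad review would be penalized here
```

A verifier that invented a non-existent flaw would fail the `cited_exist` check, which mirrors how the meta-verifier punishes the "judge" for fault-finding at random.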
With the cooperation of these three, DeepSeekMath-V2 can even set questions for itself, solve them, correct them, and redo them without a standard answer.
First, a positive "closed-loop" has been formed between the proof verifier and the proof generator:
- The verifier provides reward signals to the generator, thereby continuously improving the generator's proof ability;
- As the generator's level improves, it will produce more and more "tricky" new proofs, which in turn will expose the verifier's weak points that have not been covered.
Especially those proof samples for which the verifier failed to detect problems on the first attempt are of extremely high value for further training the verifier.
To efficiently obtain the correctness labels of new proofs, the research team designed an automated label generation process:
In the last two rounds of training iterations, this fully automated annotation pipeline has completely replaced manual annotation. Subsequent quality checks show that the automatically generated labels are highly consistent with the judgments of human experts.
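The hard-sample mining described above can be sketched in a few lines. The helper names and the substring-based "checks" are purely hypothetical stand-ins; in the paper, both checks are verifier runs, with the scaled check using much more compute.

```python
# Hypothetical sketch of mining hard samples for verifier training:
# keep the proofs that a cheap check passes but scaled verification rejects.

def quick_check(proof):
    # toy stand-in for a single cheap verifier pass (misses subtle flaws)
    return "obvious error" not in proof

def scaled_check(proof):
    # toy stand-in for scaled-compute verification (catches subtle flaws too)
    return "error" not in proof

def mine_hard_samples(proofs):
    """Proofs the quick check accepted but scaled verification rejected:
    exactly the samples most valuable for further verifier training."""
    return [p for p in proofs if quick_check(p) and not scaled_check(p)]

proofs = [
    "clean proof",
    "proof with obvious error",
    "proof with subtle error",  # the quick check misses this one
]
hard = mine_hard_samples(proofs)  # ["proof with subtle error"]
```

The same rejected-by-consensus labels can then serve as the automatic annotations that replaced manual labeling in the final training rounds.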
Clash of the Titans: DeepSeek vs Gemini
DeepSeek is not alone in this field.
Gemini Deep Think from Google DeepMind is also a top contender that has just reached IMO gold-medal level.
The comparison between the two is very interesting:
- DeepMind is like a noble with endless resources. Its strength is unquestionable, and it still maintains the lead in some advanced benchmark tests (such as IMO-ProofBench Advanced).
- DeepSeek is like a genius teenager who emerged out of nowhere. According to DeepSeek's paper, their V2 model has overtaken Gemini Deep Think on the basic test set (ProofBench Basic) and demonstrated amazing dominance in public competition questions.
More importantly, DeepSeek open-sourced this technical path and detailed the training method.
This serves as a wake-up call for AI researchers around the world: On the road to AGI, self-verification may be more important than simply piling up computing power.
Catching Up with Google and OpenAI, the Open-Source IMO Model Wins
Behind this amazing achievement is a certain "counterintuitive" evolutionary characteristic demonstrated by DeepSeekMath-V2 in experiments.
The Ability to "Get It Right the First Time": Completely Crushing GPT-5 and Gemini
If we strip away all the complex processes of repeated thinking and verification and only look at the model's "first intuition" - that is, the so-called One-Shot ability, DeepSeekMath-V2 still shows dominant strength.
The research team built an internal test set CNML containing five major categories of difficult problems in algebra, geometry, number theory, combinatorics, and inequalities (with a difficulty level comparable to the China High School Mathematics League).
In this arena, DeepSeekMath-V2 went head-to-head with two of the strongest reasoning models on the market: GPT-5-Thinking-High from OpenAI and Gemini 2.5 Pro from Google DeepMind.
The results are shown in the figure below:
DeepSeekMath-V2 didn't just win narrowly; it won completely:
- Algebra: Far exceeded GPT-5 and Gemini;
- Geometry: The score was almost three times that of Gemini 2.5 Pro;
- Number theory and combinatorics: Also firmly in the first echelon.
This shows that even without giving the model the opportunity to "think more", its underlying ability is extremely strong.
The Key to Evolution: Letting the Model "Think a Few More Times"
What really makes DeepSeekMath-V2 different is its performance in continuous correction experiments.
When facing difficult problems at the level of IMO shortlist questions, the model often cannot write a perfect proof at once.
The experiment shows that if the model is allowed to "self-verify" - that is, after generating an answer, it checks itself for problems and then regenerates with the problems in mind - a miracle happens:
- Initial state (1 iteration): The model's average score was 0.15.
- Repeated thinking (8 iterations): When the model was allowed to perform a maximum of 8 "self-corrections", the quality score of the proof soared to 0.27.
More interestingly, if the model is allowed to choose the best one from the 32 solutions it generated (Best@32), its scoring accuracy is extremely high, and the score directly jumps to 0.42.
This confirms a key point: The model can not only correct mistakes but also has a very good sense of self-awareness. It clearly knows which answer of its own is the best.
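The two inference-time strategies just described, iterative self-correction and Best@32 selection, can be sketched as follows. The function names and the toy generate/verify stubs are hypothetical; they only illustrate the loop structure.

```python
# Hypothetical sketch of verifier-guided self-correction and best-of-n
# selection; real calls are LLM generations, not these toy stubs.

def refine(problem, generate, verify, max_iters=8):
    """Regenerate with the verifier's feedback until no issues remain."""
    proof = generate(problem)
    for _ in range(max_iters):
        score, issues = verify(proof)
        if not issues:                    # verifier is satisfied
            return proof, score
        proof = generate(problem, feedback=issues)  # think again
    return proof, verify(proof)[0]

def best_of_n(problem, generate, verify, n=32):
    """Best@32: sample n proofs, keep the one the verifier scores highest."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda p: verify(p)[0])

# Toy stubs: a draft has a gap; feedback produces a fixed revision.
def toy_generate(problem, feedback=None):
    return f"{problem}: revised" if feedback else f"{problem}: draft"

def toy_verify(proof):
    if "revised" in proof:
        return 1.0, []
    return 0.5, ["missing case"]
```

With the toy stubs, one round of feedback lifts the score from 0.5 to 1.0, mirroring the 1-iteration vs. 8-iteration gap reported in the experiments.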
The Crystallization of Brutal Aesthetics and Wisdom: High-Compute Search
The "miracle" of scoring 118 points (near the 120 maximum) on the Putnam competition, mentioned earlier, is not luck: it comes from a "High-Compute Search" strategy.
The DeepSeek team adopted an extremely strict testing method in the experiment:
1. A large number of candidates: Initially generate 64 candidate proofs for each problem