AI has reached gold-medal level at the International Mathematical Olympiad (IMO), but the mathematical community's "AlphaGo moment" has not arrived yet.
Within two days of the conclusion of the 2025 International Mathematical Olympiad (IMO) in Australia, the AI community plunged into a two-front battle over talent and technological dominance, triggered by claims of "IMO gold-medal" performance.
OpenAI was first to announce that its unreleased experimental reasoning model had reached the gold-medal threshold with 35 points. Two days later, DeepMind presented an equivalent score, officially certified by the IMO. This is the first time AI has matched top-tier students at the IMO, marking a leap in mathematical reasoning from a silver-medal result in 2024 to two gold-medal results in 2025.
Alongside the technical progress came the "drama" of industry competition: just as Demis Hassabis was publicly criticizing OpenAI for leaking its results prematurely, media reports revealed that three core researchers from DeepMind's gold-medal team had been poached by Meta.
The pace of progress in AI's mathematical ability is astonishing. But what exactly does an IMO gold medal signify? Is this the AlphaGo moment for mathematics? Will AI become a trustworthy collaborator in mathematical research, or a technology product shaped by market logic that dilutes what mathematics is really about?
In this article, we invited an IMO gold medalist to assess, from a firsthand perspective, the problem-solving logic and mathematical proficiency of the two AIs, and to explore the technical breakthroughs behind the competition and the future of mathematics.
01 The Battle Between DeepMind and OpenAI: Reaching IMO Gold Two Days Apart
When I woke up one day, I felt as if I had traveled back to high school: someone in my WeChat Moments was talking about the IMO (International Mathematical Olympiad, an international mathematics competition for high school students). As I remember it, only the most formidable students ever took on that challenge. Recently, however, AI achieved the feat: OpenAI and Google DeepMind both announced that their models had met the IMO gold-medal criteria.
Although the two announcements were only two days apart, there was plenty of drama: this year's IMO concluded on Sunday, July 20th, in Australia, but OpenAI announced its result as early as Friday night, July 18th.
Researcher Alexander Wei said on X that OpenAI's latest experimental reasoning model had cracked a major long-standing challenge in artificial intelligence: it solved 5 of the 6 IMO problems and scored 35 points. Each problem is worth 7 points, so the full score is 42, and 35 points was exactly the gold-medal threshold this year.
Two days later, DeepMind announced that an advanced version of its Gemini Deep Think model had achieved the same feat. DeepMind's model worked entirely in natural language throughout and likewise scored 35 points, a result certified by the official IMO organizing committee.
IMO President Gregor Dolinar said that DeepMind's solutions were astonishing in many respects: the examiners found them clear, rigorous, and mostly easy to follow.
OpenAI's result did not have the organizing committee's endorsement. Demis Hassabis pointedly noted on X: the reason we didn't announce our results on Friday was that we respected the IMO organizing committee's original request that all AI labs publish their results only after the official outcome has been verified by independent experts and the student contestants have received their due recognition.
He added that DeepMind's model is the first AI system to receive an official "gold-medal level" rating, a remark that read as a barely veiled dig at OpenAI and made OpenAI's earlier celebration look less justified.
But even more dramatically, the next day the media reported that three researchers from DeepMind's gold-medal model team had been poached by Meta. In the six months before that, 20 DeepMind employees had already been recruited by Microsoft.
It seems the battle between these top-tier labs is intensifying. While enjoying the drama, let's return to the IMO itself: what does it really mean for AI to reach gold-medal level?
First of all, this is far from the AlphaGo moment for mathematics. When AlphaGo defeated world Go champion Lee Sedol, it shocked the world, precisely because Go was considered one of the domains of human intelligence that machines would find hardest to surpass.
In 2022, DeepMind's AlphaFold accurately predicted protein structures, which was also hailed as the AlphaGo moment for biology. We at Silicon Valley 101 detailed its significance in last year's article "The Invasion of AI in the History of Biomedical Science."
However, this time 72 high school students also reached the gold-medal standard, and 5 of them scored a perfect 42 points, meaning they solved all 6 problems flawlessly, while both AI models solved only 5. So it is still too early to say that AI has surpassed humans in mathematical ability.
But even if it falls short of the AlphaGo standard, the IMO gold-medal results are enough to demonstrate the strong mathematical ability of today's large models. New York University professors Gary Marcus and Ernest Davis commented that the result was "very impressive."
02 The IMO as a Benchmark: Proving AI's Mathematical Reasoning Ability
Using the ability to solve IMO problems as a standard for evaluating AI's reasoning ability is not a new concept.
For example, last year DeepMind released two models built specifically for mathematics, AlphaGeometry and AlphaProof. Together they solved 4 of the 6 IMO problems, becoming the first AI systems to reach the silver-medal standard.
Image source: Google DeepMind
However, these two models did not solve the problems in natural language. Instead, they relied on "formal proof." Simply put, formal proof means translating a mathematical problem into a language that machines can "understand," after which the AI writes out, step by step, a logically rigorous and machine-verifiable solution in that formal language.
The tool for writing such proofs is called Lean (a modern theorem prover and functional programming language originally developed at Microsoft Research), and it works much like a programming language.
To let the AI solve the problems, researchers first had to "translate" the natural-language problems into Lean for the AI to process, and then convert the output back into human-readable answers. The whole pipeline took up to three days, far exceeding the two-day, nine-hour limit that human contestants face at the IMO.
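To make "formalization" concrete, here is a minimal illustrative sketch in Lean 4 (our own toy example, not an IMO problem, and assuming a recent toolchain where the built-in omega tactic is available): the claim that the sum of two even natural numbers is even, stated and proved in a form the machine can verify mechanically.

```lean
-- Toy illustration of a formal proof (not an IMO problem):
-- the sum of two even natural numbers is even.
-- Every step below is checked mechanically by Lean's kernel.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k :=
  match ha, hb with
  | ⟨m, hm⟩, ⟨n, hn⟩ =>
    -- witness: m + n; `omega` discharges the remaining linear-arithmetic goal
    ⟨m + n, by omega⟩
```

The point is that every symbol has a precise meaning and the proof either checks or it doesn't. That guarantee is what AlphaProof-style systems gained by going through Lean, and what Gemini Deep Think now does without.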
But this time, DeepMind's latest Gemini Deep Think model reached the IMO gold-medal standard with natural-language input and output throughout: the AI read the problems directly in natural language and wrote its answers in natural language, without relying on Lean or any other formal tool. This has significant implications.
For a long time, many people believed that language models have no real reasoning ability. Ask one "How many 'r's are there in the word 'strawberry'?" and it may "struggle" and miscount. Because natural language lacks an explicit logical structure, the reasoning process is unstable, which is why earlier models like AlphaProof had to convert natural language into Lean to sidestep that uncertainty.
But now, DeepMind has proven that language models themselves can perform high-level mathematical reasoning. Although neither DeepMind nor OpenAI has made the specific training process of their models public, this is undoubtedly a significant advancement compared to a year ago.
Li Yuanshan
Ph.D. candidate in logic at the University of Notre Dame:
We all know that today's AI learns a set of parameters from large amounts of data through various techniques; it does not start from a predefined set of logical rules that it then executes. Similarly, in the early days of using computers for mathematics, people thought the way to solve mathematical problems was to formalize all of mathematics and apply those rules. Now we see these companies trying to combine the two approaches, or even having language models output natural-language mathematics directly, without relying on a formal system at all.
Previously, AI scholars such as Gary Marcus argued that language models could not do real mathematical reasoning on their own. In his view, an AI model must rely on a formal language like Lean to output a machine-verifiable logical structure, which is then converted into natural language; that is, only a "hybrid model" like AlphaProof could potentially meet the standards of mathematical research.
Therefore, the success of Gemini Deep Think undoubtedly challenges Gary Marcus's view to some extent.
Li Yuanshan
Ph.D. candidate in logic at the University of Notre Dame:
You can see that the solutions generated by DeepMind's model are entirely in natural language, without any code. That is a significant difference from last year's system, which may have output natural language at the end but first had to translate the problems into a logical language and carry out a formal proof.
Previously, mathematicians might have equated computer-assisted mathematics with formal methods. But with the development of these language models and their demonstration of certain mathematical abilities, they may change their minds.
03 A Former IMO Gold Medalist's Review: Differences in Problem-Solving between OpenAI and DeepMind
To allow everyone to intuitively compare the solutions of AI and human contestants, we invited Hu Sulin, a former member of the Chinese IMO national team, to share his impressions of the AI's answers.
He told us that, on the five problems it answered, the AI showed clear problem-solving ideas and a complete logical chain, and genuinely deserved full marks.
However, comparing the two AIs' answers to specific problems reveals some interesting differences. Take Problem 2, a plane geometry problem, for example.
Hu Sulin
2019 IMO gold medalist:
Plane geometry problems are among the easiest types for AI to solve. Here the two AIs took different approaches. DeepMind's method was more geometric and natural; I think it was closer to what an ordinary human contestant would come up with. OpenAI's method, by contrast, was very brute-force: it used analytic geometry, converting the geometry problem directly into an algebra problem, and its solution involved an enormous amount of calculation. Human contestants usually would not grind through that much computation in the exam, so this method is probably easier for an AI to carry out than for a human contestant.
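To give a concrete sense of what "converting a geometry problem into an algebra problem" means, here is a minimal illustrative sketch of the coordinate (analytic-geometry) method, using a classic textbook claim rather than the actual IMO 2025 Problem 2, whose statement we do not reproduce here:

```latex
% Generic illustration of the coordinate ("analytic geometry") method,
% not the actual IMO 2025 Problem 2.
% Claim: the diagonals of a parallelogram bisect each other.
\begin{align*}
&\text{Place } A=(0,0),\quad B=(b,0),\quad D=(d_1,d_2),\quad C=B+D=(b+d_1,\,d_2).\\
&\text{Midpoint of diagonal } AC:\ \left(\tfrac{b+d_1}{2},\ \tfrac{d_2}{2}\right),
 \qquad \text{midpoint of diagonal } BD:\ \left(\tfrac{b+d_1}{2},\ \tfrac{d_2}{2}\right).\\
&\text{The midpoints coincide, so the diagonals bisect each other.}
\end{align*}
```

The geometric content disappears into coordinates and the proof reduces to checking an algebraic identity; on a harder problem those identities balloon, which is the "huge amount of calculation" Hu Sulin describes, tedious for a human contestant but cheap for a machine.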