
Gemini Wins Another Gold Medal, Outperforming Top University Students. The Era of AI Mathematical Reasoning Has Arrived.

新智元 2025-08-12 08:53
Gemini won a gold medal at the IMC university math competition, outperforming human college students by a large margin.

Gemini's Olympiad gold medal is well deserved! A doctoral student at ETH Zurich tested three Gemini variants on the International Mathematics Competition for University Students (IMC). Their performance far exceeded the gold-medal threshold of roughly the top 8% of contestants and was well above the level of ordinary university students.

Are university students' math skills inferior to AI?

Recently, on MathArena, Jasper Dekoninck, a doctoral student in the SRI Lab at ETH Zurich, launched a new evaluation: the International Mathematics Competition for University Students (IMC).

In the end, the LLM won with a high score, taking the top spot in this international mathematics competition.

Gemini far exceeds the level of ordinary university students

The International Mathematical Olympiad (IMO) has always been regarded by researchers as a touchstone for the mathematical reasoning ability of AI systems.

Not long ago, after the most recent IMO, Google, OpenAI, and others announced in quick succession that their LLMs had achieved gold-medal-level results.

However, because the winning AI systems are opaque and their results hard to verify, those gold medals drew considerable scrutiny and doubt.

This MathArena evaluation is the first to measure AI performance on an undergraduate-level mathematics competition. The goal was not only to confirm that AI really can earn an IMO gold medal, but also to check whether strong performance on high-school-level competitions such as the IMO carries over to university-level competitions.

This test evaluated three systems: Gemini Deep Think (the IMO 2025 gold medalist), Gemini 2.5 Pro (referred to below as Gemini Agent), and a Gemini 2.5 Pro best-of-32 baseline.
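For context, a best-of-N baseline simply samples N candidate solutions from the same model and keeps the one a selector rates highest. The sketch below illustrates that idea only; the function names and the stubbed generator and selector are assumptions for illustration, not MathArena's actual pipeline.

```python
import random

def best_of_n(generate, score, problem, n=32):
    """Sample n candidate solutions and keep the highest-scoring one.

    `generate` and `score` are placeholders: in a real pipeline they would
    call an LLM to draft a solution and a selector model to rate it. Here
    they are stubbed so the control flow is runnable.
    """
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=score)

# Stubbed generator/scorer, for illustration only.
generate = lambda problem: f"proof attempt #{random.randint(0, 10**6)} for: {problem}"
score = lambda solution: random.random()  # in practice, a selector model's rating

print(best_of_n(generate, score, "IMC 2025, Problem 1"))
```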

OpenAI's gold-medal model has not been released, so it could not be evaluated.

The results showed that all three systems scored extremely high, far above the top-8% gold-medal threshold.

Both Gemini Deep Think and Gemini Agent essentially solved every problem, with only a few minor errors, typically incomplete arguments in intermediate steps or incorrect appeals to known theorems.

Interestingly, Gemini Best-of-32 performed much better than it had at IMO 2025, making a major mistake on only one problem (P5). This may be because the IMC leans more heavily on knowledge, an environment in which large models tend to do well.

Three major conclusions were drawn from this test:

Conclusion 1: All three models scored highly on the IMC. Gemini Deep Think and Gemini Agent answered essentially every problem correctly, with scores comparable to those of top human university contestants.

Conclusion 2: Taking the quality and clarity of the proofs into account, the judges ranked the models Gemini Deep Think > Gemini Agent > Gemini Best-of-32.

Conclusion 3: Qualitative analysis showed that Gemini Deep Think stood out: its proofs were markedly clearer and more interesting than the other models', and it sometimes proposed genuinely interesting approaches where the other systems fell back on computation-heavy methods.

However, since this evaluation was an ad-hoc addition, its scale was small: each model was run only once per problem, and each solution was graded by only one judge.

What's the value of the IMC gold medal?

The International Mathematics Competition for University Students (IMC) is organized by University College London (UK) and hosted by the American University in Bulgaria. IMC 2025 was held from July 28 to August 3, 2025, in Blagoevgrad, Bulgaria.

The competition is open to students in their first through fourth year of undergraduate study. The upper age limit is 23, with exceptions considered case by case; there is no minimum age.

The problems cover algebra, analysis (real and complex), geometry, and combinatorics. The competition language is English.

The IMC runs over two days, with five problems per day, each worth 10 points.

IMC 2025 problems: https://www.imc-math.org.uk/?year=2025&item=problems

This evaluation used a method similar to the one used for the 2025 USA Mathematical Olympiad (USAMO) evaluation, with only a few adjustments.

Paper link: https://arxiv.org/abs/2503.21934

Two experienced judges were recruited to grade the solutions submitted by the models.

To avoid contamination, the evaluation began immediately after the IMC 2025 problems were released. Each judge independently drew up a grading rubric for their problems and scored the anonymized submissions out of 10 points.

For each problem, every model's submission was graded against the same rubric.
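As a rough illustration of the anonymized grading step (not the judges' actual tooling, which the write-up does not describe), the sketch below shows how model names could be hidden behind blind IDs before grading and unblinded afterwards; all names and data structures here are assumptions.

```python
import random

def anonymize(submissions):
    """Map model names to blind IDs so judges don't know which model wrote what.

    `submissions` maps model name -> solution text for one problem. The labels
    and structure are illustrative only.
    """
    names = list(submissions)
    random.shuffle(names)
    blind = {f"submission-{i + 1}": submissions[name] for i, name in enumerate(names)}
    key = {f"submission-{i + 1}": name for i, name in enumerate(names)}
    return blind, key

submissions = {
    "Gemini Deep Think": "Proof of P1 ...",
    "Gemini Agent": "Proof of P1 ...",
    "Gemini Best-of-32": "Proof of P1 ...",
}
blind, key = anonymize(submissions)
# Judges grade `blind` on a 0-10 rubric; `key` is used only to unblind afterwards.
print(list(blind), key)
```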

Solving time drives the compute and cost of running large models, so Jasper Dekoninck also reported how long each of the three systems took.

Due to limited time, other models have not been tested for now, but they would quite likely also do well in this competition.

This new evaluation matters for genuinely testing model capabilities, and some netizens are already eager to see how o3-Pro, Claude, and Kimi K2 would fare on the IMC.

Result analysis

In addition to quantitative scoring, the researchers also extracted many qualitative observations and insights from the model outputs to help understand the performance of each model in mathematical reasoning tasks more comprehensively.

Gemini Deep Think: The clearest prover

In mathematical reasoning, clear exposition is not only what graders score; it also reflects how deeply a model understands the problem. Although many of Gemini Best-of-32's solutions were technically correct, their write-ups were often muddled, lacking clear structure and logical organization, which made the line of reasoning hard to follow.

Gemini Agent, by contrast, was better organized, but its proofs were often overly long and dense. This verbosity may stem from its self-verifying feedback loop, which leads the model to over-explain each step.

Gemini Deep Think did better still: its proofs were concise, well structured, and sensibly paced, allotting an appropriate level of detail to each step and making its reasoning easy to follow.

Gemini Deep Think: Demonstrated true original thinking

A common tactic of AI models is "bashing": grinding through heavy algebraic computation instead of finding a mathematical insight. This was especially evident in the solutions of Gemini Agent and Gemini Best-of-32, notably on problem 9.

However, the strategy of Gemini Deep Think was more elegant and innovative:

Its proof of problem 7 stood out for its simplicity and elegance, far surpassing the other models. On problem 9, it offered an idea simpler and more illuminating than the official solution. On problem 10, it deployed more advanced mathematical tools and derived a stronger upper bound on a key quantity, though it scored only 7 of 10 points there because it skipped some reasoning details.

Official solutions (day 2): https://www.imc-math.org.uk/imc2025/imc2025-day2-solutions.pdf

Ability to mobilize advanced mathematical knowledge

The models' performance on problem 5 also deserves attention. The problem asked for a proof of an inequality about a function which, though unnamed in the statement, is in fact the well-known Landau function.

Remarkably, all three models correctly identified the function and used its known properties to build a complete proof, showing real depth and accuracy in how they mobilize mathematical knowledge.
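For context, the Landau function g(n) is the largest order of any permutation of n elements, i.e. the maximum least common multiple over all partitions of n. The brute-force sketch below just computes small values of g(n) to make that definition concrete; it is not the competition problem itself or any model's proof.

```python
from math import lcm

def landau(n):
    """Landau's function g(n): the largest lcm over all partitions of n,
    equivalently the maximum order of a permutation of n elements."""
    def best(remaining, max_part):
        # Max lcm of a partition of `remaining` using parts <= max_part.
        if remaining == 0:
            return 1
        result = 1
        for part in range(min(remaining, max_part), 0, -1):
            result = max(result, lcm(part, best(remaining - part, part)))
        return result
    return best(n, n)

# First values g(0)..g(10): 1, 1, 2, 3, 4, 6, 6, 12, 15, 20, 30
print([landau(n) for n in range(11)])
```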

Netizens: o3 can finish the exam in just 10 minutes

As for difficulty, Jasper Dekoninck judged the hardest IMC problems to be comparable to the hardest problems one encounters during undergraduate studies.

Netizen Dmitry Rybin showed great enthusiasm for the test: "Great, I wanted to send you the questions, but you've already done it."

He also tested all the IMC 2025 questions with o3, and it finished 10 questions in about ten minutes.