Even Terence Tao was shocked. In its first appearance at the "AI Math Olympiad", o3 took the championship by a wide margin, with the open-source camp trailing by only 5 points.
The "AI Olympiad Cup" has restarted! OpenAI's o3 made its debut in the competition. With full computing power, it directly stunned the audience with an astonishing score of 47 out of 50. It's worth mentioning that the combined score of the top five models was only 5 points behind o3, and the gap between open - source and closed - source models has narrowed again.
The NVIDIA team (NemoSkills) won first place in the second "AI Olympiad" competition!
This time, the AIMO2 organizing committee re-ran the competition questions, and OpenAI's o3, competing for the first time, achieved the best result.
Terence Tao noted excitedly that this competition had previously been limited to open-source models, with quite restricted computing resources.
Fortunately, in the second round of the AIMO competition, NemoSkills, the Tsinghua-Microsoft team imagination-research, and o3 all competed at the same time.
The test was run under two conditions: one provided comparable computing resources, and the other allowed unrestricted computing power.
As expected, the more computing power was provided, the better the model performed.
With sufficient computing power, OpenAI's o3 scored 47 points out of 50, and when each question was allowed two attempts, it even reached a perfect score.
Another interesting finding is that with the same computing resources, the difference between open-source and commercial models is actually not significant.
Today, the complete research test report was officially released.
Report link: https://aimoprize.com/updates/2025-09-05-the-gap-is-shrinking
Let's take a look at o3's performance in the actual test.
The toughest Olympiad problems, AI's top performer
For scientific reproducibility, it is crucial that open-source models be widely available. But how big is the performance gap between open-source and closed-source models?
In the context of mathematical reasoning, this evaluation provides a more detailed picture:
In Olympiad-level mathematical reasoning, the gap between commercial and open-source AI is narrowing.
Open-source models are about to catch up with commercial models.
Last year, Epoch AI estimated that the best open-source models lag the best closed-source models by about one year in both performance and training compute.
The Artificial Intelligence Mathematical Olympiad (AIMO) was founded in 2023 to promote the development of open-source AI models capable of advanced mathematical reasoning.
Competition link: https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/overview
In April 2025, the second AIMO Progress Prize (AIMO2) came to an end.
The questions in this stage were harder still, pitched mainly at national Olympiad level (such as the British Mathematical Olympiad, BMO, and the USA Mathematical Olympiad, USAMO).
The top five teams and their scores on the private leaderboard of AIMO2 are as follows (scores on the public leaderboard are in parentheses):
- NemoSkills: 34/50 (Public leaderboard: 33/50)
- imagination-research: 31/50 (Public leaderboard: 34/50)
- Aliev: 30/50 (Public leaderboard: 28/50)
- sravn: 29/50 (Public leaderboard: 25/50)
- usernam: 29/50 (Public leaderboard: 25/50)
The "public leaderboard" on Kaggle is visible to participants throughout the competition. To avoid data leakage, the data is not made public.
Since repeated evaluations on a single leaderboard (even if the questions are not made public) may indirectly leak information, Kaggle also provides a "private leaderboard" with questions of similar difficulty. It is used for a one-time evaluation of the models at the end of the competition to determine the final rankings.
Considering the significant increase in the difficulty of the questions compared to AIMO1, such scores are quite outstanding.
However, an interesting and crucial question remained: what results would closed-source AI models achieve on the AIMO competition questions?
In response, AIMO collaborated with OpenAI and others on an experiment: an unreleased version of OpenAI's o3 model, o3-preview, was run on the 50 Olympiad-level questions from AIMO2's public leaderboard.
This pitted the general-purpose model o3-preview against the top two open-source models from the AIMO2 competition, both specifically optimized for mathematics.
In addition, a reference system called "AIMO2-combined" was introduced this time:
The results of the best models from the more than 2000 participating Kaggle teams were pooled: if at least one model solved a question, that question counted as solved, as sketched below.
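To make the pooling rule concrete, here is a minimal sketch of it in Python; the team names and solve sets are invented for illustration and are not the real competition data:

```python
# Minimal sketch of the "AIMO2-combined" pooling rule described above.
# The team names and solve sets below are illustrative, not real data.
from typing import Dict, Set

def combined_score(solved_by_team: Dict[str, Set[int]], num_questions: int) -> int:
    """A question counts as solved if at least one team's best model solved it."""
    solved_union = set().union(*solved_by_team.values())
    return len(solved_union & set(range(num_questions)))

# Hypothetical example: 5 questions, 3 teams.
teams = {
    "team_a": {0, 1},
    "team_b": {1, 2},
    "team_c": {4},
}
print(combined_score(teams, 5))  # -> 4 (questions 0, 1, 2, and 4 are covered)
```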
In absolute terms, setting aside the cost of computing power, the high-computing-power version of o3-preview comes close to "saturating" the AIMO benchmark, even though it is a general-purpose model not specifically optimized for mathematics.
This result is impressive and beyond expectations.
This shows that in terms of reasoning performance, there is still a significant gap between the strongest open-source models and the strongest closed-source models.
However, once computing power cost is taken into account, the gap narrows significantly.
On the 50-question benchmark, the average cost per question for a single run of the low-computing-power version of o3-preview is slightly less than $1.
This is more expensive than running all five winning models on a self-owned 8×H100 machine, and roughly equivalent to running a single winning model on a commercially rented 8×H100 machine. An exact price comparison is difficult, but the costs are of the same order of magnitude.
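As a rough back-of-envelope check, the sketch below puts both options at the same order of magnitude. Only the under-$1-per-question figure comes from the report; the GPU rental rate and run time are assumptions made purely for illustration:

```python
# Back-of-envelope cost comparison. The API cost per question is from the
# report (just under $1); the rental rate and run time are assumed values.
API_COST_PER_QUESTION = 1.0   # USD, upper bound per the report
NUM_QUESTIONS = 50
api_run_cost = API_COST_PER_QUESTION * NUM_QUESTIONS  # ~$50 per full run

GPU_HOURLY_RATE = 2.5   # USD per H100-hour, assumed typical rental price
NUM_GPUS = 8
RUN_HOURS = 2.5         # assumed wall-clock time for one model on 50 questions
gpu_run_cost = GPU_HOURLY_RATE * NUM_GPUS * RUN_HOURS  # ~$50

print(f"o3-preview (low), one full run: ~${api_run_cost:.0f}")
print(f"one winning model on rented 8xH100: ~${gpu_run_cost:.0f}")
```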
The combined score of the original top five AIMO2 models was 38/50, 5 points behind the low-computing-power version of o3-preview. This suggests that once computing power is roughly matched, and within the limits of this 50-question benchmark, reasoning performance is broadly similar.
Next, the performance of o3-preview, the performance of the champion and runner-up teams, and the overall performance of AIMO2-combined are summarized in turn.
With sufficient computing power, o3 nearly scores full marks in one shot
AIMO ran o3-preview under three different parameter settings: low computing power, medium computing power, and high computing power.
These settings affect o3-preview's internal reasoning effort and incur different hardware costs.
Note that, conceptually, the low- and medium-computing-power versions are the same base model run under two different parameter settings.
The high-computing-power version also uses a learned scoring function to select the best answer.
This "sample - and - rank" mechanism at a fixed sampling rate brings better performance.
Similar to the Kaggle competition, the test was conducted under strict conditions to ensure that the test set on the public leaderboard remained free of data contamination and information leakage.
Each question was attempted only once.
The low- and medium-computing-power versions each returned a single answer, while the high-computing-power version, using the sample-and-rank mechanism, returned several answers together with their scores.
The OpenAI model's scores under the different computing-power versions are as follows:
- o3-preview (high-computing-power version, counting the first- and second-ranked answers): 50/50
- o3-preview (high-computing-power version, counting only the first-ranked answer): 47/50
- o3-preview (medium-computing-power version): 46/50
- o3-preview (low-computing-power version): 43/50
Even when NemoSkills' AIMO2 champion model was run on hardware stronger than Kaggle's, the low-computing-power version of o3-preview still solved 7 more questions.
The medium-computing-power version solved everything the low-computing-power version did, plus 3 additional questions, for a total of 46/50.
The high-computing-power version scored 47/50 when only the first-ranked answer was counted, and 50/50 when the second-ranked answer was also counted.
This shows that, in principle, o3-preview is able to generate correct answers for all 50 questions.
This is comparable to the combined score of the best models of all of AIMO2's more than 2000 Kaggle teams, which together also solved 47/50 questions.
With 8 H100 GPUs, NVIDIA's AI only improves by 1 point
The first- and second-place teams, NemoSkills and imagination-research, then took part in a re-evaluation.
To better understand the full potential of the models, the teams were allowed access to a machine with 8×H100 GPUs and a total of 640GB of video memory.
In AIMO2, to enable the models to run on the Kaggle platform, the competition imposed resource restrictions on the participating teams:
Each team was provided with 4 L4 GPUs, with a total of 96 GB of video memory (VRAM).
In this evaluation, the organizing committee removed the resource restrictions that the teams had to adapt to for Kaggle, allowing the models to fully demonstrate their capabilities on the 50 questions on the public leaderboard.
What were the final results?
· NemoSkills scored 35/50, an improvement from its 33/50 on the Kaggle public leaderboard;
· imagination-research also scored 35/50, an improvement from its 34/50 on the Kaggle public leaderboard.
The gap between open-source and closed-source models narrows again
However, caution is needed when reporting and comparing scores.
The 47/50 score of AIMO2-combined is roughly a "pass@2k+"-type score (i.e., a multi-sample pass rate):
Across more than 2000 attempts per question, a question counts as passed as long as at least one answer is correct; no further ranking is done.
More generally, the commonly used "pass@n" score means that a (fixed, black-box) model is queried n times; if the correct answer appears among these n outputs, the question counts as solved (even if the model runs more samples internally).
No model state may be retained between queries.
Of course, the models behind the more than 2000 submissions are not the same. Strictly speaking, "pass@n" requires the same underlying model, so "pass@2k+" is only an approximate score.
The low- and medium-computing-power scores of o3-preview, as well as the 47/50 score of the high-computing-power version, are all "pass@1"-type scores.
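For a concrete illustration of the distinction, here is a minimal sketch of pass@n counting over toy data (not the real attempt logs):

```python
# pass@n sketch: a question passes if any of the first n attempts is correct;
# the benchmark score is the number of questions passed. Toy data only.
from typing import List

def pass_at_n(attempts_per_question: List[List[bool]], n: int) -> int:
    """Count questions where at least one of the first n attempts succeeded."""
    return sum(any(attempts[:n]) for attempts in attempts_per_question)

# Toy example: 3 questions, 4 attempts each.
results = [
    [False, True, False, False],   # solved on the 2nd attempt
    [False, False, False, False],  # never solved
    [True, True, False, True],     # solved on the 1st attempt
]
print(pass_at_n(results, 1))  # pass@1 -> 1
print(pass_at_n(results, 4))  # pass@4 -> 2
```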
The 7 questions that the low-computing-power version of o3-preview failed to solve break down as 2 geometry questions, 2 algebra questions, and 3 combinatorics questions.
Although o3-preview performed very strongly overall, one question, named "RUNNER" (see the chart below), stood out:
This question was solved by NemoSkills, but the low- and medium-computing-power versions of o3-preview failed to solve it, and in the high-computing-power version the correct answer was only ranked second.
Conversely, another question, "EIGHTS", was solved by the high-computing-power version's first-ranked answer.
This question was not solved by any of the top five models of AIMO2.