Large models chase the stars and the sea: GPT and Gemini win gold medals at the International Olympiad on Astronomy and Astrophysics
Artificial intelligence really is advancing by leaps and bounds. Just this morning, a netizen quipped: we've gone 0 days without an attention-grabbing breakthrough in AI.
Recall that three months ago, OpenAI officially announced that its reasoning model had won a gold medal at the International Mathematical Olympiad (IMO).
Now it seems that large models not only show strong reasoning and generalization in mathematics but also shine in many other scientific fields.
Notably, today's top large models can post impressive results across a range of Olympiads.
Just recently, a newly published paper used the International Olympiad on Astronomy and Astrophysics (IOAA) as a benchmark and showed that two models, GPT-5 and Gemini 2.5 Pro, can achieve gold-medal results on its astronomy and astrophysics problems.
Greg Brockman, President and Co-founder of OpenAI, retweeted the work, so excited that he even misspelled the name of GPT:
One day, when humanity ventures into the vastness of the universe, large AI models will have left their mark there too.
- Paper title: Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)
- Paper link: https://arxiv.org/abs/2510.05016
Why Choose IOAA
The emergence of large language models has opened new possibilities for artificial intelligence in scientific research, especially in astronomy and astrophysics. While traditional astronomical machine-learning methods perform well on pattern-recognition tasks such as object classification and anomaly detection, they often lack the generality and complex reasoning needed to solve open-ended scientific problems.
Existing benchmarks for evaluating LLMs in astronomy, such as AstroBench and Astro-QA, focus mainly on simple question-answering formats, testing astronomical knowledge through multiple-choice or short-answer questions. These evaluations fail to assess the complex reasoning, creative problem-solving, and extended derivations essential to real astronomical research. This study addresses that gap by introducing a more rigorous and comprehensive evaluation framework.
The researchers selected questions from the International Olympiad on Astronomy and Astrophysics (IOAA) from 2022 to 2025 as the main benchmark, a choice based on three key factors:
First, unlike existing benchmarks such as AstroMLab, AstroBench, and Astro-QA, which rely mainly on multiple-choice, short-answer, or true/false questions to test astronomical knowledge, IOAA problems have higher ecological validity: they probe the complex reasoning, creative problem-solving, and multi-step derivations required in actual astronomical research.
Second, according to the official syllabus, IOAA questions cover a wide range of astronomical topics, including cosmology, spherical trigonometry, stellar astrophysics, celestial mechanics, photometry, and observational instrumentation, thus ensuring comprehensive evaluation.
Finally, IOAA combines theoretical physics, observational constraints, and real astronomical data with mathematical derivation, providing an evaluation setting distinct from other Olympiads such as the IMO, IPhO, and IOI, and thus a way to test LLMs' integrated scientific problem-solving.
The evaluation focuses on two components of IOAA: theory questions (49 in total) and data-analysis questions (8 in total). Theory questions fall into two categories: Category I (geometry/spatial, requiring celestial-sphere geometry and spherical trigonometry) and Category II (physics/mathematics, focusing on astrophysical calculations without geometric visualization). Because LLMs operate purely digitally, the observational round is excluded.
Gold-Medal Results
Performance of LLMs on IOAA theory and data-analysis questions across difficulty categories. All scores are normalized percentages of the total available points.
Theoretical Exam
As the table shows, GPT-5 and Gemini 2.5 Pro stood out most on the theory exam, leading the other models by 7-25 percentage points. GPT-5 achieved the highest scores in 2022 (93.0%), 2023 (89.6%), and 2025 (86.8%), while Gemini 2.5 Pro ranked first in 2024 with 83.0%.
Despite its overall strong performance, GPT-5 scored better on difficult questions than on easy and medium ones. The researchers attribute this seemingly anomalous pattern mainly to three factors:
1. The small number of questions at each difficulty level produces natural fluctuations. There are only 10 easy and 11 medium questions, worth roughly 185 and 151 points respectively out of a 1,200-point total, so just a few mistakes can noticeably shift the score ratio in those ranges.
2. GPT-5 made several critical errors on the 2024 questions, many of them concentrated on problems requiring geometric reasoning and spatial visualization (see Section 3.2 of the paper).
3. GPT-5 occasionally slipped on astrophysical concept questions. For example, on Question 9 of the 2024 exam (classified as easy), a conceptual error combined with a calculation error cost it 18 points, nearly 10% of the total available on easy questions (a quick check of this arithmetic follows the list).
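To make the scale of that swing concrete, here is a minimal back-of-the-envelope check in Python; it is a sketch using only the approximate point totals quoted above, not code from the paper:

```python
# Back-of-the-envelope check of point 3 above: the easy questions are worth
# only about 185 of the roughly 1,200 total points, so an 18-point loss on a
# single easy question already erases close to 10% of that difficulty bucket.

easy_total = 185      # approximate points available on easy questions
medium_total = 151    # approximate points available on medium questions
exam_total = 1200     # approximate total points across the 2022-2025 theory exams

points_lost = 18      # GPT-5's loss on 2024 theory Question 9 (an easy question)

print(f"share of easy-question points lost: {points_lost / easy_total:.1%}")  # ~9.7%
print(f"share of overall points lost:       {points_lost / exam_total:.1%}")  # ~1.5%
```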
Other models were also competitive: OpenAI o3 scored 77.5% overall, consistently leading the Claude series by 13-17 percentage points, with Claude Opus 4.1 at 64.7% and Claude Sonnet 4 at 60.6%, though their performance declined as difficulty increased. Although these three models perform similarly, even impressively, on simpler multiple-choice benchmarks such as AstroMLab, this evaluation revealed significant gaps in their ability to solve complex problems. The result suggests that truly assessing the research potential of LLMs in astronomy requires going beyond knowledge-recall tasks toward a more comprehensive evaluation of capabilities.
Data-Analysis Exam
Although the LLMs approached top human performance on the theory exam, the data-analysis exam reveals their fine-grained strengths and limitations more clearly. GPT-5 averaged 88.5% on the data-analysis part, higher than its 84.2% on the theory exam. This improvement stands in sharp contrast to the other models, whose data-analysis scores generally dropped 10-15 percentage points relative to the theory questions.
The divergence stems largely from the nature of data-analysis questions, which depend heavily on reading figures, interpreting curves, and reasoning over data visualizations. GPT-5's stronger multimodal understanding and markedly lower error rates in image analysis and plot-based reasoning directly underpin its superior performance.
To push LLMs further toward research-grade agents in astrophysics, the results underscore the need, beyond aggregate scoring, for ecologically valid, multimodal data-analysis benchmarks that comprehensively test models' problem-solving in realistic research workflows.
Comparison with Human Results
To put the LLM results in context, the researchers compared them with human contestants under IOAA's medal-awarding criteria. Medals are assigned relative to the median score (computed as the sum of the theory, data-analysis, and observational parts): scores between 100% and 130% of the median earn bronze, 130% to 160% earn silver, and above 160% earn gold. Since the evaluation excludes observational questions, the corresponding medal thresholds were recomputed separately for the theory exam and the data-analysis exam.
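As a rough illustration of that medal rule (a sketch, not code from the paper; the function name and the exact boundary handling are assumptions), the banding can be written as:

```python
def medal(score: float, median: float) -> str:
    """Map a score to an IOAA medal band, given the contestants' median score.

    Thresholds follow the rule described above: 100%-130% of the median is
    bronze, 130%-160% is silver, and above 160% is gold. How the exact
    boundaries are treated is an assumption, since the text does not say
    which band the endpoints fall into.
    """
    ratio = score / median
    if ratio > 1.60:
        return "gold"
    if ratio >= 1.30:
        return "silver"
    if ratio >= 1.00:
        return "bronze"
    return "no medal"

# Example: a score at twice the median comfortably clears the gold threshold.
print(medal(score=200.0, median=100.0))  # -> gold
```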
Most LLMs scored above the gold-medal threshold; the only exception was Claude Sonnet 4, which earned only a silver medal on the 2023 exam. Notably, GPT-5 outperformed the best human IOAA contestants in 2022, 2023, and 2025, and Gemini 2.5 Pro reached the same level in 2022 and 2023.
Comparison of LLM and human contestant performance on the IOAA theory exam (2022-2025).
Comparison of LLM and human contestant performance on the IOAA data-analysis exam (2022-2025).
Model performance by question category on the IOAA theory exam. Category I covers geometry/spatial questions; Category II covers physics/mathematics questions. All scores are percentages.
Error Analysis
On the theory exam, the large language models performed markedly better on Category II (physics/mathematics) questions (67-91% accuracy) than on Category I (geometry/spatial) questions (49-78% accuracy), a gap of 15-26 percentage points.
The most common error type was conceptual: wrong solution approaches, misapplied formulas, and flawed reasoning, pointing to a fundamental challenge in achieving deep physical understanding. Geometric or spatial reasoning was the second-largest source of errors; the models struggled particularly with spherical trigonometry, timekeeping systems, and 3D visualization.
On the data-analysis exam, errors were spread more evenly across categories. The main failure modes were plotting and chart/image reading, which were especially prominent for OpenAI o3 and the Claude models. Because these questions involve heavy computation over large datasets, calculation errors were also more common than on the theory exam.
Distribution of lost points by error type: (a) IOAA theory exams, 2022-2025; (b) IOAA data-analysis exams, 2022-2025.
This article is from the WeChat official account "Machine Intelligence", edited by Leng Mao, and published by 36Kr with authorization.