The Math Olympiad Gold Medal Is Just the Prologue: OpenAI and Google Have Proved the Forecasters Wrong, and the AI Wave Is Unstoppable

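Xinzhiyuan (新智元), 2025-09-03 20:14
In 2022, experts were confident that AI had less than a 10% chance of winning an IMO gold medal by 2025. Yet in just three years, LLMs from both OpenAI and Google DeepMind took gold, not only breaking records but also heralding the arrival of the era of collective intelligence. AI's progress is staggering, and the transformation is unstoppable.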

In 2022, prognosticators declared that by 2025, there was a 90% probability that AI would not win a gold medal at the International Mathematical Olympiad (IMO).

They spoke with such certainty and confidence.

However, just three years later, OpenAI and Google DeepMind shattered these pessimistic predictions:

Large Language Models (LLMs) not only won the gold medal ahead of schedule but also exceeded all expectations regarding the boundaries of AI capabilities.

From language generation to logical reasoning, and from general abilities to professional competitions, generative AI is crossing every “intellectual barrier” set by humans at an astonishing speed.

The more off-base the predictions were, the more shocking AI's performance became.

Now, it's almost certain that the development speed of AI far exceeds the mainstream expectations of the past few years.

The great transformation has just begun.

The Prognosticators' Collective Failure

Just recently, Ethan Mollick, a professor at the University of Pennsylvania's Wharton School and the co-director of the Generative Artificial Intelligence Lab, stated with certainty: In the past, people underestimated the development speed of AI.

He gave an example:

In 2022, the Forecasting Research Institute invited 169 top forecasting experts and scholars to evaluate the progress of AI.

At that time, super-forecasters and domain experts put the probability that AI would win a gold medal at the International Mathematical Olympiad by 2025 at only 2.3% and 8.6%, respectively.

However, reality slapped them in the face: Google DeepMind's Gemini and OpenAI's ChatGPT, two general-purpose large models, both won gold medals at the 2025 International Mathematical Olympiad.

Google DeepMind and OpenAI competed for the first “IMO gold medal” in AI history: OpenAI released its results first, but Google DeepMind's model results were officially certified by the IMO.

OpenAI announced first and generated a great deal of hype and attention.

It's reported that out of respect for the participating students, Google waited until the IMO officially certified the results before announcing them.

This is a historic moment for AI, marking the great progress of AI in the past decade.

Large language models, originally designed for language generation, have far exceeded most people's expectations in mathematics.

Noam Brown, a research scientist at OpenAI, believes that the predictions at that time would have been even more pessimistic about LLMs:

It should be noted that these predictions were about “any” AI system winning an Olympiad gold medal. If the question had been about “large language models” (a type of general AI system), the probability in their eyes would have been even lower.

Moreover, just before the release of the International Mathematical Olympiad results, MathArena tested the large models available at that time, and none of them reached even bronze-medal level.

Soon after, the news came that AI had won a gold medal.

In mathematical reasoning, LLMs have been consistently underestimated.

The Forecasting Research Institute admitted that AI's performance in the International Mathematical Olympiad was amazing.

The inaccurate predictions are not accidental; the paradigm has changed.

In fact, in the three standard AI benchmark tests of MATH, MMLU, and QuALITY, the predictions almost completely failed.

In the MATH dataset benchmark test, GPT-4 Turbo reached 87.82% in April 2024, while domain experts and super-forecasters thought the probability of reaching this level by June 30, 2024, was 21.4% and 9.3% respectively.

On MMLU, GPT-4o and Claude 3.5 Sonnet reached 88.7% in mid-2024, while the predicted probabilities were only 25.0% and 7.2%.

On the QuALITY Hard subset, RAPTOR + GPT-4 scored 69.3 in June 2023, a full year ahead of the deadline.

Both domain experts and super-forecasters misjudged the speed and direction of AI development.
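One common way to quantify how far off such probability forecasts were is the Brier score: the squared gap between the stated probability and the outcome, where 0 is perfect and 1 is the worst possible. The sketch below is an illustration, not part of the Forecasting Research Institute's own analysis. It applies that formula to the MATH and MMLU probabilities quoted above, assuming the MMLU figures are listed in the same order as the MATH figures (domain experts first, then super-forecasters). The names `forecasts`, `brier_score`, and `scores` are hypothetical.

```python
# Probabilities quoted in the article for outcomes that did occur.
# Assumption: the MMLU pair follows the same order as the MATH pair
# (domain experts first, super-forecasters second).
forecasts = {
    "MATH": {"domain experts": 0.214, "super-forecasters": 0.093},
    "MMLU": {"domain experts": 0.250, "super-forecasters": 0.072},
}


def brier_score(probability: float, occurred: bool) -> float:
    """Squared error between a forecast probability and the 0/1 outcome."""
    outcome = 1.0 if occurred else 0.0
    return (probability - outcome) ** 2


# Every benchmark threshold listed here was in fact reached, so occurred=True.
scores = {
    benchmark: {group: brier_score(p, True) for group, p in groups.items()}
    for benchmark, groups in forecasts.items()
}

for benchmark, groups in scores.items():
    for group, score in groups.items():
        print(f"{benchmark:5s} {group:18s} Brier = {score:.3f}")
```

For comparison, a forecaster who had simply answered 50% would score 0.25 on each question, so every forecast here did worse than a coin flip.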

Both groups underestimated the maximum computing power of AI by the end of 2024. The prediction of super-forecasters was only 1/5 of the actual maximum. Meanwhile, they overestimated the maximum size of machine-learning models:

Experts predicted that the parameter scale would reach 1.00E+14 (100 trillion),

Super - forecasters expected 4.00E+14 (400 trillion),

These figures are 10 and 40 times higher, respectively, than the largest parameter scale preliminarily confirmed so far, 1.00E+13 (10 trillion).
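Taking the three parameter counts quoted above at face value, the size of each overestimate can be checked with a short calculation. This is a minimal sketch, and the variable and function names are hypothetical. It shows that the experts' figure is 10 times the confirmed scale, while the super-forecasters' figure is 40 times.

```python
# Parameter counts as quoted in the article.
expert_forecast = 10**14               # domain experts: 1.00E+14 (100 trillion)
superforecaster_forecast = 4 * 10**14  # super-forecasters: 4.00E+14 (400 trillion)
confirmed_scale = 10**13               # preliminarily confirmed: 1.00E+13 (10 trillion)


def overestimate_factor(forecast: int, actual: int) -> float:
    """Return how many times larger the forecast is than the actual value."""
    return forecast / actual


expert_factor = overestimate_factor(expert_forecast, confirmed_scale)
superforecaster_factor = overestimate_factor(superforecaster_forecast, confirmed_scale)

print(f"Domain experts overshot by {expert_factor:.0f}x")
print(f"Super-forecasters overshot by {superforecaster_factor:.0f}x")
```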

Similarly, McKinsey released a report showing the predictions of a panel of artificial-intelligence experts in 2017 (before LLMs).

For example, McKinsey predicted that AI would reach the average human creativity level in 2037. But in fact, this goal was achieved in 2023.

Regarding the prediction of reaching the top 1/4 creativity level, McKinsey originally estimated it would take until 2055, but this goal has been achieved 30 years ahead of schedule.

Because of the development of generative artificial intelligence, technological performance is now expected to match the human median, and then the top 25% of human performance, across a wide range of capabilities sooner than previously estimated.

As another example, the McKinsey Global Institute (MGI) previously thought that in natural-language understanding, technology might reach the level equivalent to the human median as early as 2027. But in the new analysis, this time point has been advanced to 2023.

In its 2025 report, McKinsey stated that in the past two years, AI has made rapid progress, and many important AI innovations have emerged.

Amazed by the rapid progress of AI in reality, netizen Aravind Sunda exclaimed:

The speed of change is crazy. What seemed impossible in 2022 is now within reach.

On November 30, 2022, ChatGPT was officially launched. Before that, generative models or GenAI mainly referred to image and video generative models, and OpenAI was still exploring the application scenarios of GPT.

So, ChatGPT might be the biggest variable, as netizen Mahaoo said:

Almost all predictions made before the emergence of ChatGPT and GPT-4 were doomed to seriously underestimate the actual progress of AI. The appearance of these models allowed the outside world to truly see the potential and speed of AI for the first time.

However, LLMs exhibit what is known as jagged intelligence: they perform excellently in some areas but terribly in others.