
Insider Story: OpenAI's Model Admits It Can't Solve Problem 6, and a Three-Person Team Wins IMO Gold in Two Months

新智元 · 2025-08-12 08:56
OpenAI's two-month breakthrough: AI wins a gold medal at the IMO, and general-purpose technology reduces hallucinations.

In just two months, OpenAI took AI from struggling with grade-school math problems to gold-medal level at the International Mathematical Olympiad (IMO). Behind this is a breakthrough in general-purpose AI technology.

Can OpenAI's ChatGPT really win a gold medal at the International Mathematical Olympiad? Or is OpenAI just flattering itself? What's the real story?

The core team behind OpenAI's IMO gold medal, Alexander Wei, Noam Brown, and Sheryl Hsu, appeared on Sequoia Capital's Training Data podcast and shared how they got AI to an IMO gold medal in two months 🥇.

For one thing, not everyone inside OpenAI was optimistic: one researcher even bet against the model at odds as high as 2:1, but eventually withdrew the bet because he "didn't want to hurt morale."

From 1 to 5 a.m. on competition day, Noam Brown caught a short break while Alexander Wei frantically checked the proofs the model had generated 🙈.

They also explained how they decided whether the AI had earned a gold medal. To grade the results, they brought in external IMO medalists: each proof was graded by three medalists, who then reached a consensus on its correctness. On that basis, they concluded that the AI was genuinely capable of winning IMO gold.

They also revealed that the proofs read like an "alien language" and were hard to follow. Although they could have polished them, they chose to publish the raw output for the sake of transparency.

If you just want the highlights, start with the key points below; if you want the behind-the-scenes story, read on.

Key Points at a Glance

In just two months, this elite three-person team at OpenAI achieved a goal the entire AI field had failed to reach for years: gold-medal-level performance on the hardest problems of the International Mathematical Olympiad.

This is one of the most important milestones on the road to Artificial Superintelligence (ASI).

What makes this breakthrough particularly remarkable is not just the AI's mathematical ability but the architecture behind it: a general-purpose technique for scaling test-time compute and handling hard-to-verify tasks, with applications far beyond competition mathematics.

Just a year ago, AI could sustain mathematical reasoning for only about a tenth of a minute. Now there are AI systems capable of reasoning continuously for up to 100 minutes.

The hope for superintelligence is that, as reasoning time extends to thousands or even hundreds of thousands of hours, AI may begin to tackle humanity's greatest unsolved problems in mathematics, science, and many other fields.

The team also described their distinctive approach: on hard-to-verify tasks, they use general reinforcement learning techniques rather than formal verification tools.

The new model shows striking self-awareness: it openly admits that it cannot solve Problem 6. The episode also reveals the huge gap between solving competition problems and making real breakthroughs in mathematical research.

Problem 6 of IMO 2025 was the hardest problem of the competition. It goes roughly as follows:

Consider a 2025×2025 grid of unit squares. Matilda wants to place some rectangular tiles on the grid. The tiles may differ in size, but each side of every tile must lie along grid lines, and each unit square may be covered by at most one tile.

Determine the minimum number of tiles Matilda needs to place so that each row and each column of the grid contains exactly one unit square not covered by any tile.
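To make the constraints concrete, here is a small Python checker, my own illustration rather than anything from the team, that validates a candidate tiling against the rules above:

```python
from itertools import product

def check_tiling(n, tiles):
    """Verify Matilda's constraints on an n x n grid.

    Each tile is (r1, c1, r2, c2): an axis-aligned rectangle with
    inclusive 0-based corners. Valid iff no unit square is covered
    twice and every row and column has exactly one uncovered square.
    """
    cover = [[0] * n for _ in range(n)]
    for r1, c1, r2, c2 in tiles:
        for r, c in product(range(r1, r2 + 1), range(c1, c2 + 1)):
            cover[r][c] += 1
            if cover[r][c] > 1:  # tiles must not overlap
                return False
    rows_ok = all(row.count(0) == 1 for row in cover)
    cols_ok = all(sum(1 for r in range(n) if cover[r][c] == 0) == 1
                  for c in range(n))
    return rows_ok and cols_ok

# Tiny 2x2 example: two 1x1 tiles leave (0,0) and (1,1) uncovered.
print(check_tiling(2, [(0, 1, 0, 1), (1, 0, 1, 0)]))  # True
```

Checking a tiling is easy; the hard part of the problem is proving the minimum over all tilings of the full 2025×2025 grid.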

The key highlights are as follows:

(1) General technology outperforms specialized solutions.

(2) A small team can achieve great things: the core team consists of just three researchers, who finished the work in a final two-month sprint.

(3) Self-awareness improves AI reliability: faced with the hardest problem, the model admits it cannot solve it rather than emitting a plausible-looking but incorrect answer.

(4) Scaling test-time compute enables deep reasoning: the key to the breakthrough was extending reasoning compute from seconds to hours, letting the model think much harder about complex problems.

(5) The competition is the starting point, not the end.

A group photo of OpenAI shared by Sheryl Hsu (the woman in the middle of the first row)

A Two-Month Miracle

The International Mathematical Olympiad (IMO) is the world's top mathematics competition for high school students. Its problems are so hard that human contestants train intensively for years.

Even the genius mathematician Terence Tao won only a bronze medal at his first IMO, at age 10; gold came two years later, at his third IMO.

But this small OpenAI team took only two months!

What's their secret weapon?

On Sequoia Capital's podcast "Training Data", host Sonya Huang revealed the answer: they used a technique called a "multi-agent system."

Simply put, it lets multiple AI "assistants" work simultaneously, like a super-team with a clear division of labor.

This method enables their model to solve complex problems in a short time.
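The podcast doesn't detail the exact orchestration, but a common pattern for spending parallel test-time compute is best-of-n sampling: launch several independent attempts and keep the best-scored one. A minimal sketch, where solve_attempt and its scoring are hypothetical stand-ins rather than OpenAI's system:

```python
import concurrent.futures as cf
import random

def solve_attempt(problem: str, seed: int) -> tuple[str, float]:
    # Hypothetical stand-in for one agent's independent attempt;
    # a real system would call a reasoning model here instead.
    rng = random.Random(seed)
    answer = f"candidate proof #{seed} for: {problem}"
    score = rng.random()  # placeholder for a verifier/judge score
    return answer, score

def best_of_n(problem: str, n: int = 8) -> str:
    """Launch n attempts in parallel; keep the highest-scoring one."""
    with cf.ThreadPoolExecutor(max_workers=n) as pool:
        attempts = list(pool.map(lambda s: solve_attempt(problem, s),
                                 range(n)))
    best_answer, _ = max(attempts, key=lambda a: a[1])
    return best_answer

print(best_of_n("a hard competition problem"))
```

The design question in any such system is the selection step: with no ground-truth checker, the "score" must itself come from a model or panel, which is exactly the hard-to-verify problem discussed later in this article.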

AI's performance in mathematics is truly amazing!

Just a few years ago, AI models were still struggling to solve primary school math problems.

As recently as 2024, GSM8K was still a standard benchmark for evaluating models.

GSM8K (Grade School Math 8K) is a dataset of 8,500 high-quality, linguistically diverse grade-school math word problems. Performance on it is now saturated: Claude 3 reaches about 95% accuracy.
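For readers who want to poke at the benchmark themselves, GSM8K is publicly available; a minimal loading sketch (assuming the openai/gsm8k dataset ID on the Hugging Face Hub):

```python
# Requires: pip install datasets
from datasets import load_dataset

# GSM8K as published on the Hugging Face Hub ("main" config).
ds = load_dataset("openai/gsm8k", "main")
print(ds["train"].num_rows, ds["test"].num_rows)  # roughly 7.5K / 1.3K

sample = ds["test"][0]
print(sample["question"])
print(sample["answer"])  # worked solution ending in "#### <number>"
```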

In mathematics, though, saturation was only temporary. Harder benchmarks followed: first the American Invitational Mathematics Examination (AIME), then the United States of America Mathematical Olympiad (USAMO).

The math leaderboard for open-source models last year

Now, AI has also won a gold medal in the International Mathematical Olympiad.

AI has blown through one math benchmark after another at astonishing speed.

AI May Be Developing Self-Awareness and Dares to Say "I Don't Have an Answer"

Sometimes AI "daydreams," fabricating wrong answers while remaining supremely confident in them.

This is the "hallucination" problem of reasoning models.

But OpenAI's model is different: it can firmly say "I don't know" when it cannot solve a problem.

On Problem 6 of the IMO, for example, the model chose not to gamble and instead admitted its limitations.

The new model significantly reduces the "hallucination" problem.

OpenAI researcher Noam Brown believes that AI is starting to shift toward self-aware reasoning:

In the past, mathematicians had to check a model's working carefully, because early systems would quietly botch an inequality or insert a wrong step, producing "hallucinated" answers.

When it lacks a valid proof, the newly updated IMO model tends to say "I'm not sure," which greatly reduces hidden errors.

This prompted Causal Coder, a netizen who firmly believes in AGI, to comment excitedly: "This is more important than winning a gold medal!"

Why? Because it avoids "hallucination" and makes AI more reliable.

A study in the journal "Nature" also supports this view: reducing wrong outputs is the key to AI progress.

This not only shines in math competitions but may also help us avoid detours in future scientific computations.

Mathematics Humbles Us, and AI Has a Long Way to Go

Although this progress is exciting, we are still far from solving the Millennium Prize Problems.

If we estimate that an IMO problem takes 1.5 hours of thinking, then solving a Millennium-level problem would require extending thinking time by a factor of thousands. There is still a long way to go.

GSM8K is grade-school math; a good student solves each problem in seconds. AI has now progressed from those few seconds to IMO level, where a gifted student spends about 1.5 hours per problem (4.5 hours for a set of three). Actual mathematical research, though, demands that these Olympiad prodigies spend on the order of 1,500 hours once they grow up, so between 1.5 hours and thousands of hours there is still a roughly thousand-fold gap.
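As a rough back-of-envelope chain using the figures in this section (the seconds-level GSM8K estimate is my own approximation): ~5 s per GSM8K problem → ×1,000 → 1.5 h per IMO problem → ×1,000 → ~1,500 h of research-level thinking → ×100 → ~100,000 h, the "hundreds of thousands of hours" evoked for Millennium-level problems.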

Facing the Millennium Prize Problems, entire fields of experts have made little progress after lifetimes of effort. The depth of mathematics humbles us: from 1.5 hours to hundreds of thousands of hours of human thinking is a very long road.

So far, only one of the seven Millennium Prize Problems, the Poincaré conjecture, has been solved.

It's Not Just Mathematics on the Road to General Intelligence

This breakthrough aims to develop general reasoning technology, not just for mathematics.

In just over a year, they extended reasoning time from on the order of 0.1 minutes to on the order of 100 minutes.

Beyond progress on long-horizon reasoning and hard-to-verify tasks, the work also involves scaling parallel compute and using multiple agents.

In a multi-agent reinforcement learning (MARL) experiment, two opposing teams of agents compete against each other.

They designed the "reward function" ingeniously so the AI could handle hard-to-verify problems. The same method could be applied to the Physics Olympiad as well, though the model still cannot do the experimental portion.
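The article doesn't spell out the reward design, but one natural pattern for hard-to-verify outputs, echoing the three-medalist grading protocol mentioned earlier, is to reward only outputs that a panel of independent graders unanimously accepts. A toy sketch, with all names and the trivial graders purely hypothetical:

```python
from collections import Counter

def consensus_grade(proof, graders):
    """Toy reward signal for a hard-to-verify output: accept only if
    every independent grader agrees the proof is correct. In a real
    system the graders might be expert humans or verifier models."""
    votes = Counter(bool(g(proof)) for g in graders)
    return votes[True] == len(graders)

# Three identical keyword-based graders, for illustration only.
graders = [lambda p: "QED" in p for _ in range(3)]
print(consensus_grade("...hence the bound holds. QED", graders))   # True
print(consensus_grade("...we conjecture the bound holds.", graders))  # False
```

Requiring unanimity trades recall for precision: some correct proofs get rejected, but a proof that fools all graders at once becomes much rarer, which matters when the reward itself is the only check.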

The techniques for extending thinking time, handling hard-to-verify tasks, and parallelizing compute are all general-purpose; the team plans to use them in other systems, or already is.

On the infrastructure side, they used essentially the same setup as other projects.

Nothing was specially customized for the IMO.

They said this method will be integrated into more OpenAI models in the future, comprehensively improving reasoning ability and yielding ever more powerful models.