
Terence Tao tested it himself: GPT-5 Pro spent 40 minutes on a three-year-old open problem and topped the world's hardest math exam.

新智元 · 2025-10-13 08:30
When mathematician Terence Tao handed a challenging geometry problem to GPT-5 Pro, the screen lit up within minutes: perfect reasoning, flawless logic, and still no answer. In the same week, it took the top score on the world's toughest math test. The numbers were dazzling, but they couldn't hide that momentary blank: did it really understand anything?

Ten years ago, mathematician Terence Tao was still at the blackboard, deriving every geometric formula with his students.

Ten years later, he tossed the same problem to a machine: GPT-5 Pro.

He wondered: Is AI just a faster calculator, or is it approaching true understanding?

A few minutes later, the screen lit up: Minkowski formula, Willmore inequality, volume integral... It wrote the entire reasoning into a perfect paper draft.

Looking at the string of results, Terence Tao was both shocked and a little disheartened: the problem remained unsolved, just nicely dressed up.

That same week, another digital "mathematical mountaineering" was also underway.

GPT-5 Pro posted the top score, 13%, on FrontierMath, the world's most difficult math test set.

The score was eye-catching, but the intuition wasn't there. It was like a child prodigy who excels at calculation yet puts down the pen when faced with real research.

So the question was no longer "Can AI solve problems?", but: How much of the world does it really understand?

Terence Tao's Actual Test

The "Three - Layer Performance" of AI in Scientific Research

Ten years ago, Terence Tao was still deriving geometric formulas with his students at the blackboard.

This mathematician, hailed as "the genius among geniuses", became the youngest Fields Medalist at the age of 21.

Ten years later, he decided to personally verify what this AI that "scored 13%" could actually do.

He didn't choose a standard question bank but brought it into a real scientific research setting, where there were no standard answers, only open-ended questions.

"I want to see if AI can come up with new ideas in areas I'm not good at." So, he posted this problem on MathOverflow:

If a smoothly embedded sphere in R³ has principal curvatures no greater than 1, is the volume it encloses at least as large as that of the unit sphere? This is not my area of expertise (differential geometry), but I want to see if AI can offer new ideas.

This is a differential geometry problem. The two-dimensional case has long been settled by a theorem (the Pestov–Ionin theorem), but the three-dimensional version remains unsolved to this day.
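
For readers who prefer a symbolic statement, here is one way to write the conjecture formally, following the |κ₁|, |κ₂| ≤ 1 reading the article itself uses later. This restatement is added as a reader's aid and is not part of the original MathOverflow post.

```latex
% Formal restatement of the open question (reader's aid, not from the original post)
\[
\Sigma = \partial V \subset \mathbb{R}^3 \ \text{a smoothly embedded } S^2,\quad
|\kappa_1|,\,|\kappa_2| \le 1 \ \text{on } \Sigma
\ \overset{?}{\Longrightarrow}\
\operatorname{vol}(V) \ \ge\ \tfrac{4\pi}{3} = \operatorname{vol}(B^3).
\]
```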

The problem was posed three years ago, and no one has managed to solve it since.

Terence Tao wasn't testing AI; he was pushing it into a scientific research area without standard answers.

After interacting with ChatGPT continuously for about 40 minutes, he summarized: AI assistance is helpful at the micro and macro levels, but limited at the meso level.

Now let's walk through how Terence Tao worked on the problem with AI.

AI as a Computational Assistant

He first asked GPT-5 Pro to handle the easiest, "star-shaped" case.

Within a few minutes, AI generated a reasoning chain and automatically invoked three classic conclusions:

Minkowski integral formula: |Σ| = ∫_Σ H s dA;

Willmore inequality: ∫_Σ H² dA ≥ 4π;

Volume formula: vol(V) = ⅓ ∫_Σ s dA.

Then it integrated them all into one sentence:

If |κ₁|, |κ₂| ≤ 1, then vol(V) ≥ 4π/3, that is, the volume enclosed by the unit sphere.
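
Putting the three ingredients together, here is a sketch of how they yield the star-shaped case, under the usual conventions that H = (κ₁ + κ₂)/2 is the mean curvature and s = ⟨X, ν⟩ is the support function, which is nonnegative when the surface is star-shaped about the origin. This is a reconstruction of the standard argument for the reader, not a transcript of the model's output.

```latex
% Sketch of the star-shaped case (reconstruction; conventions as stated above)
\begin{align*}
\operatorname{vol}(V)
  &= \tfrac{1}{3}\int_\Sigma s\,dA               && \text{(volume formula)}\\
  &\ge \tfrac{1}{3}\int_\Sigma H\,s\,dA          && (H \le 1,\ s \ge 0)\\
  &= \tfrac{1}{3}\,|\Sigma|                      && \text{(Minkowski formula)}\\
  &\ge \tfrac{1}{3}\int_\Sigma H^2\,dA           && (H^2 \le 1)\\
  &\ge \tfrac{1}{3}\cdot 4\pi = \tfrac{4\pi}{3}  && \text{(Willmore inequality)}.
\end{align*}
```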

The AI not only calculated correctly but also cited, unprompted, the Minkowski first integral formula, which he hadn't mentioned, and even offered two proof routes.

Terence Tao wrote in a subsequent post:

It can complete all derivations based on the clues I provided. This part is almost impeccable.

At this stage, the AI was like a perfect "mathematical engine": it could derive, prove, and give examples, but it shone only on local tasks.

From Assistant to Mirror

He further probed it: If the surface is deformed and slightly deviates from a perfect sphere, can it still maintain stable reasoning?

AI quickly gave an answer - accurate and beautiful, but in the wrong direction.

Terence Tao wrote in his log:

It started to comply rather than question.

This is exactly the "mirror trap" of research - oriented AI: When the direction is wrong, it will whitewash the error and even make the error more "beautiful".

Although the problem wasn't solved, this experiment still gave Terence Tao new insights.

He realized that the real obstacle wasn't "nearly spherical" surfaces, but extremely long, non-convex, sock-like structures, which can stretch the geometric scale indefinitely while barely adding any volume.

Terence Tao later summarized:

AI did make me understand the problem faster - not because it solved it, but because I saw why it couldn't solve it.

This sentence also became the starting point for all his subsequent AI experiments.

When GPT-5 Climbed the "Everest" of Mathematics

A Summit with Only a 13% Success Rate

Meanwhile, in the same days that Terence Tao was bringing AI into a real research setting, another "digital mountaineering competition" was taking place.

At the beginning of October, the research institution Epoch AI posted a tweet of fewer than 30 words. This time it wasn't about an experiment, but an announcement of a new summit on the "mathematical Everest".

Behind that announcement lies one of the world's most difficult mathematical tests: FrontierMath Tier 4.

Epoch AI described it on its official website as a "research-level problem set". The difficulty of the questions is such that experts may take weeks or even months to make progress.

In other words, this tests "whether one can think", rather than "whether one can calculate".

From Gemini 2.5 to GPT-5 Pro: A Three-Month Summit Competition

In July, Epoch AI first publicly launched FrontierMath Tier 4, calling it the "Everest of AI mathematical ability": a research-level question bank designed specifically to test the ultimate reasoning ability of models.

At that time, no model could gain a foothold on it.

In August, Google's Gemini 2.5 Pro took the lead:

We've just completed the initial evaluation of Gemini 2.5 Pro on FrontierMath. This time, we used the old-version reasoning scaffold, and the results are not final.

In September, they updated the scoring mechanism and introduced a "retry mechanism", allowing AI to correct itself after a reasoning failure.
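
Epoch AI has not published its harness, so the snippet below is only a generic sketch of what such a retry mechanism could look like; the `model.attempt` and `grader.check` calls are hypothetical stand-ins, not a real API.

```python
# Generic sketch of a "retry after failure" evaluation loop (illustrative only;
# model.attempt and grader.check are hypothetical stand-ins, not Epoch AI's API).
def solve_with_retries(model, problem, grader, max_attempts=2):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        answer = model.attempt(problem, feedback=feedback)  # model's proposed solution
        correct, feedback = grader.check(problem, answer)   # verdict plus error feedback
        if correct:
            return answer, attempt   # solved on this attempt
    return None, max_attempts        # still unsolved after all retries
```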

Everything seemed to be preparing for the decisive battle in October.

Just one day before Terence Tao sat down with GPT-5 Pro to probe his unsolved problem, Google's Gemini 2.5 Deep Think set a new record.

Epoch AI wrote:

We evaluated Gemini 2.5 Deep Think on FrontierMath. Since there's no API, we ran it manually. Result: A new record!

On October 11, Epoch AI posted the tweet that caused a stir:

FrontierMath Tier 4: The ultimate showdown! GPT-5 Pro set a new record (13%), answering one more question correctly than Gemini 2.5 Deep Think (but the difference is not statistically significant).

In the accompanying chart, Grok 4 Heavy sits on the left at about 5%, Gemini 2.5 Deep Think in the middle at about 12%, and GPT-5 Pro on the far right, slightly higher at 13%.


This means that although GPT-5 Pro is temporarily "standing on the summit", it still has the distance of an entire mountain to go before it truly understands.

The tug-of-war looks more like a draw; GPT-5 simply reached the summit a moment earlier than Gemini 2.5.
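
To see why a one-question gap fails to clear the bar of statistical significance, here is a back-of-the-envelope check. The problem count and raw solve numbers below are assumptions chosen only to match the reported ~12% and 13%, since the article gives neither.

```python
# Illustrative only: N and the solve counts are assumed, chosen to match ~12% vs 13%.
from math import sqrt

N = 48  # assumed number of Tier 4 problems (not stated in the article)
scores = {"GPT-5 Pro": 6, "Gemini 2.5 Deep Think": 5}  # hypothetical solve counts

for name, k in scores.items():
    p = k / N
    half_width = 1.96 * sqrt(p * (1 - p) / N)  # 95% CI half-width, normal approximation
    print(f"{name}: {p:.1%} +/- {half_width:.1%}")

# The two confidence intervals overlap almost entirely, which is why one extra
# correct answer is not a statistically significant difference.
```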

Behind the High Score: A Victory of Algorithms or an Illusion?

This summit competition actually reveals another fact: AI's scores can improve, but its understanding remains limited.

And this problem was further magnified in Terence Tao's actual test.

The questions it got right were mostly well-structured, highly symbolic types: algebra, linear systems, and basic analysis.

It made almost no progress on questions involving geometric construction, partial differential equations, non-convex spaces, and the like.

Epoch AI itself knows that this is more a "slight victory of algorithms" than a "mathematical breakthrough".

This high score was achieved through higher computing power, longer reasoning chains, and smarter prompts.

So the question becomes: When the score goes up, does the understanding also increase?

Maybe in the world of algorithms, it won; in the world of understanding, it hasn't even started.

When "Smartness" Has a Scale

The Boundaries of AI in Scientific Research

A few months later, Terence Tao ran another experiment. This time it wasn't about testing whether AI could solve problems, but about testing himself: when everything can be automated, what is left for humans to think about?

I found that smartness also has a scale.

When he wrote that sentence, he thought back to the unsolved geometry problem: the AI was perfect in every step but lost its way on direction.

He finally understood: perhaps what really needs to be trained is ourselves.

He gave an example: a proof-verification tool based on dependent types can let him instantly check a line of proof, but when dozens of consecutive lines are completed by it, it becomes even harder for him to see the whole logical picture.
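
The article doesn't name the tool, but Lean 4 is one dependently typed proof assistant of this kind; purely for illustration, the one-liner below is the sort of single proof line such a checker verifies instantly.

```lean
-- Illustration only (the article leaves the tool unnamed): a single proof line
-- that a dependently typed checker such as Lean 4 verifies instantly.
example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```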

When the scale is further enlarged, the problem becomes more obvious.

When AI helps complete an entire paper or automatically compiles an entire textbook, the apparent "efficiency improvement" often means a degradation in the understanding of the structure.

The essence of mathematics lies in structure and connection, and the understanding of structure precisely requires "slow human thinking".

Terence Tao wrote in a subsequent post:

The optimal degree of automation is neither 0% nor 100%.

The truly efficient state is to leave room for humans at every level. If we let AI solve all