
Google beats OpenAI in a math challenge tougher than the IMO.

QbitAI · 2026-02-26 15:56

The latest set of math problems that Terence Tao recommends keeping an eye on.

The IMO gold medal is already "outdated."

Google's math agent Aletheia, built on Gemini 3 Deep Think, achieved the best result in the even tougher FirstProof competition.

According to the published score sheet, Aletheia solved 6 of the 10 problems with no human intervention at any point. Five of those solutions were unanimously approved by the expert reviewers, and one received a 5/7 approval rate.

FirstProof is a set of math problems designed specifically to test AI's capacity for independent research, created jointly by 11 top mathematicians from universities including Harvard and Stanford.

None of the 10 problems can be found anywhere online, so memorizing answers is not an option. Even Terence Tao reposted the results, calling them very interesting and recommending that people pay attention.

Google wasn't alone: an internal OpenAI model also took the test and got 5 problems essentially correct.

However! Google's run was entirely autonomous, while OpenAI had humans step in during the exam to pick the best answers (doge).

Google has a slight edge


Unlike standardized competition problems such as those of the IMO, the 10 problems in this challenge are drawn directly from real difficulties mathematicians have run into in their own research, and none had ever been released publicly.

Moreover, the answers were released only after the AI models finished, ruling out any possibility of memorized answers or templated solutions.

First, the score sheet. In a seven-day sprint, OpenAI got 5 problems essentially correct:

4. Inequality of the harmonic mean of finite additive convolution and Φₙ;

5. Geometric fixed-point criterion for O-adapted slice filtration and slice connectivity;

6. Large-scale ε-light vertex subsets;

9. Algebraic relations between scaled quadrilinear determinant tensors;

10. Kernelized CP–ALS subproblem with missing data: matrix-free PCG based on Kronecker preconditioning.

In fact, OpenAI's initially published score sheet claimed 6 problems. But after the community repeatedly pointed out logical flaws in the second problem (determination of the non-vanishing of the Rankin–Selberg integral for GLₙ over non-Archimedean local fields), the team conservatively revised the count down to 5.

The team also disclosed that, during testing, humans manually relayed messages between the model and ChatGPT for verification, formatting, and stylistic polish.

For some problems, the final submission was the best candidate selected by hand.

Google's Aletheia, by contrast, solved all 6 of its problems independently, including the second problem that OpenAI was challenged on.

In the expert review, it got unanimous approval from the experts on questions 2, 5, 7, 9, and 10.

Problem 7 is recognized as the hardest of the set: it was an open, unsolved problem, first cracked by the Cappell–Weinberger–Yan team only when the reference answers for this FirstProof challenge were released.

Problem 8 did not receive unanimous approval, but still scored a high 5/7.

The corresponding questions are:

2. Determination of the non-vanishing of the Rankin–Selberg integral for GLₙ over non-Archimedean local fields;

5. Geometric fixed-point criterion for O-adapted slice filtration and slice connectivity;

7. Realizability of a uniform lattice with 2-torsion in a real semisimple group as the fundamental group of a compact manifold;

8. Existence of 4-vertex Lagrangian smoothings of polyhedral Lagrangian surfaces;

9. Algebraic relations between scaled quadrilinear determinant tensors;

10. Kernelized CP–ALS subproblem with missing data: matrix-free PCG based on Kronecker preconditioning.

Both in the number of problems solved and in how it solved them, Google's Aletheia holds a slight edge: it got one more problem right, and did so fully autonomously.

Next, let's look at how Aletheia actually works.

AI independently selects the best of two

First, the underlying model is Gemini 3 Deep Think, the same model that previously won the IMO gold medal.

Aletheia runs two versions of Gemini 3 Deep Think, A and B, and picks the better answer between them. (Version A is the latest, from February 2026; version B is from January 2026.)

From reading the problems to submitting the answers, the entire pipeline involves zero human intervention.

Aletheia reads the original problem statements directly, with no human reformatting, and produces answers through independent reasoning.

Then, through built-in verification and prompt-based extraction, it automatically checks logical rigor, formats the result, and outputs the final answer directly in LaTeX.

As for the 4 unsolved problems, Aletheia didn't get them wrong; it simply declined to answer.

This comes from a screening mechanism: when Aletheia cannot produce a reliable proof, it does not fabricate an invalid answer but outputs "no solution" instead.

Aletheia can also dynamically allocate its reasoning compute. Facing the extremely difficult problem 7, for example, it automatically invested far more compute than on routine problems, iterating through multiple rounds of generation by a Generator sub-agent and strict checking by a Verifier sub-agent until the problem was cracked.

For easier problems, it throttles compute to avoid wasting resources.
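Google has not published Aletheia's implementation. Purely as an illustration of the generate-and-verify pattern with a refusal fallback described above, here is a minimal Python sketch; the function names, the compute "budget", and the toy number-theory problem are all hypothetical stand-ins:

```python
import itertools

def solve_with_verification(problem, generate, verify, budget):
    """Try up to `budget` candidates; refuse rather than guess.

    Mirrors the described Generator/Verifier loop: a harder problem
    can simply be given a larger `budget` (more compute), and if no
    candidate survives verification, the answer is "no solution".
    """
    for candidate in itertools.islice(generate(problem), budget):
        if verify(problem, candidate):
            return candidate
    return "no solution"

# Toy stand-in: "prove" n is composite by exhibiting a factor.
def generate(n):       # Generator sub-agent stand-in: propose candidates
    yield from range(2, n)

def verify(n, d):      # Verifier sub-agent stand-in: check the candidate
    return n % d == 0

print(solve_with_verification(91, generate, verify, budget=200))  # → 7
print(solve_with_verification(97, generate, verify, budget=200))  # → no solution
```

The key design point is that refusal is a first-class outcome: exhausting the budget returns "no solution" rather than the last (unverified) candidate.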

For example, on problem 10, a numerical problem about tensor decomposition, Aletheia proposed an efficient method for the matrix-vector products.

Rather than materializing the huge Khatri–Rao product matrix Z, it generates the required rows on the fly, compressing the per-iteration complexity to O(qr + n²r), several orders of magnitude faster than the O(n³r³) of a conventional linear solver.
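The article does not reproduce Aletheia's actual code, and the exact structure it exploits for problem 10 is not specified. As a generic illustration of the matrix-free idea, here is a NumPy sketch (shapes and names are my own assumptions) that applies a Khatri–Rao product to a vector without ever forming Z, using the standard identity (A ⊙ B)v = vec(B·diag(v)·Aᵀ) with column-major vec:

```python
import numpy as np

def khatri_rao_matvec(A, B, v):
    """Compute (A ⊙ B) @ v without forming the Khatri-Rao product.

    A: (m, r), B: (n, r), v: (r,).  Uses the identity
    (A ⊙ B) v = vec(B diag(v) A^T) with column-major vec, costing
    O((m + n) r) flops plus an O(mn) reshape, instead of the
    O(mnr) needed just to materialize the (m*n)-by-r matrix Z.
    """
    # (B * v) scales column k of B by v[k]; multiplying by A.T
    # then gives the n-by-m matrix B diag(v) A^T.
    M = (B * v) @ A.T
    return M.flatten(order="F")  # column-major vec

# Sanity check against the explicitly materialized Khatri-Rao matrix.
rng = np.random.default_rng(0)
m, n, r = 4, 3, 5
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))
v = rng.standard_normal(r)

Z = np.column_stack([np.kron(A[:, k], B[:, k]) for k in range(r)])
assert np.allclose(Z @ v, khatri_rao_matvec(A, B, v))
```

This is the same principle the article describes for the PCG inner loop: every place the solver would touch Z, substitute a cheap on-the-fly computation.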

Google holds a slight edge this round. The next problem set arrives in mid-March, and it will only get harder. Stay tuned~

Reference links:

[1]https://x.com/lmthang/status/2021644542852968952

[2]https://mathstodon.xyz/@tao/116022211452443707

[3]https://x.com/polynoamial/status/2022527227049742779

This article is from the WeChat official account "QbitAI" (量子位), a channel focused on cutting-edge technology. Republished by 36Kr with authorization.