
Nature has published the technical details of the Google model that won gold at the IMO (International Mathematical Olympiad). The core team consisted of only about 10 people, who over the course of a year produced 80 million mathematical problems to train the AI.

Quantum Bit (量子位) · 2025-11-13 16:55
Core team member and IMO gold medalist: Miklós Horváth

The full technology behind Google DeepMind's IMO gold-medal model has been made public!

Following DeepMind's naming tradition, this time it's called: AlphaProof.

Published once again in Nature, the full AlphaProof paper details, for the first time, the technical architecture and training methods behind the system. Notably, the self-taught, chess-playing AlphaZero is referenced multiple times in the paper.

Author Tom Zahavy also took this opportunity to share some details of the development process:

The AlphaProof team was not very large. Most of the time, there were only about 10 people, and more joined as the IMO competition approached.

The core team member who really brought about the breakthrough was Miklós Horváth, an IMO gold medalist.

He came up with a method to create various variants of the problems the AI was dealing with and use them as initial states for the agent to train on.

Over the whole year, the team also explored various research ideas. Although many failed, the successful ones were integrated into the AlphaProof system and are now fully made public.

Playing Mathematical Proofs as a Game

The core idea of AlphaProof is actually quite straightforward: turn the mathematical proof process into a game that can be trained repeatedly.

AlphaProof builds a reinforcement-learning environment on top of the Lean theorem prover. In this environment, each mathematical proposition is a new game level, and the AI must choose appropriate strategies (tactics) to advance the proof.

If a tactic succeeds, it yields new sub-goals; once all goals are closed, the proof is complete.
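To make the "proof as a game" framing concrete, here is a minimal Lean 4 example (my illustration, not taken from the paper): applying a tactic is a move that either closes the current goal or opens new sub-goals.

```lean
-- Minimal illustration: one tactic "move" can split a goal into sub-goals.
example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
  constructor   -- move: splits ⊢ p ∧ q into two sub-goals, ⊢ p and ⊢ q
  · exact hp    -- closes the first sub-goal
  · exact hq    -- closes the second sub-goal; no goals left, proof complete
```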

The paper reveals that AlphaProof uses a 3-billion-parameter encoder-decoder transformer model as its "brain".

The proof network must not only understand the current proof state but also simultaneously output two key pieces of information:

One is to suggest which strategies to try next, and the other is to estimate how many more steps are needed to complete the proof.

This design allows the system to allocate computing resources more intelligently and prioritize exploring the most promising proof paths.
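As a rough sketch of that dual output (illustrative only; none of these names come from the paper), the proof network can be viewed as a function from a proof state to a set of tactic candidates plus a remaining-steps estimate:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ProofState:
    goals: List[str]  # pretty-printed Lean goals, e.g. "⊢ a + b = b + a"

class ProofNetwork:
    """Hypothetical stand-in for the 3-billion-parameter encoder-decoder.

    Given a proof state it returns (1) candidate tactics with prior
    probabilities and (2) an estimate of how many steps remain -- the two
    outputs the paper attributes to the proof network.
    """

    def propose(self, state: ProofState) -> Tuple[List[Tuple[str, float]], float]:
        # In the real system both outputs come from the transformer;
        # fixed placeholders are returned here purely for illustration.
        candidates = [("simp", 0.4), ("ring", 0.3), ("nlinarith", 0.3)]
        estimated_steps_remaining = 5.0
        return candidates, estimated_steps_remaining

net = ProofNetwork()
tactics, steps_left = net.propose(ProofState(goals=["⊢ a + b = b + a"]))
print(tactics, steps_left)  # prefer branches with few estimated steps remaining
```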

In terms of the search algorithm, AlphaProof adopted a tree search inspired by AlphaZero but made key improvements.

For example, it introduced an AND-OR tree structure to handle multiple independent sub-goals in the proof. When a proof needs to meet multiple conditions simultaneously, the system breaks them down into independent sub-problems and tackles them separately. Additionally, a progressive sampling mechanism was added to let the system explore more diverse proof strategies on critical paths.
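A compressed sketch of the AND-OR idea (my illustration, not DeepMind's implementation, and without the value-guided search or progressive sampling): an OR choice picks which tactic to try, and every sub-goal that tactic produces is an AND child that must itself be proved.

```python
# OR over candidate tactics; AND over the sub-goals a tactic produces.
def prove(goal, propose_tactics, apply_tactic, depth=8):
    """Return a proof tree (goal, tactic, children) or None if not found."""
    if depth == 0:
        return None
    for tactic in propose_tactics(goal):        # OR: alternative moves
        subgoals = apply_tactic(goal, tactic)   # None means the tactic failed
        if subgoals is None:
            continue
        children = []
        for sub in subgoals:                    # AND: every sub-goal must close
            sub_proof = prove(sub, propose_tactics, apply_tactic, depth - 1)
            if sub_proof is None:
                break
            children.append(sub_proof)
        else:
            return (goal, tactic, children)     # all sub-goals proved
    return None

# Toy environment: "p ∧ q" splits into its conjuncts; atoms close by assumption.
def propose_tactics(goal):
    return ["constructor", "assumption"]

def apply_tactic(goal, tactic):
    if tactic == "constructor" and "∧" in goal:
        return [part.strip() for part in goal.split("∧")]
    if tactic == "assumption" and goal in {"p", "q"}:
        return []                               # no sub-goals left
    return None

print(prove("p ∧ q", propose_tactics, apply_tactic))
```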

The biggest challenge in training AlphaProof: where do you get that many math problems?

They first pre-trained the model with about 300 billion tokens of code and mathematical text to let it understand the basic logical structure and mathematical language. Then they fine-tuned the model with about 300,000 manually written proofs from the Mathlib library to let it learn the syntax and proof techniques of Lean.

The real breakthrough came from the automatic formalization process. The team developed a special translation system based on Gemini 1.5 Pro, which can convert natural-language mathematical problems into a formal language that Lean can understand. Through repeated iteration and improvement, this system finally generated about 80 million formalized problems from about 1 million natural-language math problems, far exceeding all existing datasets.
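The pipeline can be pictured roughly as follows (a hedged sketch: `translate_to_lean` stands in for the Gemini-based translator and `lean_accepts` for checking a statement with the Lean compiler; neither name comes from the paper):

```python
def autoformalize(informal_problems, translate_to_lean, lean_accepts, attempts=4):
    """Build a pool of formal Lean statements from natural-language problems.

    Each informal problem may yield several candidate formalizations; only
    statements that Lean accepts as well-formed are kept for RL training.
    """
    formal_pool = []
    for problem in informal_problems:
        for _ in range(attempts):
            statement = translate_to_lean(problem)  # LLM: informal -> Lean statement
            if lean_accepts(statement):             # discard statements that don't compile
                formal_pool.append(statement)
    return formal_pool
```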

The main reinforcement learning loop is the core of the entire training process. The system continuously tries to prove or disprove these automatically generated propositions, and successful proofs are used to update the neural network.

Even if the result of automatic formalization is not completely accurate, as long as it is a valid formal proposition, AlphaProof can learn from the process of trying to prove it.

The entire main training phase consumed about 80,000 TPU days of computing resources.
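Sketched end to end (the function names are placeholders, not the paper's API), the main loop amounts to: search each statement, fall back to attempting a disproof, and turn whatever succeeds into training data.

```python
def main_rl_loop(formal_pool, search, negate, update_network, rounds=10):
    """AlphaZero-style loop, sketched: search generates data, data improves search."""
    for _ in range(rounds):
        experience = []
        for statement in formal_pool:
            proof = search(statement)              # network-guided tree search in Lean
            if proof is None:
                proof = search(negate(statement))  # a disproof is also a useful signal
            if proof is not None:
                experience.append(proof)           # even imperfect formalizations teach tactics
        update_network(experience)                 # refresh policy and value targets
```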

The core architecture diagram in the paper shows how the two learning loops of AlphaProof work together.

In the main reinforcement learning loop, about 1 million informal math problems are first processed by the formalization system and translated into about 80 million formalized problems that Lean can understand. The proof network, working with the tree search algorithm, continuously makes attempts in the Lean environment. Whether a proof is found, a disproof is found, or the attempt times out, each attempt generates experience data that is fed back to the learning system.

The reinforcement learning loop during testing shows a more refined adaptation mechanism.

When faced with a particularly difficult target problem, the variant generator produces about 400,000 related variants of that problem, in effect creating a small dataset dedicated to a single problem.

These variants embody various mathematical intuitions: simplifying to special cases, generalizing to a broader form, exploring similar structures, and so on.

The system then starts an independent AlphaZero-style learning process that trains specifically on these variants, gradually accumulating the insights needed to solve the original problem. This mechanism can handle multiple target problems in parallel, with each problem getting its own variant curriculum and dedicated learning process.
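Sketched in code (illustrative only; the variant count comes from the article, the function names do not), the test-time loop for a single target problem might look like this:

```python
import random

def ttrl(target, make_variants, search, update_network, max_rounds=72):
    """Test-time RL sketch: specialize a fresh learner on one hard problem."""
    variants = make_variants(target, n=400_000)   # simplifications, generalizations, analogues
    for _ in range(max_rounds):                   # the real runs took 2-3 days per problem
        batch = random.sample(variants, k=1024)   # train on a slice of the variant curriculum
        proofs = [p for p in map(search, batch) if p is not None]
        update_network(proofs)                    # accumulate problem-specific insight
        proof = search(target)                    # periodically retry the original problem
        if proof is not None:
            return proof
    return None
```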

An On-the-Spot Breakthrough at the IMO

AlphaProof's performance at the 2024 IMO was remarkable, and more details of the development behind it are now public.

For IMO-level problems, simply increasing the search time is often not enough. This is where the test-time reinforcement learning (TTRL) described earlier comes in: generate a large number of related variant problems (simplified versions, generalized versions, analogous versions, and so on), and then train a dedicated "expert" model to tackle that one problem.

Take the first problem of the 2024 IMO as an example. This problem requires finding all real numbers α that satisfy specific divisibility properties. The variants generated by AlphaProof include: only considering the case of rational numbers, assuming that α satisfies stronger properties, proving that α must be close to an integer, etc. By training on these variants, the system gradually grasped the key to solving the original problem.
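For reference, here is the problem in question, paraphrased from the official 2024 IMO statement (not quoted from the paper):

```latex
% IMO 2024, Problem 1 (the answer turns out to be exactly the even integers)
\textbf{Problem.} Find all real numbers $\alpha$ such that, for every positive
integer $n$, the integer
\[
  \lfloor \alpha \rfloor + \lfloor 2\alpha \rfloor + \cdots + \lfloor n\alpha \rfloor
\]
is a multiple of $n$.
```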

In the actual competition, AlphaProof successfully solved three problems in algebra and number theory (P1, P2, P6). Among them, P6 was the most difficult problem in the entire competition, and only 5 out of 609 contestants completely solved it.

The TTRL process for each problem required 2-3 days of computing time, far exceeding the 9-hour limit for human contestants. However, considering that the most advanced AI systems before could hardly solve the simplest IMO problems, this achievement is quite remarkable.

Tom Zahavy recalled that, during the competition, the partial proofs produced up to that point indicated they would only reach bronze-medal level, but TTRL was still running in the background.

Three days later, when three complete proofs appeared one after another, they finally determined that they could get a gold medal, and the team celebrated excitedly.

Where is the Next Step for Mathematical AI?

After AlphaProof won the gold medal, Google DeepMind opened up its capabilities to the scientific community. Researchers can apply for access, and several mathematicians shared their experiences of trying out AlphaProof in Nature.

Alex Kontorovich, a mathematician from Rutgers University, found that AlphaProof is particularly good at finding counterexamples:

Every time it pointed out a problem with my statement, I could quickly figure out what assumption was missing, adjust the statement, and try again. This back-and-forth iteration is crucial for getting the correct formalized statement.

Professor Talia Ringer from the University of Illinois asked two of her doctoral students to each provide a lemma they found tricky. AlphaProof proved one of them within a minute and disproved the other, which turned out to have a flaw in its definition.

She commented that "AlphaProof's tendency to find counterexamples may be its most surprisingly useful feature."

Of course, mathematicians' tests also revealed AlphaProof's limitations.

Kevin Buzzard from Imperial College London encountered difficulties when trying to use it to translate the proof of Fermat's Last Theorem. He found that when the proof was full of "customized definitions", AlphaProof didn't work very well.

This also confirms the discovery of the AlphaProof team in the paper: the system performs well when dealing with concepts already in Mathlib but encounters bottlenecks when facing completely new definitions.

Tom Zahavy also shared his thoughts on the application of AI in the field of mathematics:

One of the major challenges AlphaProof faces is its dependence on the Lean theorem prover. Although Lean is powerful and has an active community, its continuous evolution creates an unstable environment for AlphaProof. In practice, this means AlphaProof tends to perform better in the sub-fields of mathematics where Lean's libraries and advanced tactics are already mature.

Another key issue is "data finiteness". The number of unique math problems is limited. For the reinforcement learning agent to become truly general, it needs to be able to generate its own problems. Although some success has been achieved in creating IMO-level problem variants, this direction needs further expansion.

Hinton pointed out in an interview in June this year that AI is likely to be much stronger than humans in mathematics in the future because it can instantly share knowledge and generate its own training data in a closed mathematical system.

AlphaProof's approach offers a preview of that prediction.

Paper link: https://www.nature.com/articles/s41586-025-09833-y

Reference links:

[1] https://www.tomzahavy.com/post/how-we-achieved-an-imo-medal-one-year-before-everyone-else

[2] https://www.nature.com/articles/d41586-025-03585-5

This article is from the WeChat official account "Quantum Bit" (量子位), which focuses on cutting-edge technology. Republished by 36Kr with permission.