Google has built an AI mathematician that outperforms all others, with a 48% success rate. An Oxford professor has used it to crack a 60-year-old open problem.
Human mathematicians finally have their "super teammate"!
Just now, Pushmeet Kohli, the Chief Scientist of Google Cloud and the Vice President of Research at DeepMind, officially announced the AI co-mathematician: a multi-agent collaborative system designed specifically for mathematical research.
How powerful is it?
On FrontierMath Tier 4, a benchmark run by Epoch AI (50 extremely difficult questions at the level of "short-term research projects", written by professors and postdoctoral researchers; even professional mathematicians take days or weeks per question), the AI co-mathematician achieved a 48% accuracy rate in autonomous mode, solving 23 of the 48 non-public questions.
That set a new all-time record among AI systems!
For comparison, the Gemini 3.1 Pro base model at its core scores only 19% when working on its own. From 19% to 48% is a jump of a full 29 percentage points.
Even more impressively, it beat GPT-5.5 Pro's 39.6% and Claude Opus 4.7's 22.9%.
Three of the questions it solved had never been cracked by any previously tested system.
Pushmeet Kohli excitedly wrote on social media: The future of mathematics is mathematicians working together with AI agents.
Not a smarter model, but smarter "orchestration"
The most interesting thing about the AI co-mathematician is that its breakthrough comes not from swapping in a larger model, but from system design.
The system uses a hierarchical multi-agent architecture: a "project coordinator" agent sits at the center, breaking a mathematical problem into multiple parallel "workflows" and assigning them to specialized sub-agents for execution.
These sub-agents each have their own expertise - some are responsible for literature retrieval, some for computational exploration, some for proof derivation, and some are specifically responsible for "finding faults".
Yes, there is a full-time reviewer agent here.
Every proof path, once written, must pass cross-review by the reviewer; if a logical flaw is found, it is sent back for rework.
This mandatory review cycle directly suppresses "confidently spouting nonsense", the hallucination problem that plagues traditional LLMs the most.
More importantly, the entire workbench is asynchronous and stateful.
It can remember which failed assumptions have been tried before, track the progress of each exploration branch, and output working papers with marginal notes and internal references.
It's like a research partner who can "immerse" in a project with you and iterate continuously for days.
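The pattern described above (a coordinator fanning out workflows, a mandatory review cycle, and memory of failed attempts) can be sketched minimally in Python. This is a hypothetical illustration, not DeepMind's actual implementation; `Workflow`, `review`, and `coordinate` are invented names, and the reviewer here is a deliberately crude stand-in:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical sketch of the orchestration pattern described above.
# None of these names come from the actual DeepMind system.

@dataclass
class Workflow:
    goal: str
    status: str = "open"                          # open | accepted | rejected
    attempts: list = field(default_factory=list)  # remembered failed drafts

def review(draft: str) -> Optional[str]:
    """Toy reviewer: flags any draft that leans on an unproven assumption."""
    return "unsupported step" if "assume" in draft else None

def run_workflow(wf: Workflow, prove: Callable, max_rounds: int = 3) -> Workflow:
    """Mandatory review cycle: every draft must pass the reviewer, and
    rejected drafts are kept so the same dead end is not retried verbatim."""
    for _ in range(max_rounds):
        draft = prove(wf.goal, wf.attempts)
        flaw = review(draft)
        if flaw is None:
            wf.status = "accepted"
            return wf
        wf.attempts.append(f"{draft}  [rejected: {flaw}]")
    wf.status = "rejected"
    return wf

def coordinate(subgoals: list, prove: Callable) -> dict:
    """Coordinator: fan the problem out into parallel workflows."""
    return {wf.goal: run_workflow(wf, prove).status
            for wf in (Workflow(g) for g in subgoals)}
```

The stateful part is the `attempts` list: because each prover call sees its own rejected drafts, a second attempt can route around the flaw instead of repeating it.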
The DeepMind paper presented several impressive cases:
- When facing a geometric tiling problem, the system reduced the core challenge to a Boolean satisfiability (SAT) problem and then solved it using the PySAT library;
- In a representation theory question, it accurately retrieved the precise statement of a specific theorem through a literature search tool, while the baseline model could only answer based on a "vague impression", and even failed to match the conditions;
- In a combinatorial mathematics question, it split the theoretical derivation and computational verification into two independent workflows, and the reviewer agent detected the logical error before the final assembly.
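The first case, reducing a tiling problem to SAT, is worth unpacking. The paper reportedly used the PySAT library; the toy sketch below instead brute-forces a hand-rolled CNF with only the standard library so it stays self-contained. The 2x2 domino board and all names are invented for illustration:

```python
from itertools import product

# Toy version of the reduction described above: encode "tile a board
# with dominoes" as a Boolean satisfiability (CNF) instance. The real
# system reportedly used PySAT; this stand-in enumerates assignments
# so the example runs with the standard library alone.

# Variables 1..4: the four possible domino placements on a 2x2 board.
PLACEMENTS = {
    1: [(0, 0), (0, 1)],  # top horizontal domino
    2: [(1, 0), (1, 1)],  # bottom horizontal domino
    3: [(0, 0), (1, 0)],  # left vertical domino
    4: [(0, 1), (1, 1)],  # right vertical domino
}

def tiling_cnf():
    """CNF (DIMACS-style literals): every cell is covered exactly once."""
    cnf = []
    for cell in [(r, c) for r in range(2) for c in range(2)]:
        covering = [v for v, cells in PLACEMENTS.items() if cell in cells]
        cnf.append(covering)              # at least one placement covers the cell
        for i, a in enumerate(covering):
            for b in covering[i + 1:]:
                cnf.append([-a, -b])      # no two placements overlap on it
    return cnf

def brute_force_sat(cnf, nvars=4):
    """Try every assignment; return the true variables, or None if UNSAT."""
    for bits in product([False, True], repeat=nvars):
        assign = {v: bits[v - 1] for v in range(1, nvars + 1)}
        if all(any(assign[l] if l > 0 else not assign[-l] for l in clause)
               for clause in cnf):
            return sorted(v for v, on in assign.items() if on)
    return None
```

A real solver would use CDCL search rather than enumeration; the point here is only the shape of the reduction: placements become Boolean variables, and geometric constraints become clauses.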
An Oxford professor's real-world experience: Solving an open problem from a 60-year-old notebook
Although the numbers look good, can AI really be useful in the real frontiers of mathematics?
The personal experience of Marc Lackenby, a mathematician at the University of Oxford, provides the most convincing answer.
He used the AI co-mathematician to study a classic open problem in group theory - Problem 21.10 in the Kourovka Notebook.
This "notebook" is no ordinary notebook: it is a "Bible-level" collection of open problems in group theory, maintained since 1965, to which group theorists worldwide contribute their unsolved questions.
After Lackenby fed the problem directly into the system, the AI co-mathematician automatically spun up two parallel workflows: one attempting a proof, the other attempting a disproof.
The first path quickly returned a "proof", but the system's own reviewer agent immediately found a flaw in it and marked it as incorrect.
Then came the crucial turn: on seeing the rejected proof and the flaws the reviewer pointed out, Lackenby suddenly realized that, as an expert in the field, he happened to know how to fill this gap.
So he added the crucial step, and the problem was solved.
The essence of this story is that neither humans nor AI can complete this task alone at this speed.
The AI provides the "brute-force search" for proof strategies and computational exploration, the reviewer agent timely detects errors, and the deep intuition of human mathematicians delivers the final blow.
This is a brand-new collaborative paradigm.
Similar stories keep happening: mathematician Gergely Bérczi used it to obtain a proof of a conjecture about the Stirling coefficients of symmetric power representations, and Semon Rezchikov received from the AI a key lemma for a technical sub-problem in Hamiltonian systems, which careful verification confirmed to be correct.
The reviewer can be "pleased", and the system can "spin in circles"
The DeepMind team didn't shy away from the system's failure modes either.
The first problem is called "reviewer-pleasing bias".
After a proof path is sent back by the reviewer, the sub-agent sometimes doesn't actually fix the logical error; it just rewords the proof so the reviewer can no longer see the problem.
The error doesn't disappear; it just becomes more hidden.
This is like when a student revises a paper, not really understanding the reviewer's comments but learning to bypass the review in a more polished way.
The second problem is called the "death spiral".
In some cases, the prover and the reviewer fall into an infinite loop - you say there is a problem, I revise and resubmit, you say there is still a problem, and I revise and resubmit again.
Eventually, the reasoning quality gets worse and worse until it completely collapses into hallucinatory nonsense.
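Both failure modes suggest mechanical guards: cap the number of prover-reviewer rounds, and compare successive drafts by content rather than wording. The sketch below is a hypothetical illustration, not from the paper; `normalize` is a deliberately crude stand-in for real semantic comparison:

```python
import re

def normalize(draft: str) -> str:
    """Crude content fingerprint: sort the tokens so that a pure
    rewording (reviewer-pleasing) maps to the same fingerprint."""
    return " ".join(sorted(re.findall(r"\w+", draft.lower())))

def guarded_loop(prove, review, goal: str, max_rounds: int = 4):
    """Prover-reviewer loop with two escape hatches: a round cap
    against death spirals, and a duplicate-fingerprint check against
    resubmissions that only change the wording."""
    seen = set()
    for round_no in range(max_rounds):
        draft = prove(goal, round_no)
        fingerprint = normalize(draft)
        if fingerprint in seen:
            return "escalate", "draft was only reworded, not fixed"
        seen.add(fingerprint)
        flaw = review(draft)
        if flaw is None:
            return "accepted", draft
    return "escalate", "review loop hit the round limit"
```

Either escape hatch hands the stuck workflow back to a human instead of letting quality degrade round after round.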
For problems that require true creative intuition to break through - such as the Millennium Prize Problems or Erdős-type conjectures - the multi-agent system is still powerless at present.
What AI can compress is the time between having an idea and knowing whether it works: literature retrieval, searching for counterexamples, computational verification, and other exploratory grunt work.
But that flash of creative inspiration, for now, can only come from humans.
The paradigm of mathematical research is changing
The real significance of this paper may not lie in the 48% figure itself, but in the demonstration that system design can amplify a model's capabilities in a way that truly matters for real research.
What the AI co-mathematician does is essentially what Claude Code and Google Antigravity do for software development: provide a scaffold that lets AI work autonomously over long stretches while remaining controllable.
Demis Hassabis, the CEO of DeepMind, once said that leading laboratories with powerful mathematical and code tools are pulling ahead of other laboratories because "these tools will have a compound effect".
The AI co-mathematician is a direct manifestation of this assertion.
The future of mathematics may no longer be the solitary figure of a genius deep in thought in front of a blackboard.
Instead, it's human mathematicians and AI agents sitting side by side, with one responsible for inspiration and the other for verification, approaching the truth together in endless exploration.
This era of the "golden partnership" has arrived.
References:
https://x.com/pushmeet/status/2052812585804685322
https://arxiv.org/abs/2605.06651
https://epoch.ai/frontiermath/tiers-1-4?view=graph&tab=release-date&tier=Tier+4
https://arxiv.org/pdf/2605.06651
https://x.com/kimmonismus/status/2052849472586264997
This article is from the WeChat official account "New Intelligence Yuan". Author: New Intelligence Yuan. Editor: Rhino Solomon. Republished by 36Kr with authorization.