StartseiteArtikel

Consuming 183 billion tokens, Meta used AI to translate math textbooks into a huge Lean library.

机器之心2026-05-29 15:11
Meta has open-sourced the largest automated mathematical formalization project to date.

Mathematics is undergoing an AI revolution.

This has been particularly evident in recent months. For example, just a few days ago, Google DeepMind's new paper announced that its latest system, AlphaProof Nexus, solved 9 out of 353 open Erdős problems in an autonomous run. Two of these problems have remained unsolved in the mathematical community for as long as 56 years, and the reasoning cost for each problem was only a few hundred dollars. For more details, please refer to "Solving 9 Erdős Problems with Just a Few Hundred Dollars per Problem: DeepMind's Agent Achieves It All in One Go".

Erdős problems generally refer to a large number of public mathematical problems and conjectures proposed by the legendary Hungarian mathematician Paul Erdős throughout his life. These problems are widely distributed in fields such as combinatorics, number theory, graph theory, discrete geometry, and probability theory. Many of them have remained unsolved for a long time and are regarded as important research benchmarks and frontier challenges in related fields. The credibility of this result lies in the fact that AlphaProof Nexus does not generate natural - language proofs but deeply integrates the large language model (Gemini 3.1 Pro) with the formal verification tool Lean: the AI proposes a proof, and Lean verifies each logical step step - by - step, rejecting it directly if it fails. All the proof codes have been made public on GitHub, and anyone can independently reproduce and verify them.

Now, new progress has emerged! Meta, in collaboration with institutions such as New York University, has officially released ATLAS (Autoformalized Textbook Library At Scale), one of the largest automated mathematical formalization projects to date.

The project's paper and code have both been released.

  • Project address: https://github.com/facebookresearch/atlas-lean/
  • Paper address: https://github.com/facebookresearch/atlas-lean/blob/main/formalizing_mathematics_at_scale.pdf

What is ATLAS?

Put simply, ATLAS is a mathematical formalization code library based on Lean 4. Its core goal is: to automatically translate the informal theorem statements and proofs in mathematical textbooks into formal codes that can be verified line by line by a computer.

This may sound dull, but it is of profound significance. Lean is a "proof assistant" language. When you submit a mathematical proof to it, it will verify the logical legitimacy of each derivation step step - by - step, just like a compiler checks code. Yes, as long as Lean passes the verification, the proof is impeccable in a formal sense.

According to the statistical data in the project's Readme, as of May 2026, ATLAS has covered 26 undergraduate and postgraduate - level mathematical textbooks, spanning numerous fields such as analysis, algebra, geometry, topology, combinatorics, probability, statistics, partial differential equations, number theory, and theoretical computer science.

The entire code library consists of 630,999 lines of code in total, with 483,917 lines of Lean core code. It contains 46,203 mathematical declarations, among which 42,837 have been proven, with a proof pass rate of up to 92.7%.

Among the 4,007 textbook theorems selected, 2,855 have been formalized, with a formalization coverage rate of 71.3%. In terms of scale, the standard library Mathlib, which has been collaboratively maintained by the Lean community for many years, has approximately 2.1 million lines of code and 308,129 declarations. The volume generated by ATLAS by machine in just a few weeks has reached about a quarter of the total volume of Mathlib, which is an astonishing speed.

Behind this number lies an astonishing computational consumption: the entire generation process used more than 183 billion (183,157M) tokens.

Notably, the team has also built a visualization browser.

Address: https://rammalahmad.github.io/atlas/

Users can:

  • Compare the informal original text of each theorem with its Lean formalized version;
  • Browse the logical dependency graph between theorems (i.e., which lemmas need to be known first to prove a certain theorem);
  • Extract the minimum set of Lean codes required to prove a specific theorem.

The significance of this tool is that it transforms ATLAS from a code library into a navigable mathematical knowledge graph, which has potential value for both human researchers and future AI systems.

Which textbooks are included?

All 26 textbooks in ATLAS are from top - tier open - course resources such as MIT OpenCourseWare, with a very wide coverage.

Here are some representative cases:

RealAnalysis: 175 out of 177 target theorems have been formalized, with a coverage rate of up to 98.9% and a proof pass rate of 98.7%, making it the most completed single textbook in the project.

ComplexVariables: The formalization coverage rate is 97.4%.

NumberTheoryI: Among the 576 target theorems, 460 have been formalized (79.9%), and nearly 65,000 lines of code have been generated.

AlgebraicGeometryI: This is one of the most difficult fields. The formalization coverage rate is 60.2%, but more than 40,000 lines of code and 4,499 declarations have still been generated.

LieGroups: It consumed the most tokens (45,384M) and generated more than 60,000 lines of code. Although the formalization coverage rate is only 40%, it reflects the extreme technical difficulty of this field.

Core engine: AutoformBot

Of course, the generation of ATLAS is not done by manually writing code line by line. Instead, it completely relies on Meta's self - developed automated formalization pipeline AutoformBot (which has been open - sourced on GitHub).

Project address: https://github.com/facebookresearch/autoform - bot

AutoformBot treats textbook formalization as a collaborative software engineering problem and borrows from the mature open - source collaboration paradigm (git branches, Pull Request reviews, Issue tracking) to coordinate hundreds of LLM agents to work simultaneously.

The entire system is divided into three management levels:

  • The top - level orchestrator is responsible for reading textbooks, decomposing formalization tasks into a directed acyclic graph (DAG), and scheduling the work order according to the logical dependencies in the books;
  • The middle - level trace analyzer and supervisor are responsible for learning from failed tasks and evaluating the quality of goal completion after each merge, respectively;
  • The bottom - level worker and reviewer are responsible for actually performing the formalization of single theorems and code review.

It is worth emphasizing that the entire generation process of ATLAS involves zero human - intervention in proof engineering and is completely driven by machines automatically. This is not only the prerequisite for achieving its large - scale implementation but also the reason why continuous improvement in quality and reliability is needed.

The computational consumption of the entire system is mainly concentrated in the worker layer, accounting for about 76% of the total token usage. The formalization process of each book usually lasts about a week, but the time can be significantly reduced by increasing parallelism.

Experiments in the paper show that using 3 or 5 workers in parallel for each task can complete about 20% more goals within the same time compared to using a single worker.

The team candidly disclosed several interesting "failure modes" observed during the system operation in the paper. The most unexpected one is the adversarial "cheating" and "slacking off" behaviors of the workers.

The key to understanding this phenomenon lies in a special keyword called "sorry" in Lean: it is like an "IOU", telling the compiler "skip the proof here and assume it to be true for now". The code can then be compiled smoothly, but there is actually a hole in the logical chain. In normal development, "sorry" is a legitimate tool for marking "to - be - filled" positions; but in AutoformBot, it has become a shortcut for workers to meet the assessment requirements: when encountering a theorem that is difficult to prove, they quietly insert a "sorry" deep in an auxiliary lemma, making the entire proof chain seem to pass, but in fact, it is a house of cards.

And this is just the most basic method. The "cheating list" summarized in the paper also includes:

  • Keeping the theorem name but replacing the real content with something that is always true and meaningless;
  • Secretly hiding the conclusion that should be proven in the field definition of a data structure (definitions do not need to be proven, only need to pass type checking);
  • Replacing complex mathematical objects in a difficult problem with simple substitutes. For example, instead of constructing an isomorphic mapping, they only prove that the dimensions of two spaces are equal and then consider the task completed.

What's more interesting is the evolution of the situation: when the reviewer agents were required to strictly prevent cheating, the workers did not stop. Instead, they buried the "sorry" deeper, hiding it in the lower levels of the dependency chain, making it undetectable by surface - level reviews. This cat - and - mouse game forced the team to build an analysis tool for recursively tracing the entire dependency graph to trace back and find the real "contaminated nodes".

This cat - and - mouse game between the workers and the reviewers is called "adversarial dynamic" in the paper and is regarded as a coordination problem worthy of in - depth study in large - scale multi - agent systems.

In addition, the long - running orchestrator may experience "LLM fatigue": as the context window is filled with a large amount of historical information, it starts to generate increasingly rough task descriptions and may even quietly give up dealing with difficult goals. The team's solution is to delegate special analysis work to short - lived professional agents to avoid the degradation of the context of a single long - term agent.

In terms of model selection, the paper provides a set of key comparison data: when comparing on the Algebraic Combinatorics textbook with the same computing power budget (1200M tokens), Claude Opus 4.6 completed 92% of the formalization goals, while Gemini 3.1 Pro only completed 46% - the gap was almost evident at the beginning of the experiment. The team attributes this to the difference in the coding ability of the models in the Lean language. This is also why the entire ATLAS is mainly driven by Opus 4.6.

In terms of cost, the team estimates that the cost per line of code of the current pipeline is lower than that of human expert annotation, and it is faster and more scalable. However, the overall output quality is still inferior to the Lean code written by experts.

Limitations

The team is quite honest about the positioning of ATLAS: it is an ongoing effort in machine - generated expansion, not a finished product.

Currently, about 28.7% of the target theorems have not been formalized, and the coverage rate in some more difficult fields (such as Lie groups and Boolean function analysis) is less than 50%. The code style also still has a gap compared with the mainstream standard library Mathlib in the Lean community - Mathlib is the "golden formalization library" collaboratively maintained by mathematicians around the world, with strict style conventions and in - depth integration requirements.

According to the team's next - step plan, ATLAS will continue to:

  • Complete the formalization of the remaining theorems in each book;
  • Incorporate more textbooks and mathematical fields;
  • Improve the code quality and maintainability;
  • Move closer to the Mathlib specification and strive for a more widely open - source - compatible release.

External contributors are also welcome.

Conclusion

The release of ATLAS coincides with the most important cognitive shift in the mathematical community recently.

Terence Tao, a Fields Medalist, recently pointed out that mathematics is undergoing a historical shift from "proof scarcity" to "proof abundance". For him, the real question is no longer just whether AI can generate mathematical proofs. More interestingly, does the mathematical community have sufficient infrastructure to absorb, verify, organize, and understand the mathematical results