DeepSeek V4 Achieves Mathematical Proofs with 500x Cost Advantage: Agent System Breaks Multiple Records

Produced by the team of Princeton University.

In the field of mathematics, I've recently been truly shocked by the outputs of AI.

On May 20, 2026, OpenAI announced that its internal reasoning model had successfully disproven the "Unit Distance Conjecture" proposed by mathematician Paul Erdős in 1946 - a core open problem that had puzzled the field of discrete geometry for nearly 80 years.

Timothy Gowers, a Fields Medalist, said about this: "If AI can prove the Unit Distance Conjecture, it would be a landmark event of epoch - making significance for the mathematical and scientific communities... If a high - quality mathematical paper like this were submitted by a human to the top academic journal Annals of Mathematics, I would unhesitatingly recommend its acceptance."

In the same month, another Fields Medalist, Terence Tao, gave a speech titled "New Mathematical Workflow" at Stanford University. He announced that he had given up on keeping up with all new proofs in real - time because the speed at which AI generates proofs has far exceeded human digestion capacity. His judgment was more straightforward: mathematics is moving from an era of "proof scarcity" to an era of "proof surplus", and the real bottleneck has shifted from "how to generate proofs" to "how to verify and understand proofs".

This statement is worth pausing to think about. AI is producing mathematical conclusions at a speed that humans can't keep up with, and it's becoming increasingly difficult for us to judge whether these conclusions are really valid. Mathematics, a discipline that values strictness the most, is facing an unprecedented verification crisis.

A possible solution is: Let AI verify AI's proofs.

This is exactly the core idea of "formal theorem proving". Lean is one of the most mainstream formal proof languages at present. It requires every logical step to be written in a machine - verifiable way. Once the compilation passes, the correctness of the proof is guaranteed by the compiler and does not depend on anyone's judgment. However, this also brings new challenges: it is much more technically difficult for AI to directly generate a complete proof that can pass the Lean compiler than to generate natural - language mathematical derivations.

Recently, a research team from Princeton University released a new paper, proposing an agent framework called Goedel - Architect. The core model they used is the domestic open - source large model DeepSeek - V4 - Flash.

Paper: Goedel - Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement

Paper link: https://arxiv.org/abs/2606.06468

First, let's look at the capabilities of this system.

In the field of formal theorem proving, there is a standard test set called PutnamBench, which contains 672 questions from the Putnam Undergraduate Mathematics Competition. The cost differences among different systems to solve these 672 questions are quite significant.

Previously, one of the most competitive open - source pipelines was Hilbert, powered by Google Gemini 2.5 Pro. Hilbert spent about $170,000 just on API call fees to complete these 672 questions. In contrast, Goedel - Architect spent $294 to complete the same evaluation. The difference between the two is about 500 times.

More notably, the pass rate of Goedel - Architect on PutnamBench (75.6%) is higher than that of Hilbert (70.0%). The new method is not only cheaper but also more effective.

The name of this system, "Goedel - Architect", pays tribute to Kurt Gödel, the famous mathematician who proved the fundamental limitations of mathematics. There is a deep historical connection between Princeton and Gödel: Gödel spent his later years at the Institute for Advanced Study in Princeton, and this research team is from the Princeton Language and Intelligence Center (PLI) at Princeton University.

The founding director of PLI is Sanjeev Arora, an authority in the field of computational complexity theory, who won the ACM Computing Award in 2011. He has long been committed to exploring the question of "whether AI can become a super - human mathematician". He publicly elaborated on this path at the Heidelberg Laureate Forum: using a formal system like Lean as an anchor to expose the "hallucination" tendency of AI, because the compiler will not accept a proof with logical loopholes.

Danqi Chen also co - leads this team. She is also from the Department of Computer Science at Princeton University, with over 90,000 citations on Google Scholar. She graduated from Tsinghua University for her undergraduate studies and obtained her Ph.D. from Stanford University, under the supervision of Christopher Manning. One of her most well - known early works was co - developing the dependency syntactic parsing algorithm underlying Google SyntaxNet with Manning. After joining Princeton, she has long focused on the training and adaptation of language models, knowledge representation and reasoning, and the large - scale construction of question - answering systems.

This team has previously released two generations of Goedel - Prover, an open - source model series specifically for formal theorem proving, with the pass rate on the MiniF2F benchmark increasing from the initial 60% to 90%. Goedel - Architect is their latest exploration in the dimension of "how to organize the proof process with an agent framework".

In the new research on Goedel - Architect, the key point lies in the "blueprint" concept.

The architecture of Goedel - Architect.

Imagine a large - scale construction project. Different engineering teams start work simultaneously, but only after having a complete construction drawing: which area should be built first, which structure depends on which foundation, and which parts can be advanced in parallel. Without a construction drawing, the efficiency of the project will be very low.

The same is true for formal theorem proving in essence. Many existing systems use the "recursive decomposition" method: when encountering a difficult problem, they break it down into smaller sub - goals and then recursively decompose them, forming a top - down tree structure. The problem with this method is that once a branch reaches a dead end, the work of the entire tree may be in vain, leading to an inefficient cycle.

Goedel - Architect does it differently. Before actually starting the proof, the system first generates a "blueprint": a directed acyclic graph that contains all the definitions and lemmas required to reach the final theorem, as well as the dependencies between them. Each node is a precisely stated lemma, and each edge indicates that "this lemma depends on that result". This graph is a global view of the entire proof strategy.

With the blueprint, the system can distribute each unproven node in the graph to the Lean provers for parallel processing. Each prover only sees the lemma it is responsible for and the upstream results it depends on, without being interfered by other information. After a round of parallel proofs, some nodes are successfully proven (marked in green), some fail (marked in blue), and some are proven to be false. That is, the lemma itself is false (marked in red).

Failure is not the end but a "diagnostic signal". This is the third core part of the entire framework: blueprint refinement.

When a lemma node cannot be proven, the system does not simply record the failure. The prover is required to write a structured "post - mortem analysis report", which contains three parts: a diagnosis of the failure cause (is the proposition itself wrong, or is the proof idea too difficult?), the strategies tried and the positions where they got stuck, and the proposed repair solutions.

The system designs two types of processing paths for failure modes.

The first type is "wrong proposition": if a lemma is proven to be false, the system will extract the specific reason for the counter - example and modify the statement of this node in the next iteration. There is a case in the paper: when dealing with a question from the Putnam competition in 1989, the blueprint proposed an auxiliary lemma about binary representation, "Multiplying a number by 2 is equivalent to appending a zero at the end of the binary expansion". The prover found a counter - example when n = 5: in Lean's mathematical library, binary bits are stored starting from the least significant bit, so multiplying by 2 actually adds a zero in front of the most significant bit, not at the end. A single - word difference makes the entire proposition wrong.

The system records this "wrong - direction" diagnosis, and in the next iteration, directly changes the lemma to the correct version and propagates this correction to all nodes that depend on this lemma.

The second type is "proof too difficult": if a lemma is logically true but the prover cannot complete the proof within the token budget, the system will ask it to write down "how to decompose this lemma if there are more steps". This decomposition suggestion is incorporated into the blueprint in the next iteration, splitting the original difficult node into several easily - handled sub - nodes. The paper introduces a question from the Putnam competition in 1985: a lemma about the roots of a fifth - degree polynomial. The prover thought it was correct but was unable to complete the proof, so it suggested classifying and discussing it in four cases. In the next iteration, this decomposition was accepted, and all sub - cases were successfully proven, thus solving the problem.

The nodes that have been successfully proven are retained in the iteration. The whole process is like a jigsaw puzzle that is being gradually completed. Each iteration continues to advance on the basis of the completed part, rather than starting from scratch.

The team tested Goedel - Architect on five benchmarks.

MiniF2F - test is the most mature high - school competition mathematics test set, containing 244 questions. Goedel - Architect solved 242 of them (99.2%) under pass@1, which is on par with the previously strongest open - source system. The remaining two questions (both are IMO difficult questions) were also solved with the help of natural - language proof assistance, making the team the first system to complete all 244 questions in MiniF2F - test.

The performance on PutnamBench has been mentioned above: a pass@1 pass rate of 75.6%, exceeding that of Hilbert (70.0%) running with a budget of $160,000. With the help of natural - language assistance, the pass rate further increased to 88.8% (597/672), and the total cost was still less than $1000.

On the newer competition questions, the system solved 4 out of 6 questions in IMO 2025 and 11 out of 12 questions in Putnam 2025. It's worth mentioning USAMO 2026: the question - setting time of this set of questions is later than the training cut - off date of all models, so the possibility of "memorizing the answers" can be excluded, making it a truly contamination - immune test. Goedel - Architect solved 3 out of 6 questions in it.

The system has an optional auxiliary mechanism: when generating the initial blueprint, a natural - language proof idea can be provided as a structural reference. This natural - language proof is generated by a model with larger parameters (such as Gemini 3.1 Pro), but it only serves as a "scaffolding": providing a high - level strategy framework, while the specific formal implementation is still completed by Goedel - Architect itself.

For most questions, this assistance is not necessary. However, for difficult questions with "non - local structures", such as those requiring cyclic summation, parity chain derivations, or abstract algebraic structures, deriving the dependency graph from the formal proposition alone will become a bottleneck. In these cases, the structural guidance provided by natural language is decisive. The team conducted a controlled experiment on 9 such difficult questions: without using natural - language assistance, each question was run 4 to 12 times, and none of them were successful; after adding the assistance, all of them were solved.

The paper conducted a controlled - variable comparative experiment, and the core conclusion is: The improvement comes from the pipeline design, not just a better model.

The team ported Hilbert (a system using the recursive decomposition strategy) to the same DeepSeek - V4 - Flash backbone and could only achieve 84.4% on MiniF2F, while Goedel - Architect achieved 99.2% on the same backbone. On a 200 - question subset of PutnamBench, the tool - integrated reasoning method with the same backbone achieved 54.5%, while Goedel - Architect achieved 76.0%, and it consumed fewer tokens per question.

The recursive decomposition strategy repeatedly tries on a single sub - goal and sometimes gets stuck in a dead - end cycle. The global blueprint strategy allows the system to "step back" and look at the whole graph. During the parallel attempt, the failure of any node can be fed back to the adjustment of the entire strategy without waiting for the entire recursive tree to finish.

The technical significance of this work is clear: an open - source framework with extremely low cost has reached the level that only expensive closed - source systems could previously achieve on the core benchmarks of formal theorem proving.

The value of the formal proof system lies in providing an infrastructure that makes AI's mathematical outputs "credible". When AI claims to have proven an important conjecture one day, the judgment of the Lean compiler is more certain than any peer review.

Now, Goedel - Architect has further reduced the access threshold of this infrastructure by about two orders of magnitude.

This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014), author: AI Problem Solver. Republished by 36Kr with authorization.

该文观点仅代表作者本人，36氪平台仅提供信息存储空间服务。

DeepSeek V4 does mathematical proofs with a 500x cost advantage: The agent system has refreshed multiple records