32B Outperforms 671B: OProver, Fully Open - Source Math Theorem - Proving Model with M - A - P Architecture, Wins First in 3/5 Benchmark Evaluations

32B refreshes the SOTA of mathematical theorem proving

Formal theorem proving has always been recognized as the most rigorous reasoning touchstone for LLMs. Every step of the derivation must be verified by the Lean 4 kernel machine.

In the past two years, the open - source community has achieved continuous improvements in benchmarks such as MiniF2F and PutnamBench. However, the growth paths are becoming increasingly similar: expanding the model, expanding the data, and overlaying retrieval and multi - round correction during the deployment phase.

A key problem persists. Most of the retrieval signals, compiler feedback, and failure repairs are only integrated as external processes during deployment. The model does not systematically learn how to utilize these signals during the training phase, resulting in a "strategy misalignment" between training and deployment.

To address this challenge, the M - A - P open - source community and teams such as Nanjing University proposed OProver —

A Lean 4 theorem - proving framework that directly internalizes retrieval enhancement, compiler feedback, and multi - round repair into the training strategy.

In five Lean 4 whole - proof prover evaluations, OProver - 32B achieved three first - place and two second - place rankings:

It led LongCat - Flash - Prover w/ TIR in MiniF2F (93.3), ProverBench (58.2), and PutnamBench (11.3), and outperformed the 671B DeepSeek - Prover - V2 in all five evaluations.

The research team also open - sourced the OProofs corpus with 1.76M formal statements and 6.80M compiler - verified proofs, as well as 7 model weights of 8B/32B.

The code, weights, and training scripts have been fully open - sourced.

Strategy Misalignment: The Core Contradiction between Training and Deployment

In recent years, Lean 4 prover systems (such as Goedel - Prover - V2, DeepSeek - Prover - V2, and Kimina - Prover) have pushed the Pass@32 to a relatively high level on MiniF2F. Meanwhile, some work has started to introduce retrieval, compiler feedback, or self - correction.

The problem is that these signals are mainly used as enhancement processes during the deployment phase, attached externally to the already trained prover, rather than being incorporated into the learning objectives from the training phase.

This results in a misalignment:

During the training phase, the model mainly sees clear supervision pairs of theorem → verified proof.
During the deployment phase, the system provides the retrieved relevant proofs, previous failed attempts, and Lean compiler feedback to the model again, requiring multi - round repairs.

The core idea of OProver is to align the training objectives with the proof process during deployment. It enables the model to learn how to execute the agentic refinement loop during the training phase, incorporating multi - round correction, retrieval of relevant proofs, and compiler feedback as part of the training strategy, rather than an external wrapper during the deployment phase.

Lightweight and End - to - End Trainable

Deployment Phase: Finite - Round Repair Cycle

OProver models theorem proving as a finite - round repair cycle.

The strategy generates the next proof attempt based on the target formal statement, the top - k compiler - verified proofs retrieved from the memory bank, the previous proof attempt, and the diagnostic information returned by the Lean 4 compiler. If any round passes, the entire trajectory is considered successful.

Training Phase: Two - Stage Training

Continuous Pretraining (CPT): Pretrain on a mixed corpus of approximately 65B tokens, where about 30% comes from Lean 4 data in OProofs, 20% is code data (OpenCoder), 40% is mathematical corpus (Nemotron - Math - 4 - Plus), and 10% is long CoT data.
Iterative Post - training: Alternate between agentic proving rollout, SFT (based on round - level repair samples), and RL (based on the GSPO algorithm and a set of difficult problems).

The key design is that the retrieval results, failed attempts, and compiler feedback are no longer just external processes temporarily integrated during the deployment phase but are incorporated into the proof strategy that the model needs to learn.

Co - evolution of Data and Model

The OProofs corpus and the prover strategy promote each other during the iteration.

In each iteration, the new verified proofs generated by the current prover on the question bank are added to OProofs and indexed into the retrieval memory bank.

The repair trajectory becomes the training sample for the next round of SFT. The unsolved "difficult problem group" provides training signals for the next round of RL.

The data, training, and strategy form a continuously evolving closed - loop.

OProofs: Lean 4 Corpus for Agentic Prover

The research team also constructed and open - sourced OProofs, which contains approximately 1.76M formal statements and 6.80M compiler - verified proofs.

Among them, 4.29M proofs retain the retrieved relevant proof context, and 859K samples contain the Lean compiler feedback from previous failed attempts. The model not only sees "what the final correct proof is" but also learns "how to use the retrieval results and compiler feedback to continue the repair after a proof failure."

OProofs consists of two construction branches.

1. Reproving of Public Resources

Starting from public Lean resources such as NuminaMath - LEAN, Lean - Workbook, and Leanabell - Prover - FormalStmt, after cleaning and deduplication, proofs are regenerated and verified through agentic proving. Meanwhile, retrieval context, failed attempts, and repair trajectories are collected.

2. Natural Language to Formalization

Mathematical statements are mined from Common Crawl and GitHub, automatically formalized into Lean 4 by CriticLean, and then proofs are generated and verified through the agentic proving process.

In terms of coverage, OProofs spans multiple mathematical directions: 60.1% in algebra, 13.7% in analysis, 13.0% in number theory, and 6.8% in geometry. The difficulty distribution is mainly at the elementary (27.4%) and high - school (48.9%) levels, and also includes 18.9% of undergraduate - level and 4.8% of postgraduate - level problems.

Three First - Places and Two Second - Places in Five Evaluations

The research team evaluated OProver on five Lean 4 benchmarks: MiniF2F, MathOlympiad, ProofNet, ProverBench, and PutnamBench. By default, Pass@32 is reported, based on an unbiased estimate of n = 64 independent multi - round rollouts.

Within the scope of open - weight whole - proof provers, OProver - 32B has three key conclusions:

1. 32B Outperforms 671B Comprehensively

OProver - 32B outperforms DeepSeek - Prover - V2 (671B) in all five evaluations and leads LongCat - Flash - Prover w/ TIR (560B) in MiniF2F (93.3), ProverBench (58.2), and PutnamBench (11.3).

2. 8B Ties with 32B

OProver - 8B outperforms Goedel - Prover - V2 - 32B in all five benchmarks, with 4 times fewer parameters.

3. Iterative Post - training Brings Continuous Gains

On MiniF2F - Test, OProver - 8B increased from 79.5 to 91.8 (+12.3), and OProver - 32B increased from 84.7 to 93.3 (+8.6).

Ablation Experiment: Synergistic Contribution of Retrieval and Compiler Feedback

Removing multi - round compiler feedback leads to the largest decline: OProver - 32B dropped from 11.3 to 7.0 on PutnamBench and from 33.2 to 25.8 on ProofNet.

After further removing retrieval, the performance continued to decline to 5.9 and 24.7.

This shows that the improvement does not come from simple best - of - N sampling but from the synergy between retrieval - enhanced proof generation and multi - round repair guided by compiler feedback.

Among them, the Lean compiler feedback provides the main repair signal, and the retrieval context provides relevant proof structures and referenceable proof fragments.

Expansion during Testing: Stable Conversion with More Inference Budgets

As the inference budget increases from 8 to 256, OProver - 32B shows a stable improvement on all five benchmarks: from 87.5 to 92.8 on MiniF2F, from 15.5 to 22.0 on MathOlympiad, from 25.6 to 32.8 on ProofNet, from 51.3 to 56.9 on ProverBench, and from 6.4 to 11.3 on PutnamBench.

The optimal budget allocation is related to the benchmark difficulty. Most benchmarks prefer to increase the refinement depth, while difficult problems with low success rates like PutnamBench need to balance between repair depth and parallel exploration.

Open - Sourcing and Release

The research team open - sourced the OProver model, data, and training code, covering checkpoints at different training stages, the OProofs corpus, and the training pipeline.

• m - a - p/OProver - 32B / OProver - 8B — The final models

• m - a - p/OProver - 32B - Base / Round1 — Checkpoints at each stage of 32B

• m - a - p/OProver - 8B - Base / Round1 / Round2 — Checkpoints at each stage of 8B

• m - a - p/OProofs — 1.76M statements / 6.80M proofs / 1.06M trajectories

Of course, OProver currently mainly focuses on Lean 4 whole - proof proving.

What is worth observing in the future is whether this agentic refinement framework can be migrated to Coq, Isabelle, and engineering - level formal methods tools, and how long the performance improvement will last in a longer cycle of data - model co - evolution.

Paper: https://arxiv.org/abs/2605.17283

Code: https://github.com/multimodal - art - projection/OProver

Model and data: https://huggingface.co/collections/m - a - p/oprover

This article is from the WeChat official account "QbitAI", written by the OProver team and published by 36Kr with authorization.

该文观点仅代表作者本人，36氪平台仅提供信息存储空间服务。

32B outperforms 671B, OProver, a fully open-source mathematical theorem-proving model with the M-A-P architecture, secures first place in three out of five benchmark evaluations.