
Has NVIDIA's AI started taking over entire projects? SATLUTION's self-evolving codebase tops the SAT Competition.

机器之心 · 2025-09-11 11:59
Is the era of AI developing complex software approaching?

In recent years, work exemplified by Google's AlphaEvolve has shown that AI agents can optimize algorithms through iteration and even outperform humans on some small, self-contained programming tasks. Most of this work, however, has been confined to an "algorithmic core" of a few hundred lines of code, or to a single file.

Real-world software, by contrast, such as a top-tier SAT solver, is a large and complex system: hundreds of files, a sophisticated build system, and countless interdependent modules. Hand-building a championship-level solver not only demands deep domain expertise, it also yields diminishing returns on the effort invested.

To address this, researchers at NVIDIA Research proposed SATLUTION, the first framework to extend LLM code evolution from the "algorithm core" to the scale of a complete codebase. SATLUTION handles complex projects spanning hundreds of files and tens of thousands of lines of C/C++ code, and it has surpassed the human world champions on the Boolean satisfiability (SAT) problem, often called a "cornerstone of computational theory".

  • Paper title: Autonomous Code Evolution Meets NP-Completeness
  • Paper link: https://arxiv.org/pdf/2509.07367

The SATLUTION framework iteratively optimizes the SAT solver codebase by orchestrating LLM agents under strict correctness verification and distributed runtime feedback. Notably, along the way it also "self-evolves" its own evolution strategies and rules.

Starting from the codebase and benchmarks of the 2024 SAT Competition, the solvers evolved by SATLUTION not only defeated the human-designed champion of the 2025 SAT Competition, but also outperformed the champions of both the 2024 and 2025 competitions on the 2024 benchmark set.

SATLUTION's performance on the 2025 SAT Competition benchmark. Bar height is the PAR-2 score (a solver-performance metric; lower is better). The gradient-colored bars on the left are the solver family evolved by SATLUTION; their scores are markedly lower than those of the human-designed 2025 champion (blue) and runner-up (green).

How does SATLUTION work?

SATLUTION is built around LLM agents, a dynamic rule system, and a strict verification-and-feedback loop.

Dual-agent architecture

The system is driven by two LLM agents working in coordination, built on the Cursor environment and the Claude family of models; a schematic of one evolution cycle follows the two role descriptions below.

Planning agent: responsible for high-level strategy. At the start of the evolution run, it analyzes the starting solver codebase and its performance and proposes candidate directions for modification. In later cycles, it weighs the accumulated code changes, performance metrics, and failure history to draw up the evolution plan for the next iteration.

Coding agent: responsible for executing the concrete development tasks. It directly edits the C/C++ solver codebase according to the planning agent's blueprint. It also manages auxiliary tasks such as updating build-system configuration (e.g., Makefiles), fixing compilation errors, and debugging functional or runtime faults.
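
To make the division of labor concrete, here is a minimal Python sketch of one evolution cycle. The callables plan_fn and code_fn, the CycleState record, and the result-dictionary shape are all hypothetical stand-ins of ours; the paper implements the two agents on Cursor with Claude models and does not publish this interface.

```python
from dataclasses import dataclass, field

@dataclass
class CycleState:
    """Accumulated history the planning agent reasons over."""
    metrics: dict = field(default_factory=dict)   # e.g. PAR-2, solved counts
    failures: list = field(default_factory=list)  # records of failed attempts

def evolution_cycle(plan_fn, code_fn, state: CycleState) -> CycleState:
    """One iteration of the dual-agent loop.

    plan_fn: (metrics, failures) -> plan   -- stand-in for the planning agent
    code_fn: (plan) -> result dict         -- stand-in for the coding agent
    Both are hypothetical callables, not the paper's API.
    """
    plan = plan_fn(state.metrics, state.failures)  # high-level strategy
    result = code_fn(plan)                         # edits C/C++ sources, Makefiles
    if result["ok"]:
        state.metrics = result["metrics"]          # new performance baseline
    else:
        state.failures.append(result["error"])     # failure feeds the next plan
    return state
```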

Rule system: Guidance and constraints

The rule system is the key to keeping the evolution process efficient and stable. It gives the agents the guidance they need while exploring and cuts down on attempts in invalid or wrong directions.

Before evolution begins, the researchers seed the system with a set of static rules encoding basic domain knowledge and hard constraints. These include core SAT heuristic principles, strict correctness requirements (such as generating DRAT proofs for unsatisfiable instances), a unified codebase directory layout, and a detailed evaluation protocol.

Experiments show that without this initial rule set, agent performance drops significantly and the agents tend to drift into modifications that miss the target.

A core feature of the framework is that the rulebase itself evolves dynamically. After each evolution cycle, an analyzer reviews the cycle's compilation errors, verification failures, and newly observed failure modes and automatically proposes rule patches.

For example, after a failure the system can add a new "prohibited code pattern" to the rulebase, preventing the agents from repeating the same mistake in later cycles. The rule system thus evolves alongside the solver code, steadily improving the framework's overall efficiency and robustness; a toy illustration follows.
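
As an illustration only (the paper does not publish its rule format), a self-updating rulebase can be as simple as a list of directives that an analyzer appends to after each cycle. The rule texts below paraphrase the paper's description; the exact wording and the heuristic are ours.

```python
# Static seed rules encoding domain knowledge and hard constraints.
STATIC_RULES = [
    "Emit a DRAT proof for every UNSAT answer.",
    "Preserve the unified codebase directory layout.",
    "Follow the fixed evaluation protocol.",
]

def patch_rules(rules: list[str], failure_modes: list[str]) -> list[str]:
    """Turn observed failure modes into 'prohibited pattern' rules.
    Illustrative heuristic, not the paper's actual analyzer."""
    for mode in failure_modes:
        rule = f"Prohibited code pattern: {mode}"
        if rule not in rules:          # never duplicate an existing rule
            rules.append(rule)
    return rules

# Example: a cycle that broke proof logging adds a standing prohibition.
rules = patch_rules(list(STATIC_RULES),
                    ["disabling DRAT output to save runtime"])
```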

Verification and evaluation process

To ensure code quality and the correctness of its answers, every newly generated solver version must pass a strict pipeline.

  • Two-stage verification

The first stage is compilation and basic functional testing. The system compiles the new code and, on success, runs it on a suite of 115 simple CNF instances to catch basic problems such as compilation errors and segmentation faults.

The second stage is full correctness verification. Solvers that pass the first stage are run on a larger benchmark set with known answers, and every output is checked: a "satisfiable" (SAT) answer has its assignment verified, and an "unsatisfiable" (UNSAT) answer has its generated DRAT proof validated by an external checking tool.

Only solvers that clear both stages are deemed "correct" and move on to performance evaluation. A hedged sketch of this two-stage check appears below.
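
The sketch assumes the standard SAT Competition exit-code convention (10 for SAT, 20 for UNSAT) and an external DRAT checker such as drat-trim; the file layout, timeouts, and one-clause-per-line DIMACS parsing are simplifications of ours, not the paper's pipeline.

```python
import subprocess
from pathlib import Path

SAT_EXIT, UNSAT_EXIT = 10, 20  # SAT Competition exit-code convention

def stage1_smoke_test(solver: Path, easy_cnfs: list[Path]) -> bool:
    """Stage 1: the freshly built solver must terminate cleanly
    (no crash or segfault) on every simple instance."""
    for cnf in easy_cnfs:
        r = subprocess.run([str(solver), str(cnf)],
                           capture_output=True, timeout=60)
        if r.returncode not in (SAT_EXIT, UNSAT_EXIT):
            return False
    return True

def model_satisfies(cnf: Path, stdout: str) -> bool:
    """Check a claimed SAT assignment: every clause needs a true literal.
    Simplified DIMACS parsing (assumes one clause per line)."""
    model = set()
    for line in stdout.splitlines():
        if line.startswith("v"):  # solver's assignment ("v ...") lines
            model.update(int(t) for t in line[1:].split() if int(t) != 0)
    for line in cnf.read_text().splitlines():
        if not line.strip() or line[0] in "cp":  # skip comments/header
            continue
        if not any(int(t) in model for t in line.split()[:-1]):
            return False  # this clause has no true literal
    return True

def stage2_correct(solver: Path, cnf: Path, expected: str) -> bool:
    """Stage 2: verify the answer on an instance with a known result."""
    proof = cnf.with_suffix(".drat")
    r = subprocess.run([str(solver), str(cnf), str(proof)],
                       capture_output=True, text=True, timeout=5000)
    if r.returncode == SAT_EXIT and expected == "SAT":
        return model_satisfies(cnf, r.stdout)
    if r.returncode == UNSAT_EXIT and expected == "UNSAT":
        chk = subprocess.run(["drat-trim", str(cnf), str(proof)],
                             capture_output=True, text=True)
        return "s VERIFIED" in chk.stdout  # proof accepted by the checker
    return False
```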

  • Distributed evaluation and feedback

Verified solvers are then deployed on a cluster of 800 CPU nodes and evaluated in parallel on the complete SAT Competition 2024 benchmark set (400 instances). This degree of parallelism completes a full evaluation in about an hour, giving the agents near-real-time performance feedback.

The feedback is fine-grained: the number of SAT/UNSAT instances solved, the distribution of solve times across time bands, memory usage, and, as the core driving metric, the PAR-2 score (an average-runtime metric that charges a heavy time penalty for every unsolved instance); see the sketch below.
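
PAR-2 ("penalized average runtime, factor 2") is simple to state: solved instances contribute their runtime, and each unsolved instance is charged twice the time limit. A minimal sketch, assuming the competition's usual 5000-second wall-clock limit:

```python
def par2(runtimes: list[float | None], timeout: float = 5000.0) -> float:
    """PAR-2 score: mean runtime, with each unsolved instance
    (None, or over the time limit) charged at 2x the timeout."""
    total = 0.0
    for t in runtimes:
        total += 2 * timeout if t is None or t > timeout else t
    return total / len(runtimes)

# Example: three solved instances and one timeout over a 4-instance set.
print(par2([120.0, 950.5, 4800.0, None]))  # (120 + 950.5 + 4800 + 10000) / 4
```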

Experimental results

Over 70 evolution cycles, SATLUTION showed a clear and robust trajectory of performance improvement.

According to the performance-tracking chart for the 2024 benchmark set (Figure 8 in the paper), the system improved rapidly over the first 5-10 cycles, mainly by integrating the complementary strengths of multiple initial seed solvers.

Improvement then slowed but continued, indicating that the agents had begun to tackle subtler, more complex optimizations.

Around the 50th iteration, the solver evolved by SATLUTION had begun to outperform the human-designed 2025 champion on the 2024 benchmark.

By the end of the 70th iteration, its performance had stably surpassed every baseline solver in the comparison. The entire run was highly stable: thanks to the verification safeguards, there was no serious performance regression.

The self-evolution performance curve of SATLUTION.

The entire SATLUTION self-evolution run cost less than $20,000. By contrast, developing a competitive SAT solver typically takes human experts months or even years of sustained engineering, while SATLUTION surpassed the top human level within a few weeks.

For more details, please refer to the original paper.

This article is from the WeChat official account “Almost Human” (ID: almosthuman2014), author: AI enthusiast. Republished by 36Kr with authorization.