
With a Chinese researcher as first author, Meta and others replicate the AlphaZero legend: AI leaves humans behind, training itself all the way to godhood.

新智元2025-12-29 10:43
AI achieves self-evolution of code through self-play without human data.

When a model learns to "fight against itself", the era of mediocre imitation ends, and a genuine silicon-based programming miracle begins.

Has the AlphaZero moment in the programming world finally arrived?

Back then, AlphaZero discarded human game records and, purely by "playing against itself", saw through principles of the game that humans had spent thousands of years accumulating.

Today, the fatal flaw of AI programmers is precisely that they are too "human":

An AI that grows up learning from human code is doomed never to break through human mediocrity.

Recently, a research team from Meta, UIUC, and CMU has been trying to replicate the AlphaZero legend with its latest work, Self-play SWE-RL (SSR):

Abandon human teachers; reject imitation.

Paper link: https://arxiv.org/pdf/2512.18552

Just hand the AI a code repository and let it fight itself to the death in the roles of "destroyer" and "fixer".

In this self-play game that requires no human intervention, a real programming miracle, one that surpasses human experience, is being born.

The "Spoon-Fed" AI and the Ceiling of Human Data

From Devin to OpenDevin, and on to the in-house code assistants at major companies, these tools can indeed take a great deal of tedious work off programmers' hands.

But there is an invisible bottleneck.

The mainstream training methods today, whether SWE-RL or DeepSWE, essentially teach the AI to "imitate".

This paradigm of relying on human knowledge has three fatal flaws:

  • Insufficient data: high-quality bug-fixing data with test cases and detailed descriptions is genuinely scarce.
  • Unreliable quality: issues written by humans are often vague, and the accompanying test cases are not necessarily sound, which injects a great deal of noise into the training signal.
  • Low ceiling: an AI that only imitates humans can at best become a mediocre junior programmer.

This is why the paper calls human supervision a fundamental obstacle on the road to superintelligence:

once the training signal must be supplied by humans, it is hard to imagine it scaling indefinitely toward an "open-ended, self-evolving" regime.

The Core Gameplay: The "Fight Club" in the Code Sandbox

The core concept of SSR is very simple yet extremely ingenious: Self-Play.

In this system, the same LLM is given two completely different and opposing roles.

Role 1: Destroyer (Bug Injection Agent)

Its task is not to write code but to cause damage.

Given a normal open-source project (say, a Python library), it must infiltrate the codebase, study its logic, and then plant a bug.

But the destroyer cannot act recklessly (say, by deleting every file). It must produce a complete "crime kit" (Artifacts); a sketch of one such artifact follows the list:

  • bug_inject.diff: the actual damage patch that breaks the code.
  • test_script.sh: a script that runs the tests and demonstrates the bug's existence.
  • test_files.txt: specifies which test files are used to verify the bug.
  • test_parser.py: a parser that translates raw test output into JSON that machines can understand.
  • test_weaken.diff: modifies or deletes existing test cases so that the bug does not trigger failures under the current test suite.
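To make the artifact contract concrete, here is a minimal sketch of what a test_parser.py might look like, assuming the test script runs pytest in verbose mode and the parser only needs to emit per-test pass/fail statuses as JSON. The output schema is an assumption for illustration, not the paper's exact format.

```python
# Hypothetical test_parser.py sketch (the JSON schema is assumed, not from the paper).
# Reads raw pytest verbose output on stdin and prints per-test results as JSON.
import json
import re
import sys

# Matches pytest -v lines such as "tests/test_core.py::test_add PASSED"
LINE_RE = re.compile(r"^(?P<name>\S+::\S+)\s+(?P<status>PASSED|FAILED|ERROR|SKIPPED)")

def parse(stream):
    results = []
    for line in stream:
        match = LINE_RE.match(line.strip())
        if match:
            results.append({"name": match.group("name"), "status": match.group("status")})
    return {"tests": results}

if __name__ == "__main__":
    print(json.dumps(parse(sys.stdin), indent=2))
```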

In SSR, defect generation is the destroyer agent's job: it uses tools to interact with the execution environment, produces the defect artifacts, and verifies their consistency before handing them to the fixer agent.

What makes a destroyer agent excellent is its ability to generate diverse defects that capture the complexity of real software development, thereby training the fixer agent across a wide range of debugging and engineering scenarios.

Role 2: Fixer (Bug Solving Agent)

After the destroyer finishes its work, it's the fixer's turn to take the stage.

The fixer faces a repository that has been injected with a bug and whose tests have been "weakened".

Its task is genuinely hard: it cannot see how the bug was injected. Like a detective, it must read the code, run the tests, analyze the error messages, and finally write a fix patch.

Through the adversarial interplay between the destroyer and fixer roles, the model achieves closed-loop evolution; a sketch of the fixer's outer loop follows.
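To ground the description, here is a minimal sketch of that outer loop. All names here (run_tests, propose_patch, apply_patch) are assumptions for illustration, not SSR's actual API.

```python
# Minimal sketch of a fixer-agent loop (helper names are assumed, not SSR's API).
import subprocess

def run_tests(repo_dir: str) -> tuple[bool, str]:
    """Run the destroyer-provided test script; return (all_passed, raw_output)."""
    proc = subprocess.run(["bash", "test_script.sh"], cwd=repo_dir,
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def fix_bug(repo_dir: str, propose_patch, apply_patch, max_rounds: int = 8) -> bool:
    """Detective loop: run tests, read failures, propose a patch, apply, repeat."""
    for _ in range(max_rounds):
        passed, log = run_tests(repo_dir)
        if passed:
            return True   # maps to the fixer's +1 reward
        patch = propose_patch(repo_dir, log)  # LLM call: code + error log -> diff
        apply_patch(repo_dir, patch)
    return False          # maps to the fixer's -1 reward
```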

Fighting Magic with Magic: How Do You Keep the AI from Making Things Up?

If you let the AI generate bugs unchecked, it will most likely hallucinate. SSR therefore puts every artifact through a strict, security-checkpoint-style consistency verification process (sketched in code after the checklist).

A qualified bug artifact must pass every one of the following gates:

  • Existence check: the referenced test files must exist in the original repository.
  • Parser check: the Python parser must be able to interpret the test output.
  • Script validity: the test script must run successfully before the code is damaged.
  • Bug scope control: the number of modified files must fall within the configured difficulty range.
  • Bug validity (key): after the bug is injected, tests that previously passed must fail. If they still pass, the bug never took effect.
  • Weakening validity: after the "weakening patch" is applied, the previously failing tests must pass, proving the test suite has been successfully deceived.
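A hedged sketch of how these gates could be chained, assuming simple helpers (apply_patch, revert_all) that are illustrative rather than the paper's actual code:

```python
# Hedged sketch of the consistency gates (all names are illustrative, not SSR's code).
import os
import subprocess

def tests_pass(repo: str) -> bool:
    """Run the destroyer-provided test script and report overall success."""
    return subprocess.run(["bash", "test_script.sh"], cwd=repo).returncode == 0

def verify_artifact(repo, test_files, n_changed_files, lo, hi, apply_patch, revert_all):
    # Existence check: every referenced test file must exist in the original repo.
    if not all(os.path.exists(os.path.join(repo, f)) for f in test_files):
        return False
    # Script validity: the test script must succeed before the code is damaged.
    # (Parser check omitted for brevity: run test_parser.py over the captured output.)
    if not tests_pass(repo):
        return False
    # Bug scope control: the changed-file count must sit in the difficulty band.
    if not (lo <= n_changed_files <= hi):
        return False
    # Bug validity (key): once the bug is injected, the tests must fail.
    apply_patch("bug_inject.diff")
    if tests_pass(repo):
        return False  # the bug had no observable effect
    # Weakening validity: with the weakening patch applied, the suite must pass.
    apply_patch("test_weaken.diff")
    ok = tests_pass(repo)
    revert_all()  # leave a clean bug state for the fixer episode
    return ok
```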

The Most Brilliant Move: Inverse Mutation Testing

Inverse mutation testing is a new concept the authors introduce to verify the quality of generated bugs.

Traditional mutation testing randomly modifies the code to see whether the tests catch the change.

Inverse mutation testing does the opposite: it restores the files touched by the bug to their original state, one at a time.

  • If the failing tests pass after a file is restored, that file really is part of the bug's cause.
  • If the tests still fail after the file is restored, that file has nothing to do with the bug.

This step ensures that every modification generated by AI is necessary.
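A sketch of the idea, assuming we kept a pristine copy of each file the bug modified (helper names are illustrative, not from the paper):

```python
# Sketch of inverse mutation testing: restore each bug-touched file, one at a
# time, and re-run the tests to confirm that the file is really part of the bug.
import pathlib
import subprocess

def tests_fail(repo: str) -> bool:
    return subprocess.run(["bash", "test_script.sh"], cwd=repo).returncode != 0

def necessary_files(repo: str, originals: dict[str, str]) -> list[str]:
    """`originals` maps each bug-modified path to its pre-bug file content."""
    needed = []
    for rel_path, clean_text in originals.items():
        path = pathlib.Path(repo, rel_path)
        buggy_text = path.read_text()
        path.write_text(clean_text)      # restore just this one file
        if not tests_fail(repo):
            needed.append(rel_path)      # restoring it fixed the tests: it matters
        path.write_text(buggy_text)      # put the bug back before the next file
    return needed
```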

How to Create a "Perfect" Bug?

If the "destroyer" simply changes x = 1 to x = 0, the "fixer" won't learn anything.

To make AI smarter, the research team explored several creative Bug injection strategies.

Strategy A: Direct Injection

Telling the AI: "Go create a bug." This is the crudest method.

As expected, the AI usually just flips a random number or operator in the code.

Such bugs are too superficial; the fixer sees through them at a glance, and the training effect is the worst.

Strategy B: Brute-Force Deletion (Removal-only)

Telling the AI: "Delete the code of this core function!"

This forces the fixer to re-implement the missing functionality from the surrounding context and the remaining test code.

Doing so heavily exercises the AI's code comprehension and refactoring abilities.

Strategy C: History Rollback

Telling the AI: "Go through the earlier commit records and roll the code back to an old version."

A repository's history is full of real bugs and the evolution of its features.

Making the AI face a past state of the code is like having it relive the project's evolution; the bugs generated this way are the most natural and the most practically meaningful.

Experiments show that combining the "deletion" and "history rollback" strategies works best, preserving both difficulty and authenticity; a git-based sketch of the rollback idea follows.
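The sketch below assumes the destroyer picks one file and reverts it to a randomly chosen earlier revision. This is an illustration of the idea, not the paper's implementation.

```python
# Hedged sketch of history-rollback bug injection: check out an older version of
# one file so the injected "bug" is a real historical regression.
import random
import subprocess

def git(repo: str, *args: str) -> str:
    return subprocess.run(["git", *args], cwd=repo,
                          capture_output=True, text=True, check=True).stdout

def rollback_file(repo: str, path: str, max_depth: int = 50) -> str:
    """Restore `path` to a randomly chosen earlier revision; return that commit."""
    commits = git(repo, "log", "--format=%H", f"-{max_depth}", "--", path).split()
    if len(commits) < 2:
        raise ValueError(f"{path} has no earlier revision to roll back to")
    old = random.choice(commits[1:])        # skip HEAD itself
    git(repo, "checkout", old, "--", path)  # working tree now holds the old version
    return old
```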

The Ultimate Move: High-order Bugs

If the fixer tries to fix a bug and fails, SSR treats the failure itself as recyclable material.

The fixer's failed attempt is often a half-finished product: it may have solved part of the problem while introducing new issues. Isn't that exactly a more complex, better-hidden bug?

The system takes this "failed fix" as a new bug state and hands it back to the fixer.

This multi-round and hierarchical fault model greatly enriches the dimensions of the training data.
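A sketch of the recycling loop, with queue semantics that are my assumption rather than the paper's (run_fixer_episode and snapshot_repo are illustrative helpers):

```python
# Sketch of high-order bug recycling: a failed fix attempt becomes a fresh,
# harder bug state for a later fixer episode.
from collections import deque

def self_play_rounds(initial_bug_states, run_fixer_episode, snapshot_repo,
                     max_rounds: int = 100) -> None:
    queue = deque(initial_bug_states)
    for _ in range(max_rounds):
        if not queue:
            break
        state = queue.popleft()
        solved, final_repo = run_fixer_episode(state)  # fixer attempts a repair
        if not solved:
            queue.append(snapshot_repo(final_repo))    # recycle the partial fix
```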

The Cruel Reward Mechanism and Adversarial Game

In reinforcement learning, the reward function is the conductor's baton.

The reward design of SSR has a "subtle sense of balance".

For the fixer, the reward is simple: +1 if everything is correct, -1 otherwise. Pure win or lose.

But for the destroyer, it's very interesting.

  • If the destroyer's bugs are too easy and the fixer always repairs them (solution rate s = 1), the destroyer earns little reward.
  • If a bug is impossible to fix (solution rate s = 0), the destroyer is punished, since it has probably created a logically contradictory dead end.

SSR scores the destroyer with a reward that is a function of the solution rate s (the exact expression is given in the paper).

Here, s ∈ [0, 1] is the solution rate (the proportion of bugs the fixer successfully repairs), and α ∈ (0, 1) is a hyperparameter controlling the penalty for degenerate solution rates; the experiments set α = 0.8.
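As a purely illustrative sketch, a destroyer reward with the described properties could look like the function below. The exact functional form here is an assumption chosen to match the qualitative description, not the formula from the paper.

```python
# Illustrative destroyer reward. NOTE: this functional form is an assumption; it
# is NOT the SSR formula, only a shape consistent with the description above.
def destroyer_reward(s: float, alpha: float = 0.8) -> float:
    """s: fixer solution rate in [0, 1]; alpha: penalty-shape hyperparameter."""
    if s == 0.0:
        return -1.0            # unsolvable bugs are punished outright
    return (1.0 - s) ** alpha  # reward fades to 0 as bugs become trivially easy
```

Any reward of this shape pays the destroyer most for bugs that are hard but still solvable, and nothing for trivial ones, which is the "ability boundary" behavior described next.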

In other words, the best bugs are the ones the fixer finds genuinely tricky: a pass rate that is neither too high nor too low, sitting right on the fixer's "ability boundary".

This forces the destroyer to keep raising the difficulty to exactly the point the fixer can "reach with a jump", driving the co-evolution of both sides.

The Results Are Out: Has AI Really Become Stronger?

The research team used the 32B Code World Model (CWM) as the base model and trained it on 512 H100 GPUs.

They evaluated it on two authoritative benchmarks:

  • SWE-bench Verified: a collection of real GitHub issues, verified by humans.
  • SWE-Bench Pro: a collection of harder, enterprise-grade problems.

The competitors are baseline models trained on "human data", built on the same model architecture and the same environment images.

The so-called human-data baseline is trained the traditional way, on "issue description + test cases".