
DeepMind's new paper stirs up a storm: a fully automated AI evolutionary algorithm produces solutions even experts couldn't think of. Netizens: this might be the "ace card".

AI Frontline (AI前线), 2026-02-27 17:30
The "last line of defense" of human programming is also being loosened by AI.

When it comes to AI Coding, many people used to have a "psychological comfort":

AI can only write "scaffolding code" and fill in front-end pages; when it comes to core algorithms and business logic, humans still have to step in.

However, this "last line of defense" is also loosening.

Google DeepMind recently did something even more remarkable: they had an LLM-driven agent directly rewrite and evolve algorithm code itself, not just tuning parameters but modifying the algorithm's logic.

Each modified version was then run repeatedly in real game environments and evaluated automatically; the inferior were eliminated, the superior retained, and the algorithm evolved round by round.

So, what was the result? It actually created brand-new multi-agent learning algorithms that outperformed versions hand-polished by human experts in multiple tests.

Importantly, these mechanisms are not intuitive; they are the kind of solutions humans would struggle to find through experience alone.

More crucially, humans only need to define the algorithm framework. The search, modification, and screening are then completed automatically, with no manual parameter tuning, no repeated trial-and-error, and no tweaking based on the researcher's intuition.

This agent is called AlphaEvolve, continuing DeepMind's "Alpha" naming tradition (AlphaGo, AlphaZero, AlphaFold). "Evolve" points to its core mechanism: continuously rewriting and screening algorithms in a process akin to biological evolution.

AlphaEvolve itself has existed since last year, but this is the first time it has been used to discover learning algorithms.

It combines the Gemini family of large models with evolutionary search, continuously generating, testing, screening, and re-evolving code.

DeepMind wrote the research process and results into a 37-page paper titled "Discovering Multiagent Learning Algorithms with Large Language Models", which caused a stir in the technology circle as soon as it was published.

After reading it, some netizens exclaimed that this thing is really "terrifying":

"This seems to be an ace in DeepMind's hand. I think it might enable Google to win the game."

Some people sharply commented:

"This is like teaching a child to read and then watching it write its own textbook."

Some people have already started thinking further: since AI can design better learning algorithms, perhaps it should first design a more robust "ethical engine" for itself and solve alignment before ASI truly takes off.

Humans only select the algorithm framework; AI evolves in a fully automatic closed loop

Let's take a closer look at the experimental design and operation process.

It should be noted that the research team did not let the model "write algorithms from scratch". Instead, they selected two mature frameworks:

CFR (Counterfactual Regret Minimization): the CFR algorithm family relies on recursive definitions to accumulate regret values and construct average strategies.

PSRO (Policy-Space Response Oracles): it continuously expands the policy population by iteratively computing best responses and solving a meta-strategy.

In the past, although classic algorithms like CFR and PSRO have a solid theoretical foundation for solving imperfect-information games (such as poker), the truly useful "upgraded versions" still had to be developed by human experts through experience-driven parameter tuning, rule modification, and trial-and-error.
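To make the regret side concrete, here is a minimal sketch (our own illustration, not code from the paper) of the regret-matching rule the CFR family is built on: positive accumulated regrets are normalized into the current strategy, and if no action has positive regret the strategy falls back to uniform.

```python
def regret_matching(cumulative_regrets):
    """Map accumulated regrets to a probability distribution over actions."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    n = len(cumulative_regrets)
    if total <= 0.0:
        # No action has positive regret: play uniformly at random.
        return [1.0 / n] * n
    return [p / total for p in positive]

print(regret_matching([2.0, -1.0, 2.0]))  # → [0.5, 0.0, 0.5]
```

Variants like CFR+ and DCFR differ mainly in how the cumulative regrets feeding this rule are clipped, discounted, and averaged, which is exactly the logic opened up for evolution below.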

Then, the researchers split the algorithm's core logic into several rewritable Python functions, such as the regret-accumulation rule, the current-strategy generation method, the average-strategy update rule, and PSRO's meta-solver logic.

That is to say, they opened only the "key decision-making logic" for the LLM to modify, keeping the rest of the framework fixed. This step is crucial, equivalent to defining the "gene range" for evolution.
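The setup can be pictured roughly like this; the function names and skeleton below are our own illustrative assumptions, not the paper's actual code. Only the marked region would be exposed to the LLM for mutation; the skeleton around it stays fixed:

```python
# --- rewritable region: only these bodies are exposed to the LLM ----------
def accumulate_regret(cumulative, instantaneous, iteration):
    """How this iteration's regret is folded into the running total."""
    return cumulative + instantaneous

def discount(cumulative, iteration):
    """Optional discounting of old regret (identity in vanilla CFR)."""
    return cumulative
# --------------------------------------------------------------------------

def cfr_regret_update(cumulative, instantaneous, iteration):
    """Fixed skeleton: never edited, always calls the rewritable pieces."""
    cumulative = discount(cumulative, iteration)
    return accumulate_regret(cumulative, instantaneous, iteration)

print(cfr_regret_update(3.0, 1.0, iteration=5))  # → 4.0
```

Because only the small functions vary while the calling skeleton is frozen, every mutated candidate remains a syntactically valid, runnable CFR variant.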

Next, it enters the real "evolution stage".

AlphaEvolve treats the current algorithm code as an "individual", and the LLM generates several semantically meaningful rewrites: not random mutations, but targeted changes to specific logic, control flow, or update rules.

Each rewritten version is automatically compiled and run, then pitted against opponents in a set of game environments and scored on metrics such as exploitability. Better-performing versions are retained as the basis for the next round of search; poor performers are eliminated outright.

The whole process is a closed loop: generate → run → evaluate → screen → regenerate, cycling forward. Humans do not intervene to tune parameters or screen candidates by hand; they only set the rules and the evaluation criteria.
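The loop can be sketched as follows. This is a hedged sketch of the selection logic only: `mutate` and `evaluate` are placeholders standing in for the LLM rewriter and the game-based exploitability scorer, and the toy usage below evolves a number rather than code.

```python
import random

def evolve(seed_program, mutate, evaluate, population_size=4, generations=3):
    """Generate → evaluate → screen → regenerate; lower score is better."""
    population = [(seed_program, evaluate(seed_program))]
    for _ in range(generations):
        parent, _ = min(population, key=lambda p: p[1])      # best so far
        children = [mutate(parent) for _ in range(population_size)]
        population += [(c, evaluate(c)) for c in children]
        # Keep only the top performers as the basis for the next round.
        population = sorted(population, key=lambda p: p[1])[:population_size]
    return min(population, key=lambda p: p[1])

random.seed(0)
best, score = evolve(
    seed_program=10.0,
    mutate=lambda p: p + random.uniform(-3, 3),  # stand-in for LLM rewriting
    evaluate=lambda p: abs(p),                   # stand-in "exploitability"
)
assert score <= 10.0  # the best survivor is never worse than the seed
```

Because the best individual always survives screening, the loop's score is monotonically non-increasing across generations, which mirrors the "eliminate the inferior, select the superior" behavior described above.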

Caption: This schematic diagram was also created by AI.

As a result, AI evolved two brand - new algorithms.

Let's first look at the CFR school, where AlphaEvolve evolved VAD-CFR.

The AI did not just tweak small parameters. It directly modified core logic such as how regret values are accumulated, how they are discounted, and when the average strategy starts.

For example, it introduced mechanisms such as volatility-sensitive discounting (dynamic discounting based on fluctuations) and a hard warm-start schedule (holding back in the early stage and going full force later).

It may sound abstract, but the effect is clear: in multiple games, it outperformed the strongest versions hand-polished by humans to date.
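The two mechanisms can be rendered as a toy sketch. The formulas below are our own assumptions in the spirit of the description, not VAD-CFR's actual rules: the discount factor shrinks as recent regrets become more volatile, and the average strategy receives zero weight until a warm-start iteration passes.

```python
def volatility_discount(cumulative_regret, recent_regrets, base=0.9):
    """Discount accumulated regret more aggressively when recent regrets
    fluctuate more (illustrative formula, not the paper's)."""
    if len(recent_regrets) < 2:
        return cumulative_regret
    mean = sum(recent_regrets) / len(recent_regrets)
    variance = sum((r - mean) ** 2 for r in recent_regrets) / len(recent_regrets)
    factor = base / (1.0 + variance)  # higher volatility → smaller factor
    return cumulative_regret * factor

def average_strategy_weight(iteration, warm_start=500):
    """Hard warm start: ignore early iterations, full weight afterwards."""
    return 0.0 if iteration < warm_start else 1.0

print(average_strategy_weight(100))  # → 0.0 (still warming up)
print(average_strategy_weight(600))  # → 1.0 (full force)
```

A hard warm start of this shape would also explain the "hits the accelerator" kink in the convergence curves discussed below.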

This figure is very intuitive, showing the convergence of various CFR variants across game environments. The upper part shows the training games used in the search stage; the lower part shows the larger, more complex test games.

The horizontal axis is the number of iterations (up to 1,000); the vertical axis is exploitability (the lower the value, the closer to equilibrium). The faster and lower a curve drops, the stronger the algorithm.

The gray line represents VAD-CFR. In most games it drops faster and lower, clearly outperforming versions like CFR+, DCFR, and PCFR+ that humans have optimized over multiple rounds.

In some games, after about 500 iterations the curve suddenly "hits the accelerator" and drops markedly faster: this is exactly the moment its warm-up stage ends and it starts exerting full strength.

It seems to be accumulating strength silently in the first half and truly sprinting in the second half.

More crucially, on the larger, harder test games, VAD-CFR still converges faster and further than human-designed algorithms such as vanilla CFR, CFR+, and DCFR, avoiding the trap of "only being able to solve practice problems".

This shows it is not a small trick overfitted to the training games; it has found a more efficient update method at the level of algorithm structure.

Now let's look at the PSRO school: the AI evolved the SHOR-PSRO algorithm.

What it does is very simple and bold: it redesigned the "meta-solver".

Traditional meta-solvers commit to either exploration or equilibrium approximation, with a fixed trade-off. SHOR instead mixes multiple update mechanisms into a hybrid meta-solver and adjusts the mixture dynamically as training progresses, letting the process transition automatically from "diversity exploration" to "equilibrium approximation".
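One way such a hybrid meta-solver could look, as a hedged sketch: the linear schedule and the choice of the two component distributions (uniform for exploration, a Nash-like distribution for equilibrium) are our assumptions, not SHOR's actual design.

```python
def hybrid_meta_solver(uniform_dist, nash_dist, iteration, total_iterations):
    """Blend an exploratory distribution with an equilibrium-seeking one,
    annealing linearly from exploration toward equilibrium over training."""
    t = min(iteration / total_iterations, 1.0)  # progress: 0 → 1
    return [(1 - t) * u + t * n for u, n in zip(uniform_dist, nash_dist)]

# Halfway through training, the mixture is an even blend of the two.
print(hybrid_meta_solver([0.5, 0.5], [1.0, 0.0],
                         iteration=50, total_iterations=100))  # → [0.75, 0.25]
```

Early on this behaves like a Uniform meta-solver (broad policy diversity); late in training it converges toward the equilibrium-seeking component, which matches the "explore first, approximate equilibrium later" behavior attributed to SHOR.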

This figure shows the comparison between it and classic methods such as Uniform, Nash, AlphaRank, PRD, and RM.

Different colors in the figure represent different meta - solvers: Uniform, Nash, AlphaRank, PRD, Regret Matching (RM), and the evolved SHOR (brown line).

The whole figure is divided into two parts. The upper part is the training games, and the lower part is the larger - scale and more complex test games, used to test the generalization ability of the algorithm.

The horizontal axis represents the number of PSRO iterations (up to 100 rounds), and the vertical axis represents exploitability (logarithmic scale); the lower the value, the closer the algorithm is to the game equilibrium and the better the performance.

It can be seen that in most games, the SHOR curve drops faster, and the exploitability at the 100th iteration is lower, indicating that it can approximate the equilibrium more effectively with the same number of iterations.

Especially in the more complex test games (such as 4-player Kuhn poker and 6-sided Liar's Dice), SHOR still maintains its advantage without obvious degradation.

In short, SHOR - PSRO is more flexible and smarter than traditional methods in deciding "when to explore more and when to focus on approximating the equilibrium".

It doesn't win by adjusting parameters, but by modifying the scheduling logic itself.

This article is from the WeChat public account "AI Frontline" (ID: ai-front), author: Muzi. It is published by 36Kr with authorization.