
DeepMind makes another appearance in Nature: An AI agent has created the most powerful RL algorithm.


One of the main goals of artificial intelligence (AI) is to design agents that, like humans, can autonomously predict, act, and ultimately achieve goals in complex environments. Training such agents relies on reinforcement learning (RL), a field that has been studied for decades. However, enabling agents to autonomously develop efficient RL algorithms has remained elusive.

To address this, the Google DeepMind team proposed a method that autonomously discovers RL rules from the interaction experience of many generations of agents across different environments.

In large-scale experiments, DiscoRL not only outperformed all existing rules on the Atari benchmark but also surpassed human-designed rules, including several mainstream RL algorithms, on challenging benchmarks it had never encountered before. The research has been published in the scientific journal Nature.

Paper link: https://www.nature.com/articles/s41586-025-09761-x

This indicates that in the future, RL algorithms used to build advanced AI may no longer require human design but can be automatically discovered through the agents' own experiences.

Why Can Agents Autonomously Discover RL Algorithms?

According to the paper, their discovery method involves two types of optimization: agent optimization and meta-optimization.

The agent's parameters are optimized by updating its policy and predictions to move them toward the targets generated by the RL rule. At the same time, the meta-parameters are optimized by adjusting the targets that the RL rule produces so as to maximize the agents' cumulative reward.
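Conceptually, this is a bilevel optimization: an inner loop that updates each agent toward the rule's targets, and an outer loop that updates the rule itself to maximize return. The sketch below illustrates that structure with toy linear components and a scalar stand-in for the environment return; every name, shape, and update step here is an illustrative assumption, not the paper's implementation.

```python
# Toy sketch of the bilevel structure: inner agent updates under the current
# rule, outer meta-updates of the rule itself. All components are illustrative.
import jax
import jax.numpy as jnp

OBS_DIM, OUT_DIM = 4, 3

def rule_targets(meta_params, outputs):
    # Meta-network ("RL rule"): maps the agent's outputs to learning targets.
    return jnp.tanh(meta_params @ outputs)

def agent_out(agent_params, obs):
    # Toy agent: a linear map from observation to a vector of outputs.
    return agent_params @ obs

def agent_loss(agent_params, meta_params, obs):
    # Agent optimization: pull the outputs toward the rule's targets; the
    # targets are treated as fixed during the agent's own update.
    out = agent_out(agent_params, obs)
    target = rule_targets(meta_params, jax.lax.stop_gradient(out))
    return jnp.sum((out - target) ** 2)

def proxy_return(agent_params, obs):
    # Stand-in for the cumulative reward the agent collects in its environment.
    return jnp.sum(agent_out(agent_params, obs))

def meta_objective(meta_params, agent_params, obs, lr=0.1):
    # Meta-optimization: apply one agent update under the rule, then measure
    # the (negated) return of the updated agent.
    g = jax.grad(agent_loss)(agent_params, meta_params, obs)
    return -proxy_return(agent_params - lr * g, obs)

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
agent_p = 0.1 * jax.random.normal(k1, (OUT_DIM, OBS_DIM))
meta_p = 0.1 * jax.random.normal(k2, (OUT_DIM, OUT_DIM))
obs = jnp.ones(OBS_DIM)
for _ in range(10):                                                  # outer loop
    agent_p -= 0.1 * jax.grad(agent_loss)(agent_p, meta_p, obs)      # agent step
    meta_p -= 0.01 * jax.grad(meta_objective)(meta_p, agent_p, obs)  # meta step
```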

Figure | The overall process by which an agent autonomously discovers an RL algorithm: (a) Discovery process: multiple agents interact and train in parallel in different environments, following the learning rule defined by the meta-network; the meta-network is continuously optimized during this process to improve overall performance. (b) Agent architecture: each agent outputs a policy (π), an observation prediction (y), an action prediction (z), an action value (q), and an auxiliary policy prediction (p), where the semantics of y and z are determined by the meta-network. (c) Meta-network architecture: the meta-network receives the agent's output trajectory, environment rewards, and termination signals, and produces target predictions for the current and future time steps; the agent updates itself by minimizing its prediction error against these targets. (d) Meta-optimization: the meta-gradient is computed by backpropagating through the agent's update process, so as to optimize the meta-parameters and maximize the agents' cumulative reward in their environments.
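For readability, the five agent outputs named in panel (b) can be pictured as a simple container like the hypothetical one below; the field names and shapes are assumptions made purely for illustration, not the paper's definitions.

```python
# Hypothetical container for the agent outputs named in panel (b).
from typing import NamedTuple
import jax.numpy as jnp

class AgentOutputs(NamedTuple):
    policy: jnp.ndarray          # pi: distribution over actions
    obs_prediction: jnp.ndarray  # y: observation prediction (semantics set by the meta-network)
    act_prediction: jnp.ndarray  # z: action prediction (semantics set by the meta-network)
    action_value: jnp.ndarray    # q: action-value estimates
    aux_policy: jnp.ndarray      # p: auxiliary policy prediction

NUM_ACTIONS, PRED_DIM = 6, 8
outputs = AgentOutputs(
    policy=jnp.full((NUM_ACTIONS,), 1.0 / NUM_ACTIONS),
    obs_prediction=jnp.zeros((PRED_DIM,)),
    act_prediction=jnp.zeros((NUM_ACTIONS, PRED_DIM)),
    action_value=jnp.zeros((NUM_ACTIONS,)),
    aux_policy=jnp.full((NUM_ACTIONS,), 1.0 / NUM_ACTIONS),
)
```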

For agent optimization, the agent outputs three kinds of results: a policy, an observation prediction, and an action prediction, and the meta-network generates a corresponding learning target for each of them. The research team used the Kullback–Leibler divergence to measure the gap between the agent's outputs and these targets, which helps keep training stable and general. The agent then updates itself toward the targets, gradually improving its policy. In addition, an auxiliary loss optimizes the predefined action value and policy prediction, making learning more stable and efficient.
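A minimal sketch of such a KL-based update, assuming softmax-parameterized outputs and a squared-error auxiliary term (the paper defines the exact forms; everything below is illustrative):

```python
# Sketch of a KL-based agent loss: match the agent's distribution-valued
# outputs to targets from the meta-network, plus an auxiliary value loss.
import jax
import jax.numpy as jnp

def kl_divergence(target_probs, logits):
    # KL(target || prediction), with the prediction given as logits.
    log_pred = jax.nn.log_softmax(logits)
    return jnp.sum(target_probs * (jnp.log(target_probs + 1e-8) - log_pred))

def agent_update_loss(policy_logits, pred_logits, q_values,
                      target_policy, target_pred, value_target):
    policy_loss = kl_divergence(target_policy, policy_logits)  # policy toward its target
    pred_loss = kl_divergence(target_pred, pred_logits)        # prediction toward its target
    aux_value_loss = jnp.sum((q_values - value_target) ** 2)   # auxiliary action-value loss
    return policy_loss + pred_loss + aux_value_loss

# Tiny usage example with made-up numbers.
loss = agent_update_loss(
    policy_logits=jnp.zeros(4), pred_logits=jnp.zeros(8), q_values=jnp.zeros(4),
    target_policy=jnp.array([0.7, 0.1, 0.1, 0.1]),
    target_pred=jnp.full((8,), 1.0 / 8), value_target=jnp.ones(4))
```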

For meta-optimization, multiple agents learn independently in different environments, and the meta-network adjusts its own parameters using a meta-gradient computed from their overall performance. The agents' parameters are reset periodically, so the learning rule must improve performance quickly within a limited lifetime. The meta-gradient is obtained by backpropagating through the agents' update process into a standard reinforcement learning objective, in the style of Advantage Actor-Critic (A2C), evaluated with a value function dedicated to the meta-learning stage.
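A minimal sketch of the meta-gradient idea under strong simplifying assumptions: a toy linear agent, a single transition, and one inner update, with an A2C-style advantage-weighted log-probability as the outer objective. None of the shapes or components below come from the paper.

```python
# Sketch of the meta-gradient: differentiate an A2C-style outer objective
# through one agent update. Toy linear agent; all components are assumptions.
import jax
import jax.numpy as jnp

OBS_DIM, NUM_ACTIONS = 4, 3

def policy_logits(agent_params, obs):
    return agent_params @ obs                       # toy linear policy head

def agent_loss(agent_params, meta_params, obs):
    # Inner loss: pull the policy toward a target produced by the meta-network;
    # the target is treated as fixed during the agent's own update.
    logits = policy_logits(agent_params, obs)
    target = jax.nn.softmax(meta_params @ jax.lax.stop_gradient(logits))
    return -jnp.sum(target * jax.nn.log_softmax(logits))

def meta_loss(meta_params, agent_params, obs, action, ret, baseline, lr=0.1):
    # One agent update under the current rule...
    g = jax.grad(agent_loss)(agent_params, meta_params, obs)
    updated = agent_params - lr * g
    # ...then an A2C-style objective for the updated agent: advantage-weighted
    # log-probability. In the paper the baseline comes from a value function
    # used only during meta-learning; here it is just a number.
    logp = jax.nn.log_softmax(policy_logits(updated, obs))[action]
    advantage = ret - baseline
    return -advantage * logp        # minimizing this maximizes the return signal

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
meta_p = 0.1 * jax.random.normal(k1, (NUM_ACTIONS, NUM_ACTIONS))
agent_p = 0.1 * jax.random.normal(k2, (NUM_ACTIONS, OBS_DIM))
meta_grad = jax.grad(meta_loss)(meta_p, agent_p, jnp.ones(OBS_DIM), 1, 2.0, 0.5)
```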

The Strongest RL Algorithm, Built by AI

To evaluate DiscoRL, the team used the interquartile mean (IQM) as an aggregate performance metric. The IQM is computed from normalized scores across multi-task benchmarks and has been shown to be statistically reliable.
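Concretely, the IQM is the mean of the middle 50% of the normalized scores, i.e. the bottom and top quartiles are discarded before averaging. The sketch below uses made-up numbers purely for illustration.

```python
# Interquartile mean (IQM): drop the lowest and highest 25% of normalized
# scores, then average the rest. Sample data is made up.
import jax.numpy as jnp

def interquartile_mean(scores):
    s = jnp.sort(jnp.ravel(scores))
    n = s.shape[0]
    lo, hi = n // 4, n - n // 4        # indices bounding the middle 50%
    return jnp.mean(s[lo:hi])

normalized_scores = jnp.array([0.1, 0.4, 0.5, 0.6, 0.7, 0.9, 1.2, 5.0])
print(interquartile_mean(normalized_scores))   # mean of the middle four: 0.675
```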

1. Atari Experiment

The Atari benchmark is one of the most widely used evaluations in reinforcement learning. To verify the method's ability to discover a rule automatically, the team meta-trained a rule, Disco57, on 57 Atari games and evaluated it on the same games.

The evaluation used a network architecture of a scale comparable to MuZero's. Disco57 reached an IQM of 13.86, surpassing all existing reinforcement learning rules on the Atari benchmark, including MuZero and Dreamer, while also being significantly more efficient in wall-clock time than the state-of-the-art MuZero.

Figure | Evaluation results of Disco57 in the Atari experiment. The horizontal axis shows the number of environment interaction steps (in millions), and the vertical axis shows the IQM score on the benchmark.

2. Generalization Ability

The research team further evaluated the generality of Disco57 on several independent benchmarks it had never seen during discovery. On the 16 ProcGen 2D games, Disco57 outperformed all published methods, including MuZero and PPO. It was also competitive on the Crafter benchmark and placed third in the NetHack NeurIPS 2021 Challenge without using any domain-specific knowledge, while being significantly more efficient than an IMPALA agent trained under the same settings. In addition, it proved robust across settings such as network size, replay ratio, and hyperparameter choices.

Figure | Evaluation results of Disco57 on ProcGen, Crafter, and the NetHack NeurIPS 2021 Challenge.

3. Complexity and Diversity of the Environment

The research team also discovered another RL rule, Disco103, using three benchmarks, Atari, ProcGen, and DMLab-30, for a total of 103 environments.

Disco103 performed similarly to Disco57 on the Atari benchmark. In particular, it achieved human-level performance on the Crafter benchmark and approached the state-of-the-art performance of MuZero on Sokoban.

These results indicate that the more complex and diverse the environments used for discovery, the more powerful and general the discovered reinforcement learning rules become, and that they maintain strong performance even in environments never seen during discovery.

Figure | Comparison results of Disco103 and Disco57 in the same test. The blue line (Disco57) represents the rules discovered on the Atari benchmark, and the orange line (Disco103) represents the rules jointly discovered on the Atari, ProcGen, and DMLab-30 benchmarks.

4. High Efficiency and Stability

The research team evaluated multiple versions of Disco57. The best rule was discovered within approximately 600 million steps per Atari game, equivalent to about three experiment runs across the 57 Atari games. This is far more efficient than designing RL rules by hand, which typically requires many more experiments and a great deal of researchers' time.

In addition, as the number of Atari games used for discovery increased, DiscoRL's performance on the unseen ProcGen benchmark also improved, indicating that discovered RL rules scale with the number and diversity of the environments involved. In other words, the performance of a discovered rule depends on the data (i.e., the environments) and the amount of computation.

Figure | The best DiscoRL rule was discovered within approximately 600 million steps per game; as the number of environments used for discovery increased, DiscoRL's performance on the unseen ProcGen benchmark also improved.

The research team said that in the future, the design of RL algorithms for advanced AI may be dominated by machines that can scale efficiently with data and compute, rather than by human design.

This discovery is exciting but also raises concerns: on the one hand, it opens new possibilities for research; on the other, society may not yet be ready to embrace this technology.

This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao), compiled by Xiaoxiao. It is published by 36Kr with authorization.