The father of AlphaGo has found a new way to create reinforcement learning algorithms: let AI design them itself.
Reinforcement learning has recently been one of the hottest topics in AI, with new algorithms constantly emerging.
So, the question arises: Can AI discover powerful reinforcement learning algorithms on its own?
Recently, a paper published by the Google DeepMind team in Nature explored this possibility, and the results are very positive: machines can indeed autonomously discover reinforcement learning rules with state-of-the-art (SOTA) performance, rules that outperform manually designed ones.
Paper title: Discovering state-of-the-art reinforcement learning algorithms
Paper link: https://www.nature.com/articles/s41586-025-09761-x
Notably, the team's leader and corresponding author is David Silver, a leading researcher in reinforcement learning who also led the famous AlphaGo project and is often called the "Father of AlphaGo". To date, his work has been cited nearly 270,000 times. The study has four co-first authors: Junhyuk Oh, Greg Farquhar, Iurii Kemaev, and Dan A. Calian.
In terms of method, the team's idea is to perform meta-learning over the accumulated experience of many agents in many complex environments. This process discovers the reinforcement learning rule that agents follow when updating their policies and making predictions.
The team also ran large-scale experiments and found that this automatically discovered rule outperformed all existing methods on the classic Atari benchmark and also beat various SOTA reinforcement learning algorithms on several challenging benchmarks it had never seen before.
This research result is of great significance. It means that in the future, the reinforcement learning algorithms required for advanced AI may no longer rely on manual design but can automatically emerge and evolve from the agents' own experiences.
Discovery Method
The team's discovery method involves two types of optimization: agent optimization and meta-optimization. The agent's parameters are optimized by updating its policy and predictions toward the targets generated by the reinforcement learning rule. Meanwhile, the meta-parameters of the reinforcement learning rule are optimized so that the targets it produces maximize the agents' cumulative reward.
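To make this two-level structure concrete, here is a deliberately tiny, self-contained Python toy, not the paper's system: a single number stands in for the rule's meta-parameters, the agent chases the targets the rule produces, and the rule is then chosen to maximize the reward its agents collect. The brute-force search at the end is purely for illustration; the paper uses meta-gradients instead.

```python
# Toy sketch of the two optimisations (agent vs. meta). Every function here is
# an illustrative stand-in, not part of the paper's method.

def rule_targets(meta_param, trajectory):
    """Stand-in for the RL rule: turn a trajectory into an update target."""
    rewards = [r for _, r in trajectory]
    return meta_param * sum(rewards) / len(rewards)

def agent_step(agent_param, target, lr=0.2):
    """Agent optimisation: move the agent's parameter toward the prescribed target."""
    return agent_param + lr * (target - agent_param)

def rollout(agent_param, steps=8):
    """Toy environment: reward is highest when the agent's parameter is near 1."""
    return [(agent_param, 2.0 - abs(agent_param - 1.0)) for _ in range(steps)]

def lifetime_reward(meta_param, updates=25):
    """Run one agent lifetime under a given rule and sum the reward it collects."""
    agent_param, total = 0.0, 0.0
    for _ in range(updates):
        traj = rollout(agent_param)
        total += sum(r for _, r in traj)
        agent_param = agent_step(agent_param, rule_targets(meta_param, traj))
    return total

# Meta-optimisation: pick the rule parameter whose agents earn the most reward.
best = max([i / 10 for i in range(1, 31)], key=lifetime_reward)
print("best meta-parameter found:", best)
```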
Agent Network
Many reinforcement learning studies consider what predictions an agent should make (e.g., values) and what loss functions should be used to learn these predictions (e.g., TD learning) and to improve the policy (e.g., policy gradient).
Instead of designing these by hand, the team defined an expressive prediction space without predefined semantics and used a meta-network to meta-learn what the agent should optimize. The aim is to support a broad space of possible novel algorithms while retaining the ability to represent the core ideas of existing reinforcement learning algorithms.
To this end, in addition to outputting the policy π, the agent parameterized by θ also outputs two types of predictions: a vector prediction y(s) based on observations and a vector prediction z(s,a) based on actions, where s and a are observations and actions respectively (see the figure below).
The form of these predictions stems from the fundamental difference between "prediction" and "control". For example, the value function is usually split into a state-value function v(s) (for prediction) and an action-value function q(s,a) (for control). Many other concepts in reinforcement learning, such as rewards and successor features, also have an observation-based version s↦ℝ^m and an action-based version s,a↦ℝ^m. The functional form of this pair of predictions (y,z) is therefore general enough to represent many existing basic concepts in RL, without being limited to them.
In addition to these predictions to be discovered, in most of the experiments the agent also makes predictions with predefined semantics. Specifically, it produces an action-value function q(s,a) and an action-based auxiliary policy prediction p(s,a). This is meant to encourage the discovery process to focus on finding new concepts through y and z.
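A minimal PyTorch sketch of such an agent network is shown below. It is illustrative only: the sizes (obs_dim, num_actions, m), the single-layer torso, and the exact shape chosen for p(s,a) are assumptions, not the paper's architecture; the point is simply the set of output heads π, y(s), z(s,a), q(s,a) and p(s,a).

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """Toy agent with the output heads described above (all sizes are illustrative)."""

    def __init__(self, obs_dim=16, num_actions=4, m=8, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.pi_head = nn.Linear(hidden, num_actions)               # policy logits
        self.y_head = nn.Linear(hidden, m)                          # y(s): observation-based vector
        self.z_head = nn.Linear(hidden, num_actions * m)            # z(s,a): one m-vector per action
        self.q_head = nn.Linear(hidden, num_actions)                # q(s,a): predefined action values
        self.p_head = nn.Linear(hidden, num_actions * num_actions)  # p(s,a): next-step policy per action (assumed shape)
        self.num_actions, self.m = num_actions, m

    def forward(self, obs):
        h = self.torso(obs)
        A, m = self.num_actions, self.m
        return {
            "pi": torch.softmax(self.pi_head(h), dim=-1),
            "y": self.y_head(h),
            "z": self.z_head(h).view(-1, A, m),
            "q": self.q_head(h),
            "p": torch.softmax(self.p_head(h).view(-1, A, A), dim=-1),
        }

net = AgentNetwork()
outputs = net(torch.randn(2, 16))                        # a batch of 2 dummy observations
print({k: tuple(v.shape) for k, v in outputs.items()})
```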
Meta-Network
Most modern reinforcement learning rules adopt the "forward view" of RL. In this view, an RL rule receives a trajectory from time step t to t+n and uses this information to update the agent's predictions or policy, usually toward a "bootstrap target", that is, toward values predicted further in the future.
Correspondingly, the team's RL rule uses a meta-network (Figure 1c) as the function that determines the targets toward which the agent should update its predictions and policy. To produce the target at time step t, the meta-network receives as input a trajectory from time step t to t+n containing the agent's predictions, its policy, the rewards, and whether the episode has terminated. It processes these inputs with a standard LSTM, although other architectures could also be used.
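Below is a rough PyTorch sketch of such a meta-network, again an assumption-laden illustration rather than the paper's code: an LSTM reads the per-step inputs (policy, predictions, rewards, termination flags) over an n-step trajectory and emits per-step targets π̂, ŷ and ẑ. The feature layout, target split, and single-direction processing are simplifications.

```python
import torch
import torch.nn as nn

class MetaNetwork(nn.Module):
    """Toy meta-network: trajectory of agent outputs in, per-step targets out."""

    def __init__(self, num_actions=4, m=8, hidden=128):
        super().__init__()
        per_step = num_actions + m + num_actions * m + 2     # pi, y, z, reward, done
        self.lstm = nn.LSTM(per_step, hidden)                # time-major LSTM over the trajectory
        self.target_head = nn.Linear(hidden, num_actions + m + m)  # pi-hat, y-hat, z-hat (for the taken action)
        self.num_actions, self.m = num_actions, m

    def forward(self, pi, y, z, reward, done):
        # All inputs carry a leading time dimension of length n (batch omitted for brevity).
        feats = torch.cat([pi, y, z.flatten(1), reward[:, None], done[:, None]], dim=-1)
        h, _ = self.lstm(feats[:, None, :])                  # (n, batch=1, hidden)
        out = self.target_head(h[:, 0, :])
        pi_hat = torch.softmax(out[:, : self.num_actions], dim=-1)
        y_hat = out[:, self.num_actions : self.num_actions + self.m]
        z_hat = out[:, self.num_actions + self.m :]
        return pi_hat, y_hat, z_hat

n, A, m = 10, 4, 8
meta = MetaNetwork(A, m)
targets = meta(torch.softmax(torch.randn(n, A), -1), torch.randn(n, m),
               torch.randn(n, A, m), torch.randn(n), torch.zeros(n))
print([tuple(t.shape) for t in targets])
```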
The meta-network's choice of inputs and outputs preserves several desirable characteristics of manually designed RL rules:
First, the meta-network can handle any kind of observation and any size of discrete action space, because it does not receive observations directly as input but obtains information indirectly through the agent's predictions. In addition, it processes action-specific inputs and outputs by sharing weights across action dimensions. It can therefore generalize to very different environments.
Second, the meta-network is independent of the design of the agent network because it only receives the output of the agent network. As long as the agent network can produce the required form of output (π, y, z), the discovered RL rules can generalize to any agent architecture or scale.
Third, the search space defined by the meta-network includes the important algorithmic idea of "bootstrapping".
Fourth, since the meta-network processes both the policy and the predictions, it can not only meta-learn auxiliary tasks but also use the predictions directly to update the policy (e.g., as a baseline for variance reduction).
Finally, outputting targets in this way is more expressive than outputting a scalar loss function because it also includes semi-gradient methods such as Q-learning in the search space.
While inheriting these characteristics of standard RL algorithms, this parameter-rich neural network lets the discovered rule implement algorithms that are potentially far more efficient and context-aware.
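To see why outputting targets subsumes semi-gradient methods (the last point in the list above), consider a Q-learning-style update: the bootstrap target is treated as a constant, so gradients flow only through the current prediction. The generic PyTorch fragment below (not code from the paper) makes this explicit.

```python
import torch

q = torch.nn.Linear(4, 3)                 # tiny Q-network: 4 features in, 3 action values out
s, a, r, s_next, gamma = torch.randn(4), 1, 0.5, torch.randn(4), 0.99

with torch.no_grad():                      # bootstrap target computed without tracking gradients
    target = r + gamma * q(s_next).max()

td_loss = (q(s)[a] - target) ** 2          # gradient flows only into q(s)[a], not the target
td_loss.backward()
```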
Agent Optimization
The agent's parameters θ are updated to minimize the distance between its predictions and policy and the corresponding targets from the meta-network. The agent's loss function can be expressed as:

L(θ) = E[ D(π̂, π_θ) + D(ŷ, y_θ) + D(ẑ, z_θ) ] + L_aux(θ)

where D(p,q) is a distance function between p and q. The team chose the KL divergence as the distance function because it is general enough and had previously been found to simplify meta-optimization. Here, π_θ, y_θ, z_θ are the outputs of the agent network, π̂, ŷ, ẑ are the targets output by the meta-network, and each vector is normalized with the softmax function.
The auxiliary loss L_aux is used for the predictions with predefined semantics, namely the action value q and the auxiliary policy prediction p:

L_aux(θ) = E[ D(q̂, q_θ) + D(p̂, p_θ) ]

where q̂ is the action-value target produced by the Retrace algorithm, projected onto a two-hot vector, and p̂ = π_θ(s′) is the next-step policy. For consistency with the other losses, the team again uses the KL divergence as the distance function D.
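A minimal sketch of this loss in PyTorch is given below. The tensors are random stand-ins (in practice π, y, z, q, p come from the agent network, and the hatted targets from the meta-network and Retrace, with q̂ as a two-hot projection); only the softmax-then-KL structure is the point.

```python
import torch
import torch.nn.functional as F

def kl(target, pred):
    """KL(softmax(target) || softmax(pred)), averaged over the batch."""
    return F.kl_div(F.log_softmax(pred, dim=-1),
                    F.softmax(target, dim=-1), reduction="batchmean")

B, A, m = 32, 4, 8                                           # batch, actions, prediction size (assumed)
pi, y, z, q, p = (torch.randn(B, A), torch.randn(B, m), torch.randn(B, m),
                  torch.randn(B, A), torch.randn(B, A))      # agent outputs (random stand-ins)
pi_hat, y_hat, z_hat, q_hat, p_hat = (t.detach() for t in
    (torch.randn(B, A), torch.randn(B, m), torch.randn(B, m),
     torch.randn(B, A), torch.randn(B, A)))                  # targets (random stand-ins)

loss = kl(pi_hat, pi) + kl(y_hat, y) + kl(z_hat, z)          # loss on the discovered predictions and policy
loss = loss + kl(q_hat, q) + kl(p_hat, p)                    # auxiliary loss L_aux on q and p
```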
Meta-Optimization
The team's goal is to discover an RL rule (represented by a meta-network with meta-parameters η) that enables agents to maximize reward across a variety of training environments. This discovery objective J(η) and its meta-gradient ∇_η J(η) can be expressed as:

J(η) = E_{ℰ∼p(ℰ)} E_θ [ G ]  and  ∇_η J(η) = E_{ℰ∼p(ℰ)} E_θ [ ∇_η G ]

where ℰ∼p(ℰ) denotes an environment sampled from a distribution of environments, θ denotes the agent's parameters, drawn from an initial parameter distribution and continually evolving during learning under the RL rule, and G = E[Σ_t γ^t r_t] is the expected discounted sum of rewards, i.e., the standard RL objective. The meta-parameters η are optimized by gradient ascent on this objective.
To estimate this meta-gradient, the team instantiated a population of agents in a set of sampled environments, all learning according to the meta-network. To keep this approximation close to the distribution of interest, the team used a large number of complex environments drawn from challenging benchmarks, in contrast to previous work that focused on a few simple environments. The discovery process therefore has to confront a wide range of RL challenges, such as sparse rewards, long tasks, and partially observable or stochastic environments.
The parameters of each agent are reset periodically to encourage the update rule to make rapid learning progress within an agent's limited lifetime. As in previous work on meta-gradient RL, the meta-gradient term ∇_η G can be split into two factors using the chain rule:

∇_η G = (∇_η θ)(∇_θ G)

The first factor can be understood as the gradient of the agent's update process, while the second is the gradient of the standard RL objective.
To estimate the first term, the team iteratively updated the agent multiple times and backpropagated through the entire update process, as shown in Figure 1d. To keep this tractable, they backpropagated through a sliding window of 20 agent updates. To estimate the second term, they used the advantage actor-critic (A2C) method; to estimate the advantage, they trained a meta-value function, a value function used only during the discovery process.
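The toy PyTorch fragment below illustrates only the first part, differentiating through a window of agent updates: a scalar stands in for both the agent and the rule, the agent parameter is updated 20 times toward a target set by the meta-parameter, and the gradient of a stand-in "return" with respect to the meta-parameter is obtained by backpropagating through all of those updates. The A2C estimate of the second term is omitted.

```python
import torch

eta = torch.tensor(0.1, requires_grad=True)      # meta-parameter (scalar toy stand-in for the rule)
theta = torch.zeros((), requires_grad=True)      # agent parameter (scalar toy stand-in for the agent)
lr = 0.3

for _ in range(20):                              # sliding window of 20 agent updates
    inner_loss = (theta - 4.0 * eta) ** 2        # agent regresses toward a target determined by eta
    g, = torch.autograd.grad(inner_loss, theta, create_graph=True)
    theta = theta - lr * g                       # differentiable update, kept in the autograd graph

outer_return = -(theta - 1.0) ** 2               # stand-in for the agent's return after the updates
outer_return.backward()                          # meta-gradient flows back through all 20 updates
print(eta.grad)                                  # d(return) / d(eta)
```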
Experimental Results
The team applied this discovery method at scale, using a large population of agents in a set of complex environments.
The team named the discovered RL rule DiscoRL. For evaluation, they used the interquartile mean (IQM) of normalized scores across a benchmark's tasks to measure aggregate performance; IQM has previously been shown to be a statistically reliable metric.
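For reference, the IQM discards the bottom and top 25% of per-task scores and averages the middle 50%. The small NumPy sketch below shows the computation; the normalization baselines (random and human scores) are made-up numbers for illustration only.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

raw = np.array([1200.0, 300.0, 45.0, 9000.0])         # per-task raw scores (made up)
random_base = np.array([200.0, 50.0, 10.0, 500.0])    # e.g. random-policy scores (made up)
human = np.array([7000.0, 800.0, 60.0, 12000.0])      # e.g. human scores (made up)

normalised = (raw - random_base) / (human - random_base)
print(iqm(normalised))
```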
Atari
The Atari benchmark is one of the most studied benchmarks in the history of RL, consisting of 57 Atari 2600 games. They require complex strategies, planning, and long