Computing power can't save AI's intelligence? Google's new move ends the "stochastic parrot" debate.
In sparse-reward environments, traditional AI models struggle to find incentives and to learn hierarchical thinking. Now, Google's team has enabled agents to learn "leapfrog thinking" by introducing a meta-controller that manipulates the model's internal residual stream. The research shows that a hierarchical decision-making mechanism resembling the human brain's can form spontaneously inside large models, offering a new training paradigm for AI on complex multi-step tasks.
Is the biggest "weakness" of AI agents insufficient computing power?
No, too few rewards and overly long paths are the real issues.
In long-sequence tasks with sparse rewards, traditional token-by-token exploration is like walking through a maze blindfolded: no road signs, no hints, and you only learn whether you were right when you reach the end.
The result is an awkward reality: to make an agent perform complex tasks, an external planner often has to be attached to "guide" it.
Google's research takes a different approach: in the maze, the agent must step on a series of colored sub-goals in sequence and receives a reward only if the entire run is error-free. The harshest possible sparse reward is used to force out genuine hierarchical decision-making ability.
The real breakthrough is that they no longer only optimize the output but start to manipulate the "cognitive process" inside the model.
How agents explore efficiently in sparse-reward environments
Traditional large models rely on token-by-token exploration. For complex tasks that require many correct steps before any reward arrives, reward sparsity makes it hard for agents to complete long-sequence tasks that demand hierarchical decision-making.
This is like asking a person to walk through a maze blindfolded. Only when they reach the end can they get feedback, with no guidance in between. No matter how many times the person tries, they can't find the exit.
This is why current large-model agents need an external planner to complete complex, multi-step tasks. What Google's research does instead is require the agent to visit a series of colored positions (sub-goals) in a specific order in the maze, granting a reward only after the sequence is completed perfectly.
Figure 1: The agent needs to walk through different colored squares in the maze in order.
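To make the reward structure concrete, here is a minimal sketch of such a "combinatorial" sparse-reward task. This is our own illustration, not Google's code: the class name, the 1-D abstraction, and the reward values are all assumptions. The key property it demonstrates is that the agent gets zero feedback at every intermediate step and a single reward only when the full sub-goal sequence is correct.

```python
# Our illustrative sketch of a sparse-reward combinatorial task (not Google's code):
# the agent must visit colored sub-goals in a fixed order, and reward arrives
# only if the whole sequence is completed without error.

class ColourSequenceMaze:
    def __init__(self, goal_sequence):
        self.goal_sequence = list(goal_sequence)  # e.g. ["red", "green", "blue"]
        self.visited = []

    def step(self, colour_reached):
        """Record a visited sub-goal; all feedback is withheld until the end."""
        self.visited.append(colour_reached)
        done = len(self.visited) == len(self.goal_sequence)
        # Sparse reward: 1.0 only if the entire sequence matches, else 0.0.
        reward = 1.0 if (done and self.visited == self.goal_sequence) else 0.0
        return reward, done

env = ColourSequenceMaze(["red", "green", "blue"])
r1, _ = env.step("red")      # correct, but no feedback yet
r2, _ = env.step("green")    # still no feedback
r3, done = env.step("blue")  # the only reward signal in the whole episode
```

A random explorer in this setting receives an informative signal only on the rare episodes where the full sequence happens to be right, which is exactly why token-by-token search fails here.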
This "combinatorial task" requires agents to master hierarchical problem-solving: they need not only low-level motor control but also high-level temporal planning.
It resembles a person fetching a cup of water, which decomposes into a coherent sequence of actions: "pick up the cup → walk to the table → put down the cup".
The "brain within the brain": AI discovers abstract actions on its own
So how did Google's team solve the problems caused by sparse rewards?
The answer is a meta-controller.
The meta-controller reads the base model's residual stream and generates a series of simple internal controllers.
Each controller corresponds to a temporally abstract action that spans an extended stretch of time and comes with its own termination condition. By composing multiple controllers over time, agents can explore new tasks efficiently.
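The "controller with a termination condition" pattern is essentially the options framework from hierarchical RL. The sketch below is our own toy rendering of that idea under stated assumptions (a 1-D state, invented names `make_goto_controller` and `run_option`); it shows how one abstract action runs for many low-level steps until its termination condition fires, and how composing two such actions solves a longer task.

```python
# Hedged toy sketch (our own, not the paper's implementation) of a temporally
# abstract action: a controller = (low-level policy, termination condition).

def make_goto_controller(target):
    """Build a controller that walks toward `target` on a 1-D line."""
    def policy(state):
        return 1 if state < target else -1   # primitive action: step left/right
    def terminated(state):
        return state == target               # termination condition
    return policy, terminated

def run_option(state, policy, terminated, max_steps=100):
    """Execute one abstract action to completion (many primitive steps)."""
    steps = 0
    while not terminated(state) and steps < max_steps:
        state += policy(state)
        steps += 1
    return state, steps

# Compose two abstract actions in time: "go to 3", then "go to 7".
state = 0
for target in (3, 7):
    policy, term = make_goto_controller(target)
    state, _ = run_option(state, policy, term)
```

The high-level decision is made only twice (choose a target), even though seven primitive steps are executed, which is the efficiency gain temporal abstraction buys.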
Figure 2: The meta-controller steers the residual-stream activations of the pre-trained autoregressive model.
Through self-supervised next-action prediction, the meta-controller discovers how to generate a sequence of simple internal controllers whose activity changes only sparsely over time.
In hierarchically structured tasks, each internal controller corresponds to a temporally abstract action, guiding the base autoregressive model toward a meaningful intermediate goal.
Figure 3: The architecture of the meta-controller.
Through reinforcement learning, the researchers found that the meta-controller automatically identifies meaningful behavior modules via variational inference, which amounts to unsupervised discovery of how to carry out abstract actions.
With the meta-controller, training a robot to make tea no longer requires manually decomposing the task into steps.
The meta-controller also performs dynamic temporal integration: a switch unit controls how long each abstract action lasts. And it supports combinatorial generalization, recombining learned abstract actions to solve new tasks.
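The switch unit's role can be sketched with a simple thresholded gate. This is our own illustration under assumptions the article does not state (a scalar gate value per step, a fixed threshold, an invented function name `run_with_switch`): the active controller persists until the gate spikes, so each abstract action gets a variable duration rather than a fixed one.

```python
# Our illustrative sketch of a "switch unit" (threshold and names are our
# assumptions): the active controller changes only when the gate spikes,
# so abstract actions persist for variable lengths of time.

def run_with_switch(gate_values, controllers, threshold=0.5):
    """Advance to the next controller each time the gate exceeds `threshold`."""
    active = 0
    trace = []
    for g in gate_values:
        if g > threshold and active < len(controllers) - 1:
            active += 1                  # switch to the next abstract action
        trace.append(controllers[active])
    return trace

# Gate spikes at steps 3 and 6 carve the episode into three segments.
trace = run_with_switch(
    gate_values=[0.1, 0.0, 0.2, 0.9, 0.1, 0.0, 0.8, 0.1],
    controllers=["go_red", "go_green", "go_blue"],
)
```

Combinatorial generalization then amounts to feeding the same mechanism a new ordering of already-learned controllers, with no retraining of the low-level behavior.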
Figure 4: The self-supervised meta-controller discovers temporally abstract actions in the pre-trained autoregressive model.
The switch patterns the meta-controller learns align perfectly with the true sub-goal transitions, even though the model never receives sub-goal labels. This environment-driven sub-goal switching emerges spontaneously, indicating that a hierarchical structure akin to "options" has formed inside the model.
Internal reinforcement learning: a new training paradigm with orders-of-magnitude efficiency gains
The most surprising part of the research is internal reinforcement learning driven by the meta-controller. Unlike traditional reinforcement learning, which fine-tunes in the original action space, internal reinforcement learning operates in the discovered abstract action space, shrinking the search space dramatically. On tasks requiring combinatorial generalization, its success rate is significantly higher than that of all baselines, including CompILE, the previously most advanced hierarchical reinforcement learning method.
Figure 5: The success rates of different reinforcement learning methods.
The reason agents can learn multi-step tasks with much higher probability is that, with the meta-controller, the model implicitly learns to decompose long-sequence tasks into reusable sub-programs (such as "move to a given color block"), which shrinks the search space and makes rewards no longer sparse.
In effect, the action space is reduced in dimensionality: the high-dimensional residual-stream space is compressed into a low-dimensional abstract space. Operating on the abstract time scale also shortens the effective horizon, making reward assignment at the abstract level far more efficient.
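A back-of-the-envelope calculation shows why acting in the abstract space helps so much. The numbers below are our own (the article gives no concrete horizons), but the structure of the comparison is general: exploring every primitive action over a long horizon grows exponentially in that horizon, while choosing a short sequence of sub-goals grows only in the number of abstract choices.

```python
# Our own illustrative numbers (not from the paper): compare the size of the
# exploration space for primitive actions vs. abstract sub-goal choices.

horizon = 30          # primitive steps per episode
n_actions = 4         # primitive actions (up/down/left/right)
n_colours = 5         # possible sub-goal colours
n_subgoals = 3        # length of the sub-goal sequence

primitive_space = n_actions ** horizon    # token-by-token exploration
abstract_space = n_colours ** n_subgoals  # pick 3 sub-goals in order

ratio = primitive_space / abstract_space  # how much smaller the search becomes
```

Even with these modest numbers, the abstract space has 125 elements while the primitive space exceeds 10^18, so a reward that is astronomically rare under primitive exploration becomes routinely reachable at the abstract level.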
How the "wake-sleep" training cycle is implemented
In a 2015 paper [2], Jürgen Schmidhuber proposed the theoretical framework of a "wake-sleep" training cycle.
Its core idea is an iterative, self-improving loop in which two phases alternate, aiming to build an autonomous intelligent system capable of forming and exploiting temporal abstraction and planning.
During the sleep phase, the agent reviews its past experience (sequences of observations and actions) and trains an internal world model through self-supervised learning.
During the wake phase, the agent uses the internal representations of the world model learned during sleep for reinforcement learning and planning, discovering new, valuable behaviors. The fresh experience gathered while awake is added to the experience buffer for the next sleep phase, improving the world model.
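The alternation described above can be sketched as a small loop. Everything here is a structural stand-in we invented (the toy model and environment just count interactions); the point is the data flow: sleep consumes the experience buffer to fit the world model, wake uses the fitted model to generate new experience, and the buffer grows each cycle.

```python
# Schematic sketch of the wake-sleep loop (structure only; the toy model and
# environment below are our stand-ins, not the paper's components).

class ToyWorldModel:
    def __init__(self):
        self.updates = 0
    def fit_next_token(self, trajectory):
        self.updates += 1          # stand-in for a self-supervised update
    def plan(self, obs):
        return 0                   # stand-in for model-based action selection

class ToyEnv:
    def rollout(self, policy):
        return [policy(None)]      # stand-in for collecting one trajectory

def sleep_phase(world_model, experience):
    """Self-supervised training of the world model on all past experience."""
    for trajectory in experience:
        world_model.fit_next_token(trajectory)
    return world_model

def wake_phase(world_model, environment):
    """Act/plan with the learned model and collect fresh experience."""
    return environment.rollout(policy=world_model.plan)

def wake_sleep_loop(world_model, environment, experience, cycles):
    for _ in range(cycles):
        world_model = sleep_phase(world_model, experience)            # build model
        experience = experience + [wake_phase(world_model, environment)]  # use it
    return world_model, experience

model, experience = wake_sleep_loop(ToyWorldModel(), ToyEnv(), [[0]], cycles=2)
```

Note that each sleep phase retrains on the enlarged buffer, so the world model keeps absorbing the behaviors discovered while awake.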
Google's research can be read as a concrete implementation of this wake-sleep cycle. Pre-training the autoregressive base model corresponds to the sleep phase: the model is trained on a large amount of unlabeled behavior data with the objective of predicting the next token (here, the next action or observation).
This is self-supervised learning. The model learns to infer the agent's latent goals (such as sub-goals) and forms temporally abstract representations in its residual-stream activations.
The wake phase is the meta-controller and the internal reinforcement learning it drives. The meta-controller learns to manipulate the residual-stream activations of the base model (the world model) to produce meaningful, long-lasting abstract actions (such as "go to the blue position").
This amounts to planning and control in the world model's internal state space.
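The core mechanical idea, acting on the model by editing its residual stream rather than its inputs or outputs, can be illustrated with a linear toy model. All shapes and names below are our assumptions (the article does not specify the architecture); the sketch only shows that adding a steering vector to the hidden activation changes the frozen model's output in a controlled, predictable way.

```python
import numpy as np

# Hedged toy sketch (our own) of residual-stream intervention: a controller
# adds a steering vector to the hidden activation before a frozen readout.

rng = np.random.default_rng(0)
d_model, n_actions = 8, 4
frozen_readout = rng.normal(size=(d_model, n_actions))  # frozen base-model head

def base_logits(residual):
    """The frozen base model's output given a residual-stream activation."""
    return residual @ frozen_readout

residual = rng.normal(size=d_model)        # activation at some layer/timestep
steering = np.zeros(d_model)
steering[2] = 3.0                          # the controller's intervention

plain = base_logits(residual)              # unmodified behavior
steered = base_logits(residual + steering) # same frozen weights, new behavior
```

Because the readout is linear in this toy, the intervention shifts the logits by exactly `3.0 * frozen_readout[2]`; in a real transformer the effect is nonlinear downstream, but the access point, the residual stream, is the same.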
Figure 6: The importance of freezing the pre-trained autoregressive model when discovering temporally abstract actions.
As Figure 6 shows, the correct switch representations aligned with sub-goals emerge only when the base autoregressive model is frozen while the meta-controller is trained.
This finding strongly supports the phased, iterative idea behind the wake-sleep cycle: first establish a high-quality, stable world model (the base model) through pre-training.
Then, on that foundation, drive internal reinforcement learning with the meta-controller to learn control strategies.
If the two are trained simultaneously (co-training), the model converges to a degenerate solution and fails to discover meaningful temporal abstractions.
This confirms the advantage of phased, iterative training and matches Jürgen Schmidhuber's cyclic scheme of "sleep first (build the model), then wake (learn control)".
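Mechanically, "freeze the base, train the controller" just means excluding the base model's parameters from the optimizer update. The sketch below is our own minimal illustration (plain dicts and a hand-rolled SGD step, not any particular framework): gradients may exist for everything, but only the controller's parameters are stepped.

```python
# Our minimal illustration of the freeze-then-train recipe: gradients are
# computed for all parameters, but only the meta-controller's are updated.

base_params = {"embed": 1.0, "head": 2.0}   # pre-trained world model (frozen)
controller_params = {"gate": 0.5}            # meta-controller (trainable)

def sgd_step(params, grads, lr=0.1):
    """Apply one SGD update to the given parameter dict only."""
    return {k: v - lr * grads.get(k, 0.0) for k, v in params.items()}

# A training step: grads exist for everything, update only the controller.
grads = {"embed": 1.0, "head": 1.0, "gate": 1.0}
controller_params = sgd_step(controller_params, grads)
# base_params is deliberately never passed to sgd_step: the world model
# stays stable, which (per Figure 6) is what lets switch representations emerge.
```

Co-training would correspond to also passing `base_params` through `sgd_step` each step, letting the world model drift toward the controller's current (initially poor) behavior, which is the degenerate solution the paper warns about.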
Ending the stochastic parrot debate
In large-model research, critics have long argued that no matter how many parameters they have, autoregressive models are just "stochastic parrots" that struggle to form coherent temporal abstractions and plans.
This research shows that next-token prediction, combined with a meta-controller, can induce hierarchical temporal abstractions strikingly similar to how humans solve problems.
Solving multi-step tasks without relying on manual reward shaping is a key step toward autonomous agents that can navigate complex, open-ended search spaces, where the definition of intermediate progress is often unknown.
Google's research marks a shift from simply optimizing model outputs to understanding and manipulating models' internal cognitive processes. It provides a practical foundation for developing general AI systems with genuine hierarchical reasoning, suggesting that imitating human sleep can enable efficient learning of complex sequential tasks.
Compared with interpretability methods such as sparse autoencoders (SAEs), the meta-controller has notable advantages: it directly reduces prediction error through residual-stream intervention, has internal memory, supports long-horizon interventions, and can discover interpretable, long-lasting intervention strategies.
The potential applications of this technology are extremely wide.
In robot control, it can let robots perform complex multi-step coordinated tasks; in mathematical reasoning, it can autonomously decompose hard problems into manageable steps; in scientific discovery, it can let agents explore and test hypotheses efficiently in sparse-reward environments.
The internal reinforcement learning paradigm proposed by Google is especially suited to scenarios requiring long-horizon planning and combinatorial reasoning, offering a new path toward truly general intelligent systems.
This article is from the WeChat official account "New Intelligence Yuan", author: New Intelligence Yuan. Republished by 36Kr with permission.