New work from Turing Award winner Sutton: a 1967 formula fixes a major flaw in streaming reinforcement learning
At the end of 2024, a paper titled "Streaming Deep Reinforcement Learning Finally Works" (arXiv:2410.14606) sparked extensive discussion in the academic community. The authors, from the Mahmood group at the University of Alberta, devoted a great deal of space to an embarrassing reality: reinforcement learning, a method that should inherently "learn on the go," can hardly do so in the era of deep neural networks. Remove the replay buffer and set the batch size to 1, and training collapses. They call this the "stream barrier".
The StreamX family of algorithms proposed in that paper barely crossed this barrier, and only with careful hyperparameter tuning, sparse initialization, and a battery of stabilization techniques.
However, less than a year and a half later, a member of the same research group, together with collaborators from the Openmind Research Institute, presented a completely different answer: The root cause of the stream barrier is not "insufficient data," but "choosing the wrong unit for the step size."
Paper title: Intentional Updates for Streaming Reinforcement Learning
Paper address: https://arxiv.org/pdf/2604.19033v1
Code repository: https://github.com/sharifnassab/Intentional_RL
How Much Trouble Can a Fixed Press of the Gas Pedal Cause?
Imagine you are learning to park a car in a garage. The coach tells you to "step on the gas pedal for 0.1 seconds" each time. The problem is that stepping on the gas pedal for 0.1 seconds can result in vastly different distances traveled by the car depending on whether it's going uphill, downhill, empty, or fully loaded. Sometimes, you're just one centimeter short of parking properly, and other times, you're 30 centimeters off and might even hit the wall.
The step size in traditional gradient learning does exactly the same thing: It specifies how much the parameters should move each time, but it has no control over how much the function output actually changes. In batch training, the errors of hundreds or thousands of samples are averaged out, diluting extreme cases, so the problem isn't obvious. However, in a "streaming" environment, there's only one sample per step, and there's no averaging. Once the gradient direction becomes unstable, the update amplitude will vary greatly - moving forward 30 centimeters one day and backward 50 centimeters the next. The learning process will collapse in severe oscillations.
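To make the mismatch concrete, here is a minimal toy sketch (not from the paper): with a fixed step size in parameter space, two states whose feature magnitudes differ can see wildly different changes in the predicted output - one barely moves, the other overshoots its target.

```python
w = 1.0        # single weight of a toy linear value function v(s) = w * phi(s)
alpha = 0.1    # fixed step size in parameter space

for phi, target in [(0.2, 1.0), (5.0, 1.0)]:    # two states with very different feature scales
    v = w * phi
    error = target - v
    grad = error * phi                  # descent direction on the squared error
    w_new = w + alpha * grad            # the "press the pedal for 0.1 seconds" update
    output_change = (w_new - w) * phi   # how far the prediction at this state actually moved
    print(f"phi={phi:4.1f}  error={error:+.2f}  output change={output_change:+.3f}")

# phi=0.2: the prediction moves by ~0.003 although the error was +0.80 (undershoot);
# phi=5.0: the prediction moves by -10 although the error was only -4.00 (overshoot past the target).
```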
This "overshooting and undershooting" phenomenon is particularly severe in reinforcement learning because the gradients at each time step not only vary in magnitude but also change direction rapidly.
Redefining "How Much One Step Should Do"
Arsalan Sharifnassab from the Openmind Research Institute, along with Mohamed Elsayed, A. Rupam Mahmood, Richard Sutton, and others from the University of Alberta, recently proposed a different way of thinking in a paper: Instead of specifying how much the parameters should move, directly specify how much the function output should change.
This idea didn't come out of thin air. In 1967, Japanese scholars Nagumo and Noda proposed the "Normalized Least Mean Squares" (NLMS) algorithm in the field of adaptive filtering in their paper "A learning method for system identification". In essence, it also infers the step size from the expected output change rather than the other way around. However, that algorithm is only applicable to simple linear scenarios.
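For reference, the NLMS update for a linear predictor - sketched below in present-day Python, not code from either paper - already embodies this "work backward from the desired error reduction" idea: dividing by the squared input norm means each update removes roughly a fixed fraction μ of the current error, no matter how large the input happens to be.

```python
import numpy as np

def nlms_step(w, x, d, mu=0.1, eps=1e-8):
    """One Normalized LMS update: w is the weight vector, x the input,
    d the desired output, mu the fraction of the current error to remove."""
    e = d - w @ x                                  # current prediction error
    w_new = w + (mu / (eps + x @ x)) * e * x       # step size scaled by 1 / ||x||^2
    # First-order check: the post-update error on this same sample is (1 - mu) * e,
    # regardless of how large or small x happens to be.
    return w_new, e

rng = np.random.default_rng(0)
w, w_true = np.zeros(4), np.random.default_rng(1).normal(size=4)
for _ in range(200):
    x = rng.normal(size=4) * rng.choice([0.1, 10.0])   # wildly varying input scales
    w, _ = nlms_step(w, x, w_true @ x, mu=0.5)
print("remaining weight error:", np.linalg.norm(w - w_true))
```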
The researchers extended this idea to deep reinforcement learning. They call it "Intentional Updates": Before each update, first clarify "what I want to achieve in this step," and then infer the appropriate step size.
For value learning (i.e., predicting future rewards), the intention they defined is that after each update, the value prediction error of the current state should be reduced by a fixed proportion - for example, by 5%, no more and no less. For policy learning (i.e., optimizing decision-making behavior), the intention they defined is: The selection probability of the current action is only allowed to change by a "moderate" amount at each step.
Using the driving analogy: It's like a driver deciding "I want the car to move forward 20 centimeters" before each operation and then automatically calculating how deeply to step on the gas pedal based on the current road conditions (slope, load), rather than stepping on the gas pedal to the same depth every time and leaving it to chance.
The Turing Award Winner and His Puzzle
One of the authors of the paper is Richard S. Sutton - the 2024 Turing Award winner, widely regarded as the "father of modern reinforcement learning."
Sutton's status in the field is roughly equivalent to Feynman's in physics: he not only proposed the two fundamental frameworks of modern reinforcement learning, temporal-difference learning (TD learning) and policy gradients, but also co-authored the field's most authoritative textbook, Reinforcement Learning: An Introduction (now in its second edition and freely available online). He and Andrew Barto shared the 2024 Turing Award, with a citation "for developing the conceptual and algorithmic foundations of reinforcement learning."
After winning the award, Sutton didn't choose to retire. Instead, he invested the prize money in the Openmind Research Institute he founded, specifically to support young researchers who are willing to explore fundamental problems in an environment free from commercial pressure. This new paper is from this non-profit institution.
The paper's first author, Sharifnassab, recently published the MetaOptimize framework at ICML 2025, which studies how to adjust the learning rate automatically and online. The two projects share the same focus: making the step size itself smarter.
Algorithm Details: Simpler Than Expected
The mathematical derivation of "Intentional Updates" isn't complicated. Its core formula can be described in one sentence: The step size is equal to the "expected output change" divided by the "actual influence of the gradient direction on the output."
In value learning, this "actual influence" is the norm of the gradient vector (equivalent to measuring how "steep" the current parameter region is): The steeper the place, the smaller the step size; the flatter the place, the larger the step size, thus ensuring that the impact of each update on the value function remains consistent.
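A back-of-the-envelope version of that sentence, written as a sketch under a first-order Taylor assumption (the paper's exact normalization may differ): if the parameters move along the gradient of the current state's value prediction, the prediction itself changes by roughly the step size times the squared gradient norm, so demanding that the TD error shrink by a fixed fraction κ pins down the step size.

```python
import numpy as np

def intentional_value_step(theta, grad_v, td_error, kappa=0.05, eps=1e-8):
    """Sketch of an 'intentional' value update (assumed form, not the paper's exact code).

    Intention: the value prediction at the current state should move by
    kappa * td_error, i.e. the prediction error shrinks by ~5% per update.
    First order: moving theta by alpha * td_error * grad_v changes v(s) by
    roughly alpha * td_error * ||grad_v||^2, so alpha = kappa / ||grad_v||^2.
    """
    alpha = kappa / (eps + grad_v @ grad_v)
    return theta + alpha * td_error * grad_v

# Tiny check with a linear value function v(s) = theta @ phi:
phi = np.array([0.3, 2.0, -1.5])
theta = np.zeros(3)
delta = 1.0 - theta @ phi                        # TD-style error toward a target of 1.0
theta_new = intentional_value_step(theta, grad_v=phi, td_error=delta)
print((theta_new @ phi - theta @ phi) / delta)   # ~0.05: the prediction moved by 5% of the error
```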
In policy learning, the "expected change" is defined to be proportional to the advantage function: the better the current action is compared with the average, the more the policy moves toward it. The magnitude is normalized by a running average so that, over the long run, each policy change stays within a stable, interpretable range.
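The same first-order trick on the policy side, again as a sketch with assumed names rather than the paper's actual algorithm: the intended change in the current action's probability is tied to a normalized advantage, and the step size is read off from the policy gradient's squared norm.

```python
import numpy as np

def intentional_policy_step(theta, grad_pi, advantage, adv_scale, eta=0.01, eps=1e-8):
    """Sketch of an 'intentional' policy update (assumed form, for illustration).

    grad_pi   : gradient of pi(a_t | s_t) with respect to theta
    adv_scale : running average of |advantage|, keeping the intended change in the
                action's probability within a stable, interpretable range over time
    eta       : how large a probability change counts as 'moderate'
    """
    intended_change = eta * advantage / (eps + adv_scale)   # desired shift in pi(a_t | s_t)
    alpha = intended_change / (eps + grad_pi @ grad_pi)     # first-order step that achieves it
    return theta + alpha * grad_pi
```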
The researchers also combined this core idea with two engineering practices: RMSProp-style diagonal scaling (to handle the magnitude differences between different parameter dimensions) and eligibility traces (to help the reward signal propagate to past time steps).
The result is three complete algorithms: Intentional TD(λ) for value prediction, Intentional Q(λ) for discrete-action control, and Intentional Policy Gradient for continuous control.
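Putting the pieces together, here is a rough sketch of what one step of an Intentional-TD(λ)-flavored update might look like, with the eligibility trace and RMSProp-style diagonal scaling folded in. This is an assumed structure for illustration only; the paper's actual algorithms handle the details differently.

```python
import numpy as np

def intentional_td_step(theta, grad_v, td_error, trace, second_moment,
                        gamma=0.99, lam=0.9, kappa=0.05, beta=0.999, eps=1e-8):
    """One illustrative step combining the intentional step size with an
    eligibility trace and RMSProp-style diagonal scaling (assumed structure)."""
    # Eligibility trace: lets the current TD error also credit recently visited states.
    trace = gamma * lam * trace + grad_v
    # RMSProp-style diagonal scaling: evens out per-dimension gradient magnitudes.
    second_moment = beta * second_moment + (1 - beta) * grad_v * grad_v
    direction = trace / (np.sqrt(second_moment) + eps)
    # Intentional step size: to first order, moving theta by alpha * td_error * direction
    # changes v(s) by alpha * td_error * (grad_v @ direction); pick alpha so that this
    # equals kappa * td_error. (A real implementation would also guard the case where
    # the projection grad_v @ direction is tiny or negative.)
    alpha = kappa / (eps + grad_v @ direction)
    theta = theta + alpha * td_error * direction
    return theta, trace, second_moment
```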
Experimental Results: Matching SAC Without a GPU
The paper evaluated this method on multiple standard benchmarks, and the results were impressive.
In MuJoCo continuous control tasks (including complex simulated robots such as Ant, Humanoid, and HalfCheetah), the final performance of the new method, Intentional AC, in the streaming setting (batch size = 1, no replay buffer) approached or even matched that of SAC - an algorithm that relies on a replay buffer and large-batch updates and is more or less the current gold standard for continuous control. In terms of computational cost, each Intentional AC update requires only about 1/140 of the floating-point operations of a single SAC update.
In Atari and MinAtar discrete action games, the performance of Intentional Q-learning was also comparable to that of DQN using a replay buffer, and it completed all tasks with the same set of hyperparameters without the need for individual parameter tuning.
The researchers also specifically verified whether the "intention" was actually achieved: They measured the ratio of the actual update amount to the expected update amount. In the simplified setting without eligibility traces, the standard deviation of this ratio was only between 0.016 and 0.029, and the 99th percentile was within 1.07; this means that in most cases, the update indeed achieved "doing exactly what was promised."
In addition, a set of ablation experiments showed that removing the RMSProp normalization or the σ term reduced performance but remained competitive, indicating that the "intentional scaling" itself is the primary contributor, with the other components playing a supporting role.
There Are Still Problems
The "Intentional Updates" framework also showed a clear advantage in robustness. When the researchers removed, one by one, the stabilization techniques that the StreamX methods rely on (sparse initialization, reward scaling, input normalization, LayerNorm), Intentional AC degraded far less than the original StreamAC, indicating that intentional scaling fundamentally reduces the dependence on these external "crutches."
However, the paper also candidly acknowledges an unsolved problem: in policy learning, the step size depends on the currently sampled action, which implicitly assigns different "weights" to different actions and may change the expected direction of the policy gradient. In the Humanoid and HumanoidStandup tasks, the researchers measured the cosine similarity of the expected update direction and found it stayed near 0.96 during the critical learning stage (almost no effect); in Ant-v4, however, the alignment dropped to a median of 0.63, showing that the problem cannot always be ignored.
The authors pointed out that future research should find a step size selection strategy independent of actions to ensure that the "intention" remains unbiased in the expected sense. This is a clear task left for future researchers in this field.
Conclusion: Let AI Learn on the Go Like Humans
The current mainstream training paradigm for large models relies on the batch processing of massive amounts of data: feeding in all the text and code from the Internet and iterating repeatedly until amazing capabilities emerge. This approach has proven effective, but it is fundamentally "learn first, then use": Once the training is complete, the model is frozen and cannot be continuously updated from each subsequent actual interaction.
What streaming reinforcement learning pursues is a completely different learning mode: It doesn't rely on massive replay buffers or large GPU clusters. Each experience is immediately translated into parameter updates, which is continuous, inexpensive, and adaptive. This is closer to the real learning methods of humans and animals.
From the initial breakthrough in 2024 by Elsayed et al. when they "finally made it work" to the "Intentional Updates" principle proposed in this paper, streaming deep reinforcement learning is maturing at an unexpected speed. It won't replace large models trained in batches, but for robots, edge devices that need long-term online adaptation, and any application scenarios that cannot afford large-scale replay buffers and GPU clusters, this approach is becoming increasingly convincing.
The step size is not just a hyperparameter; it is a commitment to how much the AI intends to do at each step. When that commitment finally becomes controllable, the learning process itself becomes stable.
This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014), written by an author who focuses on RL, and is republished by 36Kr with authorization.