
LLM Talent Drain: Reinforcement Learning Geniuses Get Poached, and the Field Suddenly Becomes a "No-Man's Land"

新智元, 2025-08-04 15:21
AlphaStar and similar systems proved that reinforcement learning excels at complex tasks such as games, far surpassing professional players. So why did reinforcement learning suddenly stall? Where exactly did it go astray?

Recently, Joseph Suarez, who studied AI and computer science at Stanford and earned his PhD at MIT, published a historical review of reinforcement learning.

The post went viral on X, where it currently has 382,000 views.

The cover image is eye-catching: a curve that rises rapidly, then climbs gently, and finally turns sharply downward, implying that the future of RL research is not promising.

From a historical perspective, what has happened to reinforcement learning? Why has it only really started to take off now?

He provided a unique personal perspective.

A Prestigious Academic Background

In 2019, he graduated from Stanford University with a bachelor's degree in computer science, specializing in artificial intelligence.

In 2018, he took a leave of absence to do a six-month internship at OpenAI, during which the first public version of Neural MMO was officially released.

Even earlier, he took part in research projects in Fei-Fei Li's group and Andrew Ng's laboratory.

Around 2017, he started working on reinforcement learning.

At that time, he was pursuing his PhD in Phillip Isola's laboratory at the Massachusetts Institute of Technology and began building the open-source computational research platform Neural MMO.

His research focuses on extending modern agent - based learning methods to more complex and cognitively realistic environments.

Later, this project became the theme of his entire doctoral dissertation.

Thesis link: https://jsuarez5341.github.io/static/jsuarez_phd_thesis.pdf

This also laid the foundation for his work on PufferLib.

At that time, the major laboratories were also doing reinforcement learning (RL) from scratch, with no language models involved.

In fact, this was the focus of most work at the time: multi-agent RL was just emerging, and all the core algorithms had just been published.

AlphaGo had shown researchers the potential of reinforcement learning. OpenAI Five was under development, and since he was interning at OpenAI at that time, he witnessed some of the work firsthand.

OpenAI's Dota 2 project completely convinced him of the magic of RL.

Paper link: https://cdn.openai.com/dota-2.pdf

If you don't play the game, it's hard to imagine how complex the problem is.

It's hard to believe people actually play DoTA for fun as a hobby. It's not exactly comparable to Go, but it involves many kinds of real-world reasoning that Go does not.

High- and low-level strategy, unit control, team coordination, and theory of mind are just a few examples.

OpenAI trained a 168-million-parameter network on about 1,000 GPUs and defeated top professional players.

Now, you can do it with 64 to 128 H100 GPUs.

And there's more than one result. There are also AlphaStar, Capture the Flag, Emergent Tool Use...

Figure: the evolution, during training, of the strategy of the agent ultimately selected to play professional player MaNa (black dot) and of its competitors (colored dots). Each colored dot represents one competitor in the AlphaStar league.

In a short period of time, there were several major RL demonstration projects. So, since the potential is so obvious, the field will surely continue to move forward, right... right???

Why Did RL Decline?

From 2019 to 2022, some work continued, but reinforcement learning was clearly on a downward trend.

Although there were more papers during those years, there weren't many lasting breakthroughs on the level of those from 2017-2019. What exactly happened?

The primary factor is academic myopia.

The entire field collectively established a set of standards without any real reason. Under these standards, it was almost impossible to make any progress.

For historical reasons, the Atari suite of 57 games (the set later associated with Agent57) became the most common benchmark.

Because results vary widely across tasks, all the games need to be run (ideally with multiple seeds per game). At the same time, the academic community decided that the x-axis should be the number of samples rather than wall-clock time.

The rationale was that sample count is closer to real-world learning, that many problems are limited by sampling rate, and that it spares you from worrying about hardware differences across papers.

The obvious problem, however, is that nothing limits the amount of hardware used, so benchmark numbers can be improved simply by spending more compute. Research became ever more expensive: a single run of a single game could take weeks of GPU time.
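To make that distortion concrete, here is a back-of-envelope sketch. The 200-million-frame budget is a common Atari setting; the two throughput figures are purely illustrative.

```python
def wall_clock_hours(sample_budget, steps_per_second):
    """Convert a fixed sample budget into wall-clock training time."""
    return sample_budget / steps_per_second / 3600

# The same 200M-frame Atari budget at two very different throughputs:
slow = wall_clock_hours(200_000_000, 2_000)      # sluggish research codebase
fast = wall_clock_hours(200_000_000, 1_000_000)  # heavily vectorized pipeline
print(f"slow: {slow:.1f} h, fast: {fast:.3f} h")  # prints: slow: 27.8 h, fast: 0.056 h
```

On a sample-count axis these two runs look identical, even though one takes over a day per seed and the other a few minutes.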

And since academia is averse to engineering, the codebases were also terribly slow. Never mind the limited budgets...

So you could end up needing 10,000 GPU-hours to run a set of ablations, at under 5% GPU utilization.

This way of conducting research simply doesn't work and has nothing to do with good science.

Without tens of thousands of GPU-hours, many people simply published papers without ablation experiments. No wonder the results of that era were largely non-reproducible.

In addition, the academic community chases fame and fortune.

Large language models (LLMs) emerged.

People often ask him why he hates LLMs. He really doesn't. What he hates is that they have siphoned off 99% of the geniuses from other fields instead of a more reasonable 80%.

He watched as his most talented colleagues left the RL research field one by one and were hired to work on LLMs. It's hard to blame them. Doing RL is terrible. It's hard, cruel work, going against a set of standards that seem to be specifically designed to hinder real progress.

Basic things that you take for granted in general deep learning, even things from 2015, don't exist in RL.

The hyperparameters don't make sense, the models can't be scaled, and simple tasks can't be smoothly transferred.

Although they have evidence that RL can work on amazing problems like DoTA and Go, the daily work feels like despair.

Now RL Is Repeating the Same Mistakes

Slow experiment cycles, over-optimized evaluation systems, sluggish development progress... does all of this sound familiar?

Modern RL research has somehow spent billions of dollars but has reproduced the chaos that initially stifled the development of RL, repeating the same mistakes.

David Peterson strongly agrees: reinforcement learning has inexplicably repeated the same mistakes several times; last time it was temporal-difference learning.

This time it will go further; after all, there's profit to be made... but it's extremely inefficient.

It's quite ironic to see the field fall back into the dilemmas that predecessors overcame years ago while creating new terms for various concepts.

“Multi-turn RL” just means “not a bandit.” That covers almost all new RL research, apart from some niche theoretical work.

“Long horizons” is nothing new either, and it is not the whole story of what makes these problems hard.

Joseph Suarez understands the current distrust of early RL research, because much of the published work really did have problems.

Looking for Another Way

Joseph Suarez still adheres to RL with small models from scratch.

This is no longer a declining old guard; they are making breakthroughs at remarkable speed.

So, what has changed?

After completing his doctorate, he decided to completely free himself from the arbitrary standards of the academic community and rebuild RL from scratch.

The new standard is wall-clock training time, with performance engineering treated as seriously as algorithmic work.

He spent several months dismantling all the slow infrastructure, aiming for a throughput of millions of steps per second instead of just a few thousand.

At first this was just an accelerated version of existing methods. Even that was enough to make industry problems tractable that had previously been too costly to attempt.

But it's more than that: this actually lets them do high-quality research at unprecedented speed. When you can run 1,000 times more experiments, you don't need an elaborate methodology; when every option can be tested, you don't need to carefully pick variables.

The latest benchmark tests show that on a single RTX 5090, the training speed of the reinforcement learning library PufferLib 3.0 can reach up to 4 million steps per second.
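For context, raw environment throughput like this is usually measured by timing batched steps over a vectorized simulator. The sketch below uses a trivial NumPy stand-in environment rather than PufferLib's actual API, purely to show the measurement itself:

```python
import time
import numpy as np

class ToyVecEnv:
    """A trivial stand-in for a vectorized environment (not PufferLib's API).

    Libraries like PufferLib batch many simulator instances so that one
    step() call advances all of them at once.
    """
    def __init__(self, num_envs, obs_dim=4):
        self.num_envs = num_envs
        self.obs_dim = obs_dim

    def reset(self):
        return np.zeros((self.num_envs, self.obs_dim), dtype=np.float32)

    def step(self, actions):
        # Dummy dynamics: the point is only to exercise stepping overhead.
        obs = np.random.rand(self.num_envs, self.obs_dim).astype(np.float32)
        rewards = np.ones(self.num_envs, dtype=np.float32)
        dones = np.zeros(self.num_envs, dtype=bool)
        return obs, rewards, dones, {}

def measure_sps(env, num_steps=1_000):
    """Wall-clock throughput in environment steps per second."""
    env.reset()
    actions = np.zeros(env.num_envs, dtype=np.int64)
    start = time.perf_counter()
    for _ in range(num_steps):
        env.step(actions)
    elapsed = time.perf_counter() - start
    return num_steps * env.num_envs / elapsed

env = ToyVecEnv(num_envs=256)
print(f"{measure_sps(env):,.0f} env steps/sec")
```

Numbers like the 4-million-steps-per-second figure depend on the simulator itself being written for batched execution; a Python-loop toy like this tops out far lower.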

A year ago, you needed an RL PhD and weeks to months to tackle each new problem; with no experience, even longer. Now novice programmers can get RL running on a new problem within a few days. Not the super-difficult problems, those still take some experience. But it's far better than before.

A sign they're on the right track: their experiments in simple environments generalize to harder ones.

They blame the earlier regime's batch sizes and certain degenerate hyperparameters. Not 100%: some techniques really do only matter on harder problems.

But they now have enough techniques whose experiments run in a few minutes, so the development cycle stays very fast.

Next step: use what they already have to solve valuable problems.

As long as a fast simulator can be built, RL mostly just works. In many problems it works out-of-the-box.

In the long run, they will return to the old question of sample efficiency, but approached from a standpoint of at least maintaining FLOP efficiency. No more running a 2-million-parameter network at batch size 8 on a GPU at under 5% utilization.
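A back-of-envelope sketch of that last point. It assumes the common approximation of roughly 6 FLOPs per parameter per sample for a combined forward and backward pass; the 10,000 steps-per-second rate and the 80-TFLOP/s peak are hypothetical round numbers.

```python
def flop_utilization(params, batch_size, steps_per_second, peak_tflops):
    """Fraction of peak GPU throughput used by a training loop.

    Uses the rough rule of ~6 FLOPs per parameter per sample
    (forward + backward pass combined).
    """
    flops_per_step = 6 * params * batch_size
    achieved_flops_per_sec = flops_per_step * steps_per_second
    return achieved_flops_per_sec / (peak_tflops * 1e12)

# A 2M-parameter network at batch size 8: even at 10,000 optimizer
# steps per second, the GPU does under 1 TFLOP/s of useful work.
util = flop_utilization(params=2_000_000, batch_size=8,
                        steps_per_second=10_000, peak_tflops=80)
print(f"utilization: {util:.2%}")  # prints: utilization: 1.20%
```

Even with generous assumptions, the result lands in the sub-5% regime the article complains about; raising the batch size scales `flops_per_step` directly.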