
The LLM "talent drain": reinforcement learning's best people are being poached, and the field is suddenly turning into a no-man's-land.

新智元, 2025-08-04 15:21
AlphaStar and other examples prove that reinforcement learning excels at complex tasks like games, beating professional players by a wide margin! So why did reinforcement learning suddenly stop working? And how did it end up on the wrong track in the first place?

Recently, Joseph Suarez, an AI and computer science PhD, published a historical review of reinforcement learning.

The thread went viral on 𝕏 and has already racked up 382,000 views.

The header image is eye-catching: a curve that climbs steeply, then flattens, then suddenly plunges, an ominous forecast for RL research!

What has happened in reinforcement learning, historically? And why is it only now really taking off?

He offers a unique personal perspective.

Taught by Renowned Teachers

In 2019, he completed his bachelor's degree in Computer Science with a focus on Artificial Intelligence at Stanford University.

In 2018, during a break from his studies, he did a six-month internship at OpenAI, where he released the first public version of Neural MMO.

Even earlier, he worked on research projects in Fei-Fei Li's group and in Andrew Ng's lab.

He has been involved in Reinforcement Learning since around 2017.

At that time, as a PhD student in Phillip Isola's lab at the Massachusetts Institute of Technology, he began building the open-source research platform Neural MMO.

His research focuses on extending modern agent-based learning methods to more complex, cognitively realistic environments.

Later, this project became the topic of his entire doctoral dissertation.

Link to the dissertation: https://jsuarez5341.github.io/static/jsuarez_phd_thesis.pdf

This also laid the foundation for his work on PufferLib.

At that time, various laboratories were also working on Reinforcement Learning (RL) from scratch, without language models.

In fact, that was where most of the work was at the time: multi-agent systems were just emerging, and all the core algorithms were just being published.

AlphaGo made researchers recognize the potential of reinforcement learning. OpenAI Five was in development, and since he was interning at OpenAI at the time, he witnessed some of that work firsthand.

OpenAI's Dota 2 project completely convinced him of the magic of RL.

Link to the paper: https://cdn.openai.com/dota-2.pdf

If you don't play this game, it's hard to imagine how complex this problem is.

It's hard to believe that people play Dota as a hobby. It isn't directly comparable to Go, but it involves many kinds of reasoning that never appear in Go and that carry over to the real world.

For example: high- and low-level strategy, control, team coordination, and theory of mind, just to name a few.

And OpenAI beat the best professional players with a 168-million-parameter network trained on roughly 1,000 GPUs.

Today, you can also achieve this with 64 to 128 H100 GPUs.

And it's not just one result. There are also AlphaStar, Capture the Flag, Emergent Tool Use...

Shown during training: the agents (black dots) that AlphaStar ultimately selected to face the professional player MaNa, and how their strategies evolved relative to their competitors (colored dots). Each colored dot represents a competitor in the AlphaStar League.

Within a short span there were several landmark RL demonstrations. With the potential so obvious, the field would surely keep advancing... right?

Why Has RL Taken a Backseat?

From 2019 to 2022, some work continued, but Reinforcement Learning declined significantly.

There were more papers in those years, but few breakthroughs on par with 2017-2019. What happened?

The first factor is academic short - sightedness.

The entire field collectively settled on a set of conventions for no real reason, and under those conventions progress is nearly impossible.

For historical reasons, the most common benchmark is the Atari suite of 57 games (the same 57 games behind DeepMind's Agent57).

Since results vary widely across tasks, you have to run all the games (ideally with multiple seeds each). At the same time, academia decided the x-axis should be the number of samples, not wallclock time.

The rationale is that this is closer to how learning works in the real world, since many problems are limited by sampling rate, and it means you don't have to worry about hardware differences across papers.

The obvious problem, though, is that hardware utilization is unconstrained: you can buy better benchmark numbers by throwing in more compute. Research has therefore become ever more expensive, to the point where a single game can take weeks of GPU time.
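The distortion can be sketched numerically. This is a toy illustration with invented throughput numbers: two runs that collect the same number of samples look identical on a samples x-axis, even when one implementation is 100x slower in wallclock terms.

```python
def train(num_samples, samples_per_second):
    """Toy training run: same learning curve, different throughput."""
    wallclock_s = num_samples / samples_per_second  # time to collect the data
    score = num_samples ** 0.5                      # toy "score grows with samples"
    return {"samples": num_samples, "wallclock_s": wallclock_s, "score": score}

# A well-engineered run vs. typical slow research code (hypothetical rates).
fast = train(num_samples=1_000_000, samples_per_second=100_000)
slow = train(num_samples=1_000_000, samples_per_second=1_000)

# On a samples x-axis the two runs are indistinguishable...
assert (fast["samples"], fast["score"]) == (slow["samples"], slow["score"])

# ...but in wallclock terms one takes 10 seconds and the other ~17 minutes.
print(fast["wallclock_s"], slow["wallclock_s"])  # -> 10.0 1000.0
```

A samples-only benchmark rewards the slow run just as much as the fast one, which is exactly why unconstrained compute can buy results.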

And since academia is averse to engineering, the code is terribly slow too. Never mind the limited budgets...

In the end you need 10,000 GPU-hours to run a set of ablations at under 5% utilization.

This type of research doesn't work at all and has nothing to do with good science.

Without thousands of GPU-hours, ablations weren't feasible for many people, so they simply published papers without them; no wonder the results from that era were hardly reproducible.

Moreover, the academic world pursues fame and fortune.

Then large language models (LLMs) arrived.

People often ask him why he hates LLMs. He really doesn't. What he hates is that they attracted 99% of the talent from other fields, instead of a more reasonable 80%.

He watched his most talented colleagues leave RL one after another to be hired for LLM research. It's hard to blame them. RL work was terrible: hard, punishing work against a system that seemed purpose-built to prevent real progress.

The basics you take for granted in mainstream deep learning, even the circa-2015 basics, simply don't exist in RL.

Hyperparameters make no sense, models don't scale, and even simple tasks don't transfer easily.

There was proof that RL works on astonishing problems like Dota and Go, but the day-to-day feeling was sheer hopelessness.

Today's RL Repeats the Same Mistakes

Long experiment cycles, over-optimized evaluation suites, slow iteration... sound familiar?

Modern RL research seems to have spent billions of dollars to reproduce the same problems that originally hindered the development of RL.

David Peterson fully agrees: reinforcement learning keeps repeating itself, fittingly enough, much like temporal-difference learning.

This time, it might go further because it's profitable... but it's extremely inefficient.

It's absurd to watch the field run into the same problems its predecessors overcame years ago, while coining new terms for existing concepts.

"Multi-turn RL" just means "not a bandit problem." That covers almost all new RL research, apart from some niche theory.
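As a rough illustration of that distinction (toy environments invented for this example, not from any real benchmark): a bandit interaction is a single, independent decision, while "multi-turn" RL threads state through many steps, so the return depends on the whole trajectory.

```python
# One-shot bandit: pick an arm, collect a reward, done. No state carries over.
def bandit_round(arm_means, pick):
    return arm_means[pick]  # noise omitted for clarity

# Multi-turn RL: actions move the state, and the return depends on the whole
# trajectory, so credit assignment spans many steps.
def episode(policy, step_fn, state, horizon):
    total = 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = step_fn(state, action)
        total += reward
    return total

# Toy chain environment: the state is a position, and reward arrives only
# when the goal at position 5 is reached.
def step_fn(pos, action):
    pos += action
    return pos, (1.0 if pos == 5 else 0.0)

always_right = lambda pos: 1  # trivial policy: always step toward the goal

print(bandit_round([0.1, 0.9], pick=1))              # -> 0.9
print(episode(always_right, step_fn, 0, horizon=5))  # -> 1.0
```

The point of the jargon complaint: the second loop is just the standard episodic MDP setting RL has always studied, under a new name.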

"Long horizons" (long-term planning) isn't new either, and it isn't the whole story of why the problem is hard.

Joseph Suarez understands today's mistrust of earlier RL research, because much of what was published really does have problems.

Looking for Another Route

Joseph Suarez continues to work on RL with small models from scratch.

But this is no longer the decaying old guard; it's advancing at an astonishing pace.

So, what has changed?

After finishing his PhD, he decided to break free of academia's arbitrary conventions entirely and rebuild RL from scratch.

The benchmark is wallclock training time, and performance engineering counts as much as algorithm development.

He spent months ripping out all the slow infrastructure, aiming for a throughput of millions of steps per second instead of a few thousand.
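A minimal sketch of what that kind of performance engineering looks like (a toy environment written for this example, not PufferLib's actual code): step thousands of environments in one batched NumPy operation instead of a Python-level loop over individual `env.step()` calls, then measure raw throughput.

```python
import time
import numpy as np

class BatchedToyEnv:
    """Toy vectorized environment: all envs advance in one fused NumPy op."""

    def __init__(self, num_envs):
        self.pos = np.zeros(num_envs, dtype=np.float32)

    def step(self, actions):
        self.pos += actions                            # update all envs at once
        rewards = (self.pos >= 10).astype(np.float32)  # reward on reaching goal
        self.pos[self.pos >= 10] = 0.0                 # auto-reset finished envs
        return self.pos, rewards

num_envs, iters = 4096, 1000
env = BatchedToyEnv(num_envs)
actions = np.ones(num_envs, dtype=np.float32)

start = time.perf_counter()
for _ in range(iters):
    obs, rew = env.step(actions)
elapsed = time.perf_counter() - start

# With batching, even this pure-Python loop reaches millions of env steps/s.
print(f"{num_envs * iters / elapsed:,.0f} env steps/s")
```

Real environments are far heavier than this toy chain, which is why reaching the same throughput there takes months of systems work rather than a batching one-liner.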

At first this was just an accelerated version of existing methods, which was already enough to solve industry problems that had previously been too costly to tackle.

But it goes further than that: the speed has actually let them do high-quality research at an unprecedented pace.