Former OpenAI researcher says stop RL research: the Internet is the only important technology.
Reinforcement Learning (RL) is considered a necessary step towards achieving Artificial General Intelligence (AGI).
However, in the view of former OpenAI researcher Kevin Lu, current RL is unlikely to achieve a significant breakthrough similar to the leap from GPT-1 to GPT-4. He believes that "we should stop conducting RL research and shift to product development."
His reason is straightforward: the technology that drives large-scale transformation in Artificial Intelligence (AI) is the Internet, not transformers.
In an article titled "The Only Important Technology Is The Internet," he wrote:
"In a low-data environment, transformers will be worthless.
We lack a general data source for RL... What's truly exciting is to find (or create) new data sources for RL!
The Internet itself is an important source of diverse supervision for models and a microcosm of humanity.
The Internet is the technology that truly enables the scaling of AI models."
In the article, he devoted a large portion to discussing a question: If the Internet is the dual of "next token prediction," what is the dual of RL?
"We are still far from discovering the correct dual of RL."
Academic Headlines has edited and condensed the original article without changing its meaning. The following is the result:
People often attribute the progress of AI to milestone papers such as the Transformer, RNNs, or diffusion models, but they overlook AI's fundamental bottleneck: data. So what does it really mean to have good data?
If we truly want to continue advancing AI, we should not focus on AI optimization techniques but on the Internet. The Internet is the technology that truly enables the scaling of AI models.
Transformers Are a Distraction
"Inspired by the rapid progress driven by architectural innovations (from AlexNet to Transformer in 5 years), many researchers began to seek better architectural priors. People rushed to bet on designing architectures superior to Transformer. In fact, since Transformer, better architectures have indeed been developed — but the question is, why have we hardly 'felt' any similar significant improvements since GPT-4?"
1. Paradigm Shift
Compute-bound: There was a time when methods scaled with available compute, and more efficient methods performed better. The key was to "stuff" data into the model as efficiently as possible. These methods not only achieved better results but also seemed to improve continuously with scale.
Data-bound: Research has not been useless. Since the Transformer, the community has developed better methods, such as SSMs (Albert Gu et al., 2021) and Mamba (Albert Gu et al., 2023). But we wouldn't call them strictly "better": given a fixed training compute budget, the best model to train is still a Transformer.
But under data constraints, there are more choices: the performance of all methods will eventually converge! So we should pick whichever method is best suited to inference, which may well be some subquadratic attention variant; such methods could soon attract renewed attention. A rough sketch of the inference-cost difference follows.
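To make that argument concrete, here is a minimal sketch (our illustration, not from the article) of why subquadratic attention is attractive at decode time: softmax attention must revisit the entire key-value cache for every new token, while a linear-attention variant (in the style of Katharopoulos et al., 2020) folds history into a fixed-size state.

```python
import torch
import torch.nn.functional as F

def softmax_attention_step(q, K, V):
    # One decoding step of standard attention.
    # q: (d,); K, V: (T, d) cached keys/values -- per-step cost grows with
    # context length T, so a full sequence costs O(T^2).
    w = torch.softmax((K @ q) / K.shape[-1] ** 0.5, dim=0)  # (T,)
    return w @ V  # (d,)

def linear_attention_step(q, k, v, S, z, eps=1e-6):
    # One decoding step of linear attention with a constant-size state.
    # S: (d, d) running sum of phi(k) v^T; z: (d,) running sum of phi(k).
    phi = lambda x: F.elu(x) + 1.0      # positive feature map (one common choice)
    S = S + torch.outer(phi(k), v)      # state update, independent of T
    z = z + phi(k)
    out = (phi(q) @ S) / (phi(q) @ z + eps)  # O(d^2) per step, constant in T
    return out, S, z
```

If both families converge to similar quality once data is the bottleneck, the constant-memory, constant-per-step variant is the one you want to serve.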
2. What Should Researchers Do?
Now, assume that we are not only concerned about inference (i.e., focusing on products) but also about asymptotic performance (i.e., achieving AGI).
- Optimizing the architecture is obviously the wrong approach.
- Figuring out how to truncate your Q-function trajectories is also wrong.
- Hand-crafting new datasets will not scale.
- New temporal Gaussian exploration methods are likewise unlikely to scale.
Much of the community has reached a consensus: we should study new ways to use data, along two main axes: (1) next token prediction and (2) RL. Yet beyond these two, we haven't made much progress.
What AI Does Is Just Use Data
These milestone works have provided new ways for AI to use data:
- AlexNet uses supervised classification to utilize the ImageNet dataset.
- GPT-2 uses next token prediction to utilize text data on the Internet.
- GPT-4o, Gemini 1.5 and other native multimodal models use next token prediction to utilize image and audio data on the Internet.
- ChatGPT uses RL to utilize stochastic human-preference reward data in chat scenarios.
- DeepSeek R1 uses RL to utilize deterministic, verifiable reward data in narrow domains (a sketch contrasting these two reward types follows this list).
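A hedged sketch of the two reward types above; the function names, the regex, and the `reward_model.score` interface are our illustrative assumptions, not from any of the systems listed:

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    # Deterministic, verifiable reward (DeepSeek R1 style): check the final
    # number of a math solution against a known answer; same input, same reward.
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", response.strip())
    return 1.0 if match and match.group(1) == gold_answer else 0.0

def preference_reward(response: str, reward_model) -> float:
    # Stochastic human-preference reward (ChatGPT style): a model trained on
    # pairwise human choices scores the response; noisy, but broad-coverage.
    return float(reward_model.score(response))  # assumed interface
```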
In terms of next token prediction, the Internet is the ideal solution: it provides abundant sequence-related data for this sequence-based method.
Figure | The Internet is full of sequences laid out in structured HTML, which suits next token prediction. Depending on how the data is ordered, you can recover many different useful capabilities.
This is no coincidence: such sequential data is perfectly suited to next token prediction, and the Internet and next token prediction are duals of each other.
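For readers who want the mechanics spelled out: next token prediction is just cross-entropy on each position's successor, which is why any ordered Internet data (HTML, threads, code) works as supervision. A minimal sketch, with `model` assumed to map token ids to next-token logits:

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, T) integer ids from any sequential Internet data.
    logits = model(tokens[:, :-1])             # predict position t+1 from the prefix
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),  # (batch*(T-1), vocab)
        tokens[:, 1:].reshape(-1),             # targets: the shifted sequence
    )
```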
1. Planet-scale Data
In a forward-looking talk in 2020, OpenAI researcher Alec Radford pointed out that, despite the many new methods being proposed, they all seemed insignificant next to collecting more data. In particular, we can no longer hope for "magical" generalization from better methods; a simple principle holds instead: if the model is not told something, it will not know it.
Instead of manually specifying what to predict by building one supervised dataset after another...
...we should find a way to learn from, and predict, everything "out there" in the world.
You can think of creating a dataset as setting the importance of everything else in the world to 0 and the importance of everything inside the dataset to 1.
Poor models! They know so little, and so much is still hidden from them.
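Radford's weighting metaphor can be written out as a loss (our notation, not the talk's): training on a curated dataset D is training on the world's distribution under a hard indicator weight.

```latex
\mathcal{L}_D(\theta)
  = \mathbb{E}_{x \sim \mathrm{world}}\!\left[\, w_D(x)\,\ell(x;\theta) \,\right],
\qquad
w_D(x) = \begin{cases} 1 & x \in D \\ 0 & x \notin D \end{cases}
```

Internet-scale pre-training amounts to relaxing that weight until it covers nearly everything humans have written down.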
After GPT-2's release, the world began paying attention to OpenAI, and subsequent events bore that out.
2. If There Were Only Transformers but No Internet
Low-data: An obvious counterfactual is that in a low-data regime, Transformers would be worthless: their architectural prior is weaker than that of convolutional or recurrent networks, so they should underperform a comparable convolutional network.
Books: A less extreme scenario is that without the Internet, we might pre-train models on books or textbooks. Among all human data, we usually take textbooks to represent the pinnacle of human wisdom: their authors are well-educated and weigh every word. Essentially, this is the view that data quality beats data quantity.
Textbooks: Microsoft's phi models ("Textbooks Are All You Need," Suriya Gunasekar et al., 2023) show excellent performance at small scale, but they still rely on GPT-4, itself pre-trained on the Internet, to filter and generate synthetic data. And much like models trained only on academic text, phi lags behind similarly sized models on world knowledge, which SimpleQA makes visible.
Indeed, the phi models perform quite well, but we have yet to see them match similarly sized models trained on Internet data, and textbooks plainly lack a great deal of real-world and multilingual knowledge. In compute-bound settings, however, they are strong.
3. Data Classification
I think this connects interestingly to the RL data classification above. Textbooks are like verifiable rewards: their statements are (almost) always correct. Books, especially creative writing, may contain more data about human preferences, making the models trained on them more diverse.
Just as we wouldn't trust o3 or Sonnet 3.7 to write for us, we might suspect that models trained only on high-quality data lack a certain creativity. Relatedly, the phi models have no clear product-market fit (PMF): when you need knowledge, you reach for a large model; and when you want a local model for role-play writing, phi is rarely the choice either.
The Beauty of the Internet
In fact, books and textbooks are just compressed forms of Internet data, even if a powerful intelligence did the compressing. Beyond that, the Internet itself is an important source of diverse supervision for models, and a microcosm of humanity.
At first glance, many researchers may find it strange that making research progress requires turning our attention to products. But I think it's natural: if we care about AGI doing something useful for humans, rather than just displaying intelligence in an isolated environment (like AlphaZero), then it's reasonable to think about the form, the product, that AGI takes. I find the co-design between research (pre-training) and product (the Internet) beautiful.
Figure | Source: Thinking Machines Lab
1. Decentralization and Diversity
The Internet exists in a decentralized way, and anyone can add knowledge to it: there is no single central source of truth. There are a large number of diverse views, cultural symbols, and low-resource languages on the Internet; if we pre-train models on this content, we can obtain an intelligent agent that can understand a vast amount of knowledge.
This means the people who run Internet products play an important role in the design of AGI! If we reduce the Internet's diversity, the entropy of models on RL tasks will drop significantly; if we delete certain data, entire subcultures will be absent from AGI.
Alignment: There is a very interesting result: to obtain an aligned model, you must pre-train it on both aligned and non-aligned data ("When Bad Data Leads to Good Models," Kenneth Li et al., 2025), so that pre-training learns a linearly separable direction between the two. If non-aligned data is removed entirely, the model cannot deeply represent what non-aligned data is, or why it is considered bad (Xiangyu Qi et al., 2024; Mohit Raghavendra et al., 2024).
Figure | Higher ToxiGen scores mean more toxic output. The model pre-trained with 10% toxic data (10% toxic data + steering) ends up less toxic than the model pre-trained with 0% toxic data (clean data + steering).
Notably, the "toxic" data above comes from an anonymous online forum known for unrestricted discussion and toxic content. This is one concrete example of the deep link between product and research (to obtain an aligned research model, we need that unrestricted discussion to exist), and there are many other cases where the Internet's design decisions shape training outcomes.
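One way to picture the "linearly separable direction" above is a linear probe on hidden activations. The following is a hedged sketch; the logistic-probe setup and all names are our illustration, not the method of Li et al., 2025:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def toxicity_direction(H_clean, H_toxic):
    # H_clean, H_toxic: (n, d) hidden activations for clean / toxic text,
    # assumed extracted from some layer of a pre-trained model.
    X = np.vstack([H_clean, H_toxic])
    y = np.concatenate([np.zeros(len(H_clean)), np.ones(len(H_toxic))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    w = probe.coef_[0]
    return w / np.linalg.norm(w)   # unit vector separating the two classes

def steer(h, w, alpha=1.0):
    # Dampen the component of an activation along that direction.
    return h - alpha * np.dot(h, w) * w
```

The point of the result above is that this direction only exists to be found, and steered along, if pre-training saw both kinds of data.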
Here is an example from outside alignment ("Improving Image Generation with Better Captions," James Betker et al., 2023). This work underlies DALL-E 3: by regenerating image captions, the model can better distinguish "good" image-text pairs from "bad" ones, and the technique has since been adopted by nearly all generative models. It is analogous to the like/dislike mechanism in human preference rewards.
2. The Internet Is a Skill Curriculum Library
Another important feature of the Internet is that it contains knowledge across a wide range of difficulty levels: from primary-school material (such as Khan Academy), to university courses (MIT OpenCourseWare), to frontier science (arXiv). If you train a model only on frontier science, a great deal of implicit, unwritten knowledge can never be learned.
This matters. Imagine you have a dataset, you train a model on it, and the model learns what's in it. What next? You can manually collect the next dataset: OpenAI initially hired data annotators at $2 per hour; later it hired doctoral-level staff at roughly $100 per hour; and now its frontier models perform software engineering (SWE) tasks worth $10,000.
But this requires a lot of work, right? We initially manually collected datasets such as