Dwarkesh Patel: The next generation of AI is likely to be built by actually getting things done.
Dwarkesh Patel, a well-known tech podcast host in Silicon Valley, recently posed a question: what will the next-generation training paradigm for AI be?
Dwarkesh Patel is a tech podcast host and writer who has quickly gained popularity in Silicon Valley in recent years. At only 25 years old, he has entered the core circle of AI discussions through his Dwarkesh Podcast. His interviewees include AI and tech figures such as Ilya Sutskever, Andrej Karpathy, Dario Amodei, Demis Hassabis, and Mark Zuckerberg. TIME included him in the 2024 TIME100 AI list, noting that his podcast has become essential listening for many AI practitioners.
In the latest episode of his podcast, he summarized the current bets of cutting-edge AI labs in a single keyword: RLVR, which stands for Reinforcement Learning with Verifiable Rewards.
Simply put, the model repeatedly attempts a large number of tasks whose answers can be automatically judged right or wrong, and through this trial and error it is trained to plan, correct its mistakes, iterate, and execute over long horizons. Much of today's rapid progress in fields such as code and mathematics comes from this idea.
But what Dwarkesh really wants to ask is: can the next generation of AI rely solely on this "verifiable task training"?
His answer is: Probably not.
Because it's not enough for a task to be just "verifiable"; it must also be "grindable".
The key concept here is grindability. In the context of AI training, it describes whether a task can be attempted over and over, that is, whether rollouts of it can be run at large scale.
Code tasks are typical grindable tasks. You can prepare a software repository, a bug to be fixed, and a test case, then replicate the same environment thousands of times and let thousands of agents attempt the fix simultaneously. Any agent whose attempt passes the test earns a reward. The process is parallel, reproducible, and resettable, which makes it particularly suitable for RLVR.
Mathematical problems are similar. Whether the answer is correct can be verified, and the training environment is easy to replicate.
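To make the "verifiable and grindable" loop concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions: the candidate patches, the test cases, and the bandit-style weight update are all hypothetical stand-ins, and real RLVR systems train a neural policy with policy-gradient methods rather than a table of weights. What the sketch does show is the shape of the loop: the same task is reset and replayed many times, and a test suite decides the reward automatically.

```python
import random

# Hypothetical "bug fix" task: candidate patches for an absolute-value function.
CANDIDATE_PATCHES = {
    "identity": lambda x: x,
    "negate": lambda x: -x,
    "abs_fix": lambda x: x if x >= 0 else -x,
}

# The verifier: test cases that judge right or wrong automatically.
TEST_CASES = [(3, 3), (-4, 4), (0, 0)]


def verify(patch_name):
    """Return reward 1.0 if the patch passes every test case, else 0.0."""
    patch = CANDIDATE_PATCHES[patch_name]
    return 1.0 if all(patch(x) == want for x, want in TEST_CASES) else 0.0


def run_rollouts(weights, n_rollouts, rng):
    """Replay the same resettable task n_rollouts times and collect rewards."""
    names = list(weights)
    results = []
    for _ in range(n_rollouts):
        choice = rng.choices(names, weights=[weights[n] for n in names])[0]
        results.append((choice, verify(choice)))
    return results


def train(n_rounds=20, n_rollouts=100, lr=0.5, seed=0):
    """Toy stand-in for a policy update: upweight whatever the verifier accepts."""
    rng = random.Random(seed)
    weights = {name: 1.0 for name in CANDIDATE_PATCHES}
    for _ in range(n_rounds):
        for choice, reward in run_rollouts(weights, n_rollouts, rng):
            weights[choice] += lr * reward
    return weights


weights = train()
best = max(weights, key=weights.get)
print(best, weights)
```

Nothing in this loop touches the outside world, which is exactly why it can be repeated as often as compute allows. The tasks discussed next lack that property.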
But Dwarkesh asked a very interesting question: Why has AI made slower progress in "using a computer" than in code and mathematics?
On the surface, computer use is also verifiable. Whether an order has been placed, whether an event venue has been booked, and whether a tax form has been submitted can all be checked. The problem is that such tasks are difficult to replicate and replay at scale. You can't have a thousand agents run the same Amazon checkout process simultaneously, because real websites detect bots, ban accounts, and change state with every action. You can of course clone applications like Slack, Gmail, and Amazon to build a simulator, but at the current stage this remains a high-cost project that scales poorly.
Dwarkesh pointed out that AI makes rapid progress in a field not only because the answers in that field are verifiable, but also because the field can be packaged into a training environment that is replicable, replayable, and open to parallel trial and error.
This also explains why code, mathematics, and game-related tasks have become natural breeding grounds for RLVR, while many real-world tasks are difficult to bring directly into this training paradigm.
Then, he extended the question to the more complex real world.
- If we want to train an AI to start a business from scratch, what should we do?
- If we want to train it to win a lawsuit, what should we do?
- If we want to train it to make stable profits in the market or help a candidate win an election, what should we do?
These tasks certainly have outcomes. Whether a company has been established, whether a lawsuit has been won, whether a transaction has been profitable, and whether an election has been won can all be judged in the end.
But the problem is that the feedback is too slow, there are too many variables, the world cannot be reset, and it cannot be replicated a thousand times in a data center.
A startup may take several years to play out. A political campaign depends on the specific region, candidates, voter sentiment, media environment, and chance events. A legal case cannot be copied into a thousand parallel universes that share the same starting point so that different agents can each try their own approach.
In reinforcement learning terms, this kind of environment is close to what is called a reset-free, non-stationary environment: it cannot be reset at will, and the environment itself keeps changing.
Dwarkesh therefore asked: Can agents trained in verifiable and grindable environments by RLVR really generalize to these real - world tasks?
This is not a question that can be answered by slogans; it is an empirical question.
Optimists will say that with enough sufficiently complex RLVR environments, the model will eventually learn general agent capabilities. The planning and trial-and-error skills it develops in code, mathematics, web pages, and tool use will eventually transfer to fields such as entrepreneurship, organizational management, politics, law, and scientific research.
But Dwarkesh is skeptical about this.
Because the most valuable knowledge in the real world often does not arrive in a clear, verifiable, and repeatable form. It may come from a vague piece of customer feedback, a failed meeting, an implicit process within an organization, or a failure mode that only surfaces in real tasks. To learn these things, the model cannot rely solely on grinding through problems; it must also be genuinely sample-efficient.
This brings the discussion to its most important point: learning back into the weights.
Today's large models are already very good at in-context learning. They can read a large amount of material in a long context, understand the background of a project, and temporarily adapt to the needs of a user or an organization. The problem is that most of this learning stays within the context window. Once the conversation ends, the model does not necessarily retain any of it.
Dwarkesh believes that this is a huge waste.
Because the truly valuable training signals for the model appear precisely after deployment. The model is used by real users, enters real organizations, participates in real tasks, and exposes real errors. It will see how a company operates internally, what people actually use it for, where failures often occur, and which suggestions are simply unworkable in reality.
But if these experiences cannot be consolidated back into the model's weights, they amount to short-term adaptation within a conversation rather than long-term growth in capability.
He made an analogy with human learning: People do not become stronger by memorizing every single thing that happens every day word for word. An employee becomes useful after working for half a year not because he remembers every email and every meeting record, but because he has compressed these experiences into judgment, intuition, process understanding, and problem patterns.
The model should be the same.
True continual learning is not about infinitely expanding the KV cache, nor about stuffing all historical records into the context. It is about extracting the small amount of truly useful knowledge from real experience and compressing it into the weights.
This is exactly the problem that Dwarkesh believes the next-generation training paradigm must solve.
So how would this work in practice?
He mentioned a direction that is currently being discussed: on-policy self-distillation, abbreviated as OPSD.
It can be roughly understood as follows: let a model that has accumulated a large amount of experience over a long conversation act as a "veteran employee", or teacher; then train the base model so that it makes judgments similar to the teacher's without having the full context.
That is to say, distill what the model has learned through the context in a real task back into the model's own weights.
This is different from ordinary SFT. The most basic SFT may just let the model predict the tokens that have appeared in the conversation, which is equivalent to asking it to repeat the entire work log. But this is not effective learning. What really matters is not remembering all the details, but extracting the key insights that can help the model do better next time.
The advantage of OPSD is that it does not necessarily require an external verifiable reward. As long as the model can learn useful things in the context, it can use the "model after learning" as a teacher and let the base model approach it.
At the same time, compared with ordinary RL, which provides only a final reward, OPSD can provide a denser supervision signal. It can compare the teacher's and the student's probability distributions at the token level, thereby compressing the scarce experience from a real task into smaller and more precise weight updates.
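The contrast between one final reward and a token-level signal can be sketched as follows. This is a minimal illustration, not Dwarkesh's or any lab's actual recipe: the logits are made-up numbers over a three-token vocabulary, and the choice of KL(student || teacher) as the divergence is an assumption, since the source does not say which measure OPSD uses. The "teacher" stands for the model with the full context, the "student" for the same model without it.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def per_token_distillation_loss(teacher_logits, student_logits):
    """One KL(student || teacher) term per token position (assumed direction)."""
    return [
        kl_divergence(softmax(s), softmax(t))
        for t, s in zip(teacher_logits, student_logits)
    ]


# Hypothetical logits at two token positions, vocabulary of size 3.
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # model with full context
student_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # same model, no context

losses = per_token_distillation_loss(teacher_logits, student_logits)
print(losses)
```

At the first position the two distributions agree, so the loss is zero and nothing needs to change; at the second, the student is undecided where the teacher is confident, so that position carries the learning signal. A single end-of-task reward could not localize the disagreement this way.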
In addition to OPSD, Dwarkesh also proposed another direction: dreaming.
Here, dreaming means that the AI constructs a simulated environment from real-world observations and then repeatedly practices, tries out strategies, and reinforces effective behaviors inside it.
This sounds very similar to model-based RL in traditional reinforcement learning, and also to what Sutton has long emphasized: agents accumulate experience through interaction with the environment. The difference is that Dwarkesh places it in the context of large models and real-world deployment.
For example, after an AI observes a business process in a real company, instead of just writing a summary, it spends a large amount of compute constructing a "game-like simulated environment" for that process. It then tests different communication strategies, execution paths, and ways of moving a project forward inside the simulation to see which are more likely to succeed. Finally, it compresses the experience gained from this simulated practice back into the model.
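A heavily simplified sketch of that loop is shown below. Everything in it is hypothetical: the observation log, the strategy names, and the "world model", which here is nothing more than a smoothed success rate per strategy rather than a learned simulator of a business process. It only illustrates the two phases, fitting a model to a handful of real observations and then practicing inside that model as many times as compute allows. The final step, compressing what was learned back into the weights, is not shown.

```python
import random
from collections import defaultdict

# Hypothetical log of real-world observations: (strategy tried, did it succeed?)
OBSERVATIONS = [
    ("email_first", True), ("email_first", False), ("email_first", False),
    ("call_first", True), ("call_first", True), ("call_first", False),
    ("call_first", True),
]


def fit_world_model(observations):
    """Estimate a success probability per strategy (Laplace-smoothed counts)."""
    wins, tries = defaultdict(int), defaultdict(int)
    for strategy, success in observations:
        tries[strategy] += 1
        wins[strategy] += int(success)
    return {s: (wins[s] + 1) / (tries[s] + 2) for s in tries}


def dream(world_model, n_episodes, rng):
    """Practice inside the learned simulator, which can be reset at will."""
    returns = {s: 0 for s in world_model}
    for strategy, p_success in world_model.items():
        for _ in range(n_episodes):
            returns[strategy] += rng.random() < p_success
    return {s: total / n_episodes for s, total in returns.items()}


world_model = fit_world_model(OBSERVATIONS)
dreamed_success = dream(world_model, n_episodes=5000, rng=random.Random(0))
best_strategy = max(dreamed_success, key=dreamed_success.get)
print(world_model, best_strategy)
```

The seven real observations are scarce and cannot be replayed, while the 5,000 simulated episodes per strategy cost only compute. The catch, in this toy as in the real proposal, is that practice inside the simulator is only as good as the model fitted from those few observations.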
If this approach is feasible, it may become a new scaling axis.
In the past, the scaling of AI mainly came from three axes: pretraining, RL, and inference-time compute. Dwarkesh envisions a possible fourth axis: test-time training, or dreaming. The model would not only perform inference but also, while reasoning and executing tasks, construct simulated environments for specific users, organizations, and projects and train itself in them.
This is why someone in the comment section mentioned David Silver and Richard Sutton's "Welcome to the Era of Experience": That article also emphasizes that AI cannot always rely on human data, and the key to the next stage will be for agents to obtain experience from their own interaction with the environment.
Dwarkesh has made this high-level judgment concrete as a problem in today's large-model training: RLVR is an important transitional stage that lets the model develop agent capabilities on verifiable tasks; but to enter the more complex real world, the model must learn continuously from real-world deployment and write that experience back into the weights.
In Dwarkesh's vision for 2027 or 2028, the training process may be like this:
- First, RLVR trains a reasonably competent agent. Faced with an unfamiliar problem, this agent can at least work out the situation, try different strategies, and keep iterating after hitting obstacles;
- Then, this agent is deployed in the real world to start doing real work. It may work continuously with the user for a week and participate in a project that is not in the original training distribution;
- After a week, the user gives it a thumbs up or thumbs down, or even writes a work evaluation. If the result is positive, the model will distill what it has learned from this task back into the base model. This process may use OPSD, dreaming, or some new technology that has not yet emerged.
Once this path is successful, the ability boundary of AI will no longer be limited by those initial "verifiable tasks".
It can first learn code, mathematics, web tasks, and tool invocation through RLVR; then learn organizational management, business processes, and complex collaboration through real-world deployment; and then continue to expand to adjacent fields based on these experiences.
This also means that the main source of AI progress may change.
In the past, a model was trained before release, and users merely used it. For next-generation models, the pattern may become: train a basic agent before release, then keep learning from a large number of real tasks after release. Every interaction with a user, every real project, every failure and correction may become material for the next round of capability improvement.
So what Dwarkesh calls the "next-generation training paradigm" is not simply about making the model larger, adding more data, or strengthening RL.
What it really points to is AI moving from pre-release training to post-release learning; from human data to environmental experience; from temporary adaptation in the context to long-term capabilities in the weights.
The most important AI training data in the future may no longer be just the existing text on the Internet, nor just the verifiable tasks constructed in the laboratory, but the experiences that AI accumulates when completing real tasks in the real world.
Reference link:
https://x.com/dwarkesh_sp/status/2070551894674555081
This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014). The author focuses on AI training. It is published by 36Kr with authorization.