
Reinforcement Learning in the Agent Era: AReaL Framework and Best Practices for Agents

Geekbang Technology InfoQ, 2026-03-04 17:31

With the rapid development of large models and agent technology, reinforcement learning (RL) is becoming the key engine for enhancing the autonomous decision-making capabilities of AI agents. However, traditional RL training methods face challenges such as high computational cost, large data requirements, and high system complexity, which limit the large-scale deployment of agents.

This article is compiled from a talk by Dr. Wu Yi, assistant professor and doctoral supervisor at the Institute for Interdisciplinary Information Sciences of Tsinghua University and a former OpenAI researcher, at the 2025 QCon Global Software Development Conference (Shanghai), titled "Reinforcement Learning in the Agent Era: The AReaL Framework and Best Practices for Agents". In his talk, he introduced AReaL, a reinforcement-learning training system for agent models, and its best practices in agent scenarios. Through real-world data and reproducible code, he demonstrated how AReaL helps developers and enterprises efficiently build agent systems and promotes the transition of AI agents from the laboratory to industrial applications.

Highlights

  • Breakthroughs in reinforcement-learning technology for agents;
  • Exclusive open-source practices;
  • Deployment in cutting-edge agent scenarios.

The following is the transcript of the speech (edited by InfoQ without changing the original meaning).

Hello everyone. My name is Wu Yi, and I'm an assistant professor at the Institute for Interdisciplinary Information Sciences of Tsinghua University. For many years, I've been engaged in research related to reinforcement learning and agents. Today, I'm very honored to be invited here to share the work of our team and some new developments in the field of agents for reinforcement learning in the era of large models.

Today, I'd like to share two important viewpoints with you:

  • Agents are the most important direction for AGI over the next 5 years;
  • Reinforcement learning is the key technology behind agents.

I hope the following talk will give you a deeper understanding of these two viewpoints.

1 What the AReaL Team Aims to Do: Building Agents with RL

Let's start with reinforcement learning. Many people's understanding of reinforcement learning began with AlphaGo. At that time, DeepMind used reinforcement learning to train a Go-playing agent, which defeated world-class Go players Lee Sedol and Ke Jie. Later, OpenAI also achieved remarkable results in games like DOTA using reinforcement learning, defeating the world-champion OG team. These events brought reinforcement learning into the public eye. However, in these early applications, reinforcement-learning agents were mostly concentrated in the gaming field. This makes people wonder: in the era of AGI driven by large models, what exactly is the relationship between reinforcement learning and large models?

In fact, the relationship between reinforcement learning and large models was not always so close. The situation only changed significantly between 2020 and 2022. In 2020, OpenAI launched the API for GPT-3, whose capabilities were quite different from today's APIs. For example, if you asked it to "explain the moon landing to a 6-year-old child in a few sentences", it might not do the task well. This is because large models are trained on "next-word prediction", and that training method is not suited to carrying out complex instruction tasks.

This problem is called the "instruction-following problem". Simply put, when we issue instructions to the model, we want it to understand and complete the task, rather than merely predict the next word. In 2020, large models performed poorly at instruction-following. Over time, however, OpenAI continuously improved the API so that it could better understand and execute user instructions. This improvement process not only enhanced the practicality of large models but also brought reinforcement learning and large models closer together.

The solution to the large-model instruction-following problem was the InstructGPT model, first introduced in 2022. Its core is "Reinforcement Learning from Human Feedback" (RLHF). At the time, researchers found that although large models have powerful language-generation capabilities, their outputs often fail to follow human instructions precisely and may even produce unexpected content. To solve this, the research team adopted RLHF: using manually annotated data, they trained a reward model that judges whether the model's output conforms to human instructions.

Specifically, the researchers first collected a large amount of manually annotated example data, including task inputs and expected outputs. They then fine-tuned the pre-trained GPT-3 model on this data so that it could initially follow instructions. On this basis, the team collected preference rankings over the model's outputs and used them to train the reward model. Finally, they optimized the model with reinforcement-learning algorithms (such as PPO) so that, guided by reward signals, it generated outputs better matching human intentions. It was on the basis of RLHF that OpenAI launched the epoch-making AI product ChatGPT at the end of 2022.
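The stages above (supervised fine-tuning, reward-model training, PPO) can be sketched in miniature. This is a toy illustration under my own assumptions, not OpenAI's actual implementation: the reward model is reduced to a pairwise Bradley-Terry loss over scalar scores, and PPO to its clipped surrogate objective.

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    # Bradley-Terry pairwise loss: trains the reward model to score the
    # human-preferred output above the rejected one. Loss shrinks as the
    # margin between the two scores grows.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    # PPO's clipped surrogate objective: `ratio` is new_policy / old_policy
    # probability of the sampled output; clipping keeps a single update
    # from moving the policy too far.
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

A quick check of the intuition: `reward_model_loss(5.0, 0.0)` is smaller than `reward_model_loss(1.0, 0.0)` (a wider preference margin means less loss), and with `ratio=2.0, advantage=1.0` the PPO objective is capped at `1.2 * advantage` rather than `2.0 * advantage`.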

In 2024, with further technological development, inference models emerged, such as the well-known OpenAI o1 and DeepSeek R1 models. The core technology of these models is "Reasoning RL". After receiving a task, such a model first "thinks" for a while, generating a large number of intermediate thinking tokens, and then outputs the final answer. This "thinking" process uses reinforcement learning to let the model independently explore the optimal solution, thereby improving the accuracy of the answer.
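A defining feature of Reasoning RL is that the intermediate thinking tokens earn no reward; only the final answer is scored. A minimal sketch of such an outcome reward, assuming a `<think>...</think>` tag convention of my own choosing (real models use various delimiters):

```python
import re

def outcome_reward(model_output: str, reference_answer: str) -> float:
    # Strip the thinking block (illustrative <think>...</think> tags),
    # then reward 1.0 only if the remaining final answer matches the
    # reference exactly. The thinking tokens themselves are never scored.
    final = re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL).strip()
    return 1.0 if final == reference_answer.strip() else 0.0
```

Because only the outcome is rewarded, the model is free to explore different chains of thought; any reasoning path that reaches the correct final answer is reinforced equally.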

In 2025, "Agent RL" technology emerged in the AI field: agent models based on reinforcement learning. These models can not only think and reason but also call external tools, such as search engines and browsers, and even operate on files in a virtual environment. For example, the Deep Research function of ChatGPT lets users specify a research topic; the AI then calls various tools, collects and organizes information over a long period, and finally generates a detailed report. Products like Manus expand the capabilities of AI further, enabling it to operate on PDF files and edit documents in a virtual environment.

Let's examine the development trend of artificial intelligence (AI). Since 2022, with the arrival of the era of general artificial intelligence, we have witnessed the evolution from inference models to agent models. From a product perspective, this process shows two significant trends.

Take ChatGPT as an example. It can quickly respond to users' simple questions, such as asking for the Chinese or English expression of a word, and give an answer almost instantly. With technological progress, in the era of inference models, users can pose more complex tasks to AI, such as calculating a physics problem. Here the AI will take a minute or even longer to think and finally provide detailed problem-solving steps. In the stage of agent models, the capabilities of AI are further expanded. Users can issue more challenging instructions, such as processing a large number of files or grading homework. For example, we can give 200 sets of homework to AI, and it will complete the grading task in about an hour.

From this perspective, on the one hand, the way users interact with AI has changed. In the ChatGPT era, users needed to provide very detailed and long prompts to clearly describe their needs; in the agent era, what users need to express becomes more concise and abstract. On the other hand, the output of AI has gradually changed from simple text answers to proactive actions, even completing a series of complex tasks autonomously on the computer.

Based on these trends, we can make some predictions about the future. From the perspective of interaction, we hope that future AI will be more convenient, and users won't need to provide too many complex instructions. In terms of AI capabilities, we expect it to take on more tasks and even work 24/7. For example, we can provide more computing resources to AI, allowing it to handle multiple tasks simultaneously and even arrange affairs for users proactively. Ideally, users won't need to give explicit instructions, and AI can complete tasks in advance. In fact, this trend has already emerged in some products. For example, ChatGPT Pulse, launched by OpenAI, marks a major shift from reactive to proactive. Although it currently only pushes some information to users every day, this new proactive response mode means that AI can provide users with more forms of content in advance, such as reports and code. Conceptually, this marks the transformation of AI from requiring explicit user instructions to providing services proactively. I hope that by this time next year, we can see more such proactive agent products.

Looking back at the evolution of AGI products, from the initial dialog-box-style rapid response, to the inference models with a "scratch-paper" function, and then to the agent models with a "virtual computer" (sandbox), the capabilities of AI have been significantly enhanced. It can not only handle complex inputs and call tools but also store and create files in a virtual environment, almost capable of completing all tasks that humans can do through electronic devices. This is a huge improvement.

Of course, we can also use a more abstract example to illustrate. In China, many bosses are used to saying to their subordinates, "Xiao Li, help me get this done." We hope that future AI will also be like this. Users only need to simply say "Help me get this done", and the AI can understand and execute the task. This involves many complex technical challenges. First, human needs are often vague, and it's difficult to clearly express one's intentions. Second, everyone's needs are personalized, which means AI needs to have a high degree of customization ability. Finally, AI needs to have the ability of proactive planning because some tasks may require advance preparation. We look forward to more breakthroughs in these areas in the coming year.

Back to our team, we've always been focused on research and application in the field of reinforcement learning. We've always had a vision: to create excellent agent models, services, and products at the forefront of agent technology through reinforcement learning. This is the core goal of our team and the direction we're constantly pursuing. Therefore, the first thing we want you to believe is that agent technology is crucial.

So, what characteristics should an excellent agent team have? In the era of general artificial intelligence, the characteristics of the team are particularly important. Take OpenAI as an example. Its team's operation mode is impressive. For example, the initial version of ChatGPT was developed as a demo by a small number of people in just one week and then quickly became popular and developed into a complete team. The same is true for the Deep Research project. Several researchers completed a preliminary demo in two weeks, which then attracted wide attention. Another example is the Codex project, where 17 members completed the development in 7 weeks. These examples fully illustrate the characteristics of the AGI era: fast iteration speed and short innovation cycle.

In the AGI era, everything is developing at an amazing speed. It's difficult for us to predict which products will become blockbusters, but we can be sure that teams that can quickly adapt to this rapid iteration are more likely to succeed. The Manus project is a good example. It developed a phenomenal product in just two months. This shows that a good team may need to make some changes in its organizational structure. We hope that the team can fully integrate AI technology and have a complete technology stack, rather than being divided into multiple independent groups. We hope that the team can quickly turn any idea into a prototype because only through rapid iteration and prototype innovation can we stand out in the fierce competition.

2 Why Agents Need RL: The Example of ASearcher

In this part, let's discuss the technology in depth, especially the relationship between agents and reinforcement learning. Some people may ask, "Professor Wu, we all agree that agents are important, and we're all trying to create agents, but what exactly is the role of reinforcement learning in this?" Indeed, there are already many agent frameworks on the market, such as ByteDance's Coze, LangChain, and LangGraph, and even OpenAI has launched its own agent framework. In this context, reinforcement learning may seem redundant, since an agent's workflow can be built through simple drag-and-drop operations. So why do we still need reinforcement learning?

I think the core problem is that the challenges agents face are often very complex and difficult to solve with existing frameworks and rules. In my opinion, there are three main problems that make reinforcement learning indispensable. First, agents need to handle uncertainty and conflicting information. In the real world, conflicting information is everywhere, even within a company. For example, when we search for "Alibaba CTO", we'll find that there are many CTOs in Alibaba Group and its subsidiary Ant Group, but only one is the real group CTO. In this case, the agent needs to collect more information and make judgments to make an accurate decision, rather than simply relying on preset rules.

Second, agents need to have long-term memory and personalization capabilities. Take Meituan Takeaway as an example. A user once said they wanted to eat light food, but in fact, the user didn't like vegetables and wanted to eat light-flavored meat. It's difficult to achieve such personalized needs and long-term memory accumulation through simple rules, because they require the agent to dig out the user's real preferences from a large amount of historical records.

Finally, when facing a large number of tools and model options, agents need the ability to make autonomous decisions. Different large models have their own strengths and weaknesses. For example, the Claude model has a short context window and high cost, while Gemini has a long context window and low cost, but the code it generates is not as smart. Someone on Reddit once shared an interesting case: they let Claude call Gemini to read the code repository and then gave the result to Claude to write code in Cursor, achieving complementary advantages. This shows that, when facing many models and tools, the best practice may be to let the agent explore the optimal calling strategy on its own through reinforcement learning, rather than relying on manually written rules.
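As a toy illustration of "learning a calling strategy instead of hand-writing rules", here is an epsilon-greedy bandit over model/tool choices. The arm names and statistics are invented for the example; a real agent-RL system would learn a far richer policy than this:

```python
import random

def epsilon_greedy_router(success_counts: dict, trial_counts: dict,
                          epsilon: float = 0.1, rng=random):
    # Pick which model/tool to call: with probability epsilon, explore a
    # random arm; otherwise exploit the arm with the best observed
    # success rate. The agent improves its routing from its own outcomes
    # rather than from preset rules.
    arms = list(trial_counts)
    if rng.random() < epsilon:
        return rng.choice(arms)
    def rate(arm):
        return success_counts[arm] / trial_counts[arm] if trial_counts[arm] else 0.0
    return max(arms, key=rate)
```

For instance, if the (hypothetical) observed success rates are 8/10 for a "claude" arm and 3/10 for a "gemini" arm, the router exploits "claude" most of the time while still occasionally re-testing the alternative.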

In addition to these challenges, we also focus on an important trend: Online Reinforcement Learning (Online RL). Recently, Cursor published an article about online reinforcement learning. Although it has an element of showmanship, its view is correct: after a product is launched, continuous iteration through online interaction is the future direction. However, unlike the data flywheel of the recommendation-system era, the data requirements for reinforcement learning are extremely high and difficult to meet, and not every launched service can satisfy them. Nevertheless, we hope that in the future there will be a platform that allows agent models to continuously self-iterate, optimize, and gradually achieve personalization after going online. This is undoubtedly an important trend, but how to achieve it still requires our joint exploration.

How can we solve challenges such as uncertainty, long-term memory, and tool calling in complex tasks through technical means? These problems are particularly prominent in practical applications, and reinforcement learning may provide a unified solution. We hope that, through reinforcement-learning algorithms, agents can explore independently in a given environment and thereby develop strong generalization ability to handle all kinds of complex product problems. Although this may sound a bit abstract, I'd like to use a concrete example to illustrate the challenges and why reinforcement learning is necessary.

In August, our AReaL team released an open-source project called ASearcher, a search-agent project. Its task is very simple: the user asks a question, and it searches the Internet and gives an answer. However, even such a seemingly simple question may hide huge challenges. For example, we once asked the question, "How many gold medals did China win at the London Olympics?" At first glance, this seems like an easy question to answer, and we can find the answer through a simple Internet search. But that's not the case.

At the London Olympics, the Chinese delegation was initially reported to have won 38 gold medals. However, later, due to doping violations by other athletes, the number of China's medals changed. Specifically, in the women's track and field race-walking event, the Chinese team originally won the third, fourth, and fifth places. Since the original gold and silver medalists were disqualified for doping, Qieyang Shijie of the Chinese team was awarded a gold medal 11 years later. Therefore, the final correct answer is that the Chinese team won 39 gold medals.

This example shows that even simple questions may involve complex background information and dynamic changes. If the agent doesn't understand this background and only relies on simple search results, it's likely to draw the wrong conclusion. We tested several products, including DeepSeek, ChatGLM, and ChatGPT. ChatGLM and DeepSeek gave the answer of 38 gold medals, while ChatGPT found the clue of 39 gold medals but still considered 38 the more common answer. Only ChatGPT's Agent mode, once activated, gave the correct answer.

This indicates that developing a professional search product is not easy. If we want to build an agent through a fixed workflow, we may need to build a complex multi-agent system, including search agents, verification agents, knowledge-calling agents, validation agents, and many other modules. Such a system is not only complex but also difficult to maintain and optimize.

However, if we use reinforcement learning, the situation is different. Take ASearcher as an example. It is based on a very simple model with only two tools: search and web-page click. Through reinforcement learning, this model can explore the environment independently and iterate continuously to verify the accuracy of information. In our test, ASearcher found the clue of 39 gold medals in the fifth round of search and finally confirmed the correct answer of 39 gold medals after more than 60 operations. This process demonstrates the powerful exploration and reasoning ability of a reinforcement-learning agent.
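The loop just described (search, click, verify, answer) can be sketched as a minimal agent loop. The function below is my own simplification, not the actual ASearcher code: `search`, `click`, and `decide` are caller-supplied stand-ins, whereas in the real system the trained model plays the role of `decide`.

```python
def search_agent(question, search, click, decide, max_steps=60):
    # Minimal two-tool agent loop: at each step the policy (`decide`)
    # chooses to search, click a page, or commit to a final answer.
    # All gathered evidence is kept so later decisions can cross-check
    # earlier clues (e.g. "38 vs 39 gold medals").
    evidence = []
    for _ in range(max_steps):
        action, arg = decide(question, evidence)
        if action == "search":
            evidence.append(("search", search(arg)))
        elif action == "click":
            evidence.append(("click", click(arg)))
        elif action == "answer":
            return arg, evidence
    return None, evidence  # step budget exhausted without an answer
```

In RL training, a whole rollout of this loop would be scored on whether the final answer is correct, and the policy would be updated to favor action sequences that verify before answering.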

In fact, we found that the 32B model trained by reinforcement learning performs excellently in multiple benchmark tests, with an accuracy improvement of 20% to 30%. In addition, reinforcement learning also endows the model with stronger generalization ability, enabling it to flexibly call different tools during the testing phase and even replace with more powerful