Design, implementation and future development of reinforcement learning AI systems
Reinforcement learning, as a means of further enhancing the intelligence of large language models, has long been the most central and most complex part of the large language model training process. That complexity shows up not only in the algorithms themselves, but also in the demands they place on the overall system.
This article is compiled from the talk "Design, Implementation, and Future Development of Reinforcement Learning AI Systems" given by Cao Yu, an algorithm expert at Alibaba, at AICon 2025 Beijing in June this year. Starting from the traditional RLHF system, the talk traces the current state and evolution of RL systems through concrete algorithmic practice, and uses that practice to discuss the future direction of ultra-large-scale RL with industry practitioners. The content spans both theoretical foundations and industry practice, and closes with the open-source ecosystem and community co-construction.
The following is the transcript of the speech (edited by InfoQ without changing the original meaning).
Today, I'm very glad to share with you some applications of reinforcement learning (RL) in the system design of large language models, along with some preliminary suggestions for its future development.
From RLxF Theory to Engineering
Starting from the theoretical foundations of reinforcement learning algorithms, the demands RL places on engineering are multi-faceted. Today we will focus more on engineering and AI infrastructure (AI Infra), so we will only touch on the basic algorithms briefly. First, the algorithm theory. It looks very abstract and concise, essentially a loop. In reinforcement learning, "Agent" used to mean the reinforcement learning agent; nowadays, when people talk about Agents, they more often mean agents built on large language models. The engineering maturity of reinforcement learning systems and algorithms has allowed large language models and RL to be integrated well: an Agent is now both the carrier of the reinforcement learning algorithm and the carrier of the large language model acting as the action model. In essence, the theory asks that, through continuous interaction between the policy and the environment, the large language model explore the world more efficiently, obtain better rewards, and thereby better meet the environment's objectives. On the policy side, the most crucial piece is the learning algorithm, that is, the reinforcement learning algorithm that guides how the policy updates its gradients and completes tasks better. On the environment side, the most important question is the reward function, that is, how to assign the right reward to a problem so that the model learns genuinely valuable content.
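To make that loop concrete, here is a minimal sketch of the interaction cycle in code. It is not the API of any particular framework; `policy`, `env`, and `reward_fn` are hypothetical placeholders standing in for the policy model, the environment, and the reward function just described.

```python
# Minimal sketch of the RL interaction loop described above.
# `policy`, `env`, and `reward_fn` are hypothetical placeholders,
# not the API of any specific framework.

def train(policy, env, reward_fn, num_iterations):
    for _ in range(num_iterations):
        state = env.reset()                      # e.g. a user prompt or task description
        trajectory = []
        done = False
        while not done:
            action = policy.act(state)           # the LLM generates a response / tool call
            next_state, done = env.step(action)  # the environment reacts (user, code executor, ...)
            reward = reward_fn(state, action)    # how good was this behavior?
            trajectory.append((state, action, reward))
            state = next_state
        policy.update(trajectory)                # the learning algorithm (PPO, GRPO, ...) adjusts the policy
```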
Looking at the seemingly simple and abstract algorithm theory on the left side of the figure below, its execution logic is actually far more complex than it appears. The middle part is the execution-logic diagram of a framework I worked on, Open RLxF. Compared with the algorithm theory on the left, it is clearly more complex, because actual execution involves multiple algorithmic components: the green parts are models in the training state, and the blue parts are models in the inference state. These models interact with and influence each other to carry out the training. This already looks somewhat complicated, but the actual engineering implementation is more complex still. Thanks to Ant Group's AReaL, the actual engineering execution diagram of Open RLxF built on top of it is even more involved. That is the current state of engineering practice.
As for the basic theory itself, we can simply understand the environment as how an Agent interacts with the world. In a chatbot scenario, the environment is the way the large language model interacts with humans; in a coding-agent scenario, the environment is the interaction among the policy network, the code executor, and web tools such as browser use. The environment can be understood as the counterpart of the large language model and the Agent built on it, that is, the thing they interact with. This is a very important concept. In addition, the policy is what we hope to express in the form of an Agent: the agent independently decides what to do next based on the current state (such as the input given by the user and feedback from the environment). This is an important watershed in a model's evolution from a simple chatbot to an Agent: it can independently choose appropriate behaviors and adopt the optimal policy over those behaviors.
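For the coding-agent case, the environment can be as concrete as a sandbox that runs the model's code and reports back. The sketch below is only an illustration of that idea, assuming a pytest-based test command; it is not the setup used in the talk.

```python
import subprocess

class CodeExecutionEnv:
    """Toy environment for a coding agent: the counterpart the policy interacts with.

    The policy's action is a candidate solution file; the environment runs the
    test suite and feeds the result back as the next observation.
    """

    def __init__(self, test_cmd=("pytest", "-q")):
        self.test_cmd = list(test_cmd)

    def reset(self, task_description: str) -> str:
        # The initial observation is simply the task description.
        return task_description

    def step(self, solution_path: str):
        # Run the tests against the proposed solution; in a real system this
        # would happen inside an isolated sandbox.
        result = subprocess.run(self.test_cmd + [solution_path],
                                capture_output=True, text=True, timeout=60)
        passed = result.returncode == 0
        observation = result.stdout + result.stderr  # feedback the agent can read
        done = passed                                # stop once the tests pass
        return observation, done
```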
Once we have the environment and the policy, we still need two important ingredients. The first is the reward function: how to judge the quality of a behavior is a crucial input. Over the past year or two, applying reinforcement learning to large language models has relied heavily on how the reward function is modeled and optimized. From the familiar reinforcement learning from human feedback, to reinforcement learning from constitutional (AI) feedback, and now to reinforcement learning based on verifiable rules, this progression means that the signal sources for the reward function keep broadening and the task difficulty keeps rising. The last ingredient is the algorithm itself, which is what we algorithm researchers really focus on. Today there are many algorithms, such as the well-known PPO, GRPO, and DPO. They concern the policy side: how to update the policy from the history of states, actions, and rewards so that the agent keeps improving. That is the overall picture on the algorithm side.
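As a toy example of the "verifiable rules" end of that spectrum, a math reward can simply check the final answer against a reference. The `\boxed{...}` answer convention below is an assumption chosen for illustration; real verifiers normalize expressions much more carefully.

```python
import re

def verifiable_math_reward(response: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 if the model's final answer matches the reference, else 0.0.

    Assumes the model is prompted to put its final answer in \\boxed{...};
    this is an illustrative convention, not a complete verifier.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0
```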
On the algorithm side, there are established industry practices. In the past we mostly did reinforcement learning from human feedback, and it is one of the main reasons we can gather here today. What actually triggered this wave of the large-model boom was InstructGPT. It used reinforcement learning from human feedback signals to build, on top of the GPT-3 base model, a system that can follow instructions well and understand them. Its main training method is relatively primitive: hand the evaluation and annotation of model-generated responses to humans, and then use another model (instead of humans) to fit the human judgments of quality. That model is itself a large language model. With it, for future prompts and responses we have an approximation of the human feedback signal, and can keep pushing toward the upper limit of the model's ability. This method has its advantages: the model structure is relatively simple, training is relatively stable, and it uses a widely adopted mathematical form, so with enough training data it generalizes to some degree and works well.
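The fitting step described above is typically trained with a pairwise, Bradley-Terry-style loss over human preference pairs. Below is a minimal PyTorch-flavored sketch; the `reward_model` interface (tokenized inputs in, one scalar score per example out) is an assumption, and this only loosely mirrors the InstructGPT-style setup.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_inputs, rejected_inputs):
    """Bradley-Terry-style loss: push the scalar score of the human-preferred
    response above the score of the rejected one.

    `reward_model` is assumed to map tokenized (prompt, response) inputs to a
    scalar score per example.
    """
    r_chosen = reward_model(**chosen_inputs)      # shape: (batch,)
    r_rejected = reward_model(**rejected_inputs)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```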
However, this approach also has disadvantages. Human annotation and feedback cannot be unlimited during training and cannot cover every aspect of human behavior, so the so-called "reward hacking" phenomenon appears: the reward signal gets exploited by the model, producing unintended behavior. Under this premise, industry practice often combines human feedback with machine feedback.
The following screenshot is from DeepSeek's best practice with a generative reward model. Before outputting its quality score, the model first produces a textual explanation of that score. The benefit is that the model does not just score existing response pairs; it also explains why it makes that choice. Because it is a generative model, it generalizes to some degree, and at inference time we can improve its reliability by sampling multiple judgments. In addition, the industry also uses the large language model itself as the reward model, which is more flexible: during evaluation, we can direct it to focus on more specific, fine-grained dimensions, and use those dimensions to meet the special domain-supervision requirements of particular business scenarios. However, the cost of this method is relatively high, because it runs the large language model generatively; compared with a reward model that directly outputs a scalar score, it is more expensive.
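A generative reward model of this kind is often implemented as an LLM-as-judge loop: prompt the judge to explain before it scores, parse the score, and average several sampled judgments. The sketch below assumes a generic `generate` callable and an illustrative 1-10 scale; it is not DeepSeek's actual recipe.

```python
import re
import statistics

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
First explain your reasoning, then output a line 'Score: X' where X is an integer from 1 to 10."""

def generative_reward(generate, question, answer, num_samples=4):
    """LLM-as-judge sketch: the judge writes an explanation before its score,
    and several sampled judgments are averaged for a more stable signal.

    `generate` is an assumed text-generation callable (prompt -> completion).
    """
    scores = []
    for _ in range(num_samples):
        completion = generate(JUDGE_PROMPT.format(question=question, answer=answer))
        match = re.search(r"Score:\s*(\d+)", completion)
        if match:
            scores.append(int(match.group(1)))
    return statistics.mean(scores) if scores else 0.0
```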
Core Algorithms and Breakthroughs
The core of the algorithm part lies in the source of the evaluation signal, that is, how we design the reward function. From a global perspective, the whole system is actually rather complex. The figure shows the full pipeline of the classic, traditional PPO algorithm, covering everything from inference and evaluation to training. The figure below divides it into three parts with two dotted lines, and today's talk revolves around these three parts as well.
First, there is the inference part in the upper-left corner. Inference here means running the inference model: the process in which the large model generates responses from the input prompt. In this process, the main computational load comes from the model's inference engine during the prefill and decoding stages. This is really the process of the model interacting with the environment. So how is training done once the interaction is finished? There is also an evaluation stage in the middle. As briefly introduced, the most traditional way is to use human feedback and learn an approximation of it through a reward model. However, as the talk goes on, you will find that in large language models, and in reinforcement learning especially, the value of the evaluation stage and its share of the time keep growing, because we need a more comprehensive, integrated way to evaluate the model's ability. The evaluation stage in the middle also involves complex interaction and verification with the environment, such as the code executor.
The part on the right is the training process proper. It is actually closer to the pre-training and supervised fine-tuning (SFT) of traditional large models. In traditional SFT and pre-training, all data are prepared offline and statically; for reinforcement learning, all data are generated dynamically by the online inference and evaluation stages. This training process also involves training multiple models at the same time. For the classic PPO algorithm, the first model is our own policy model. It is trained with the PPO loss function, which looks complex but is not actually hard to understand: its main purpose is to limit the step size and magnitude of each update, while updating the policy according to the advantage when the gradient signal is confident. The advantage measures how much better or worse a given behavior is, on average, than the alternatives. Since PPO is built on the Actor-Critic architecture, there is also a Critic model. Once these two models are trained, they are pushed back to the inference model on the left over high-speed interconnects, forming a continuous multi-round interaction loop that lets the model improve its ability online. This is the most traditional RLHF training setup.
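For reference, the step-size-limiting behavior of the PPO loss mentioned above comes from its clipped surrogate objective. Here is a schematic PyTorch version; tensor shapes and the clip range are illustrative only.

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: limit how far a single update can move the policy.

    logprobs_new / logprobs_old: per-token log-probabilities of the sampled
    actions under the current and rollout policies; advantages: estimated
    advantage of each action relative to the average behavior.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)          # policy change on each action
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum bounds the step size when the update is too aggressive.
    return -torch.min(unclipped, clipped).mean()
```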
In subsequent practice, our algorithmic exploration gradually went in two different directions. First, we found that although the PPO algorithm works, the whole system was very complex in the early days of reinforcement learning last year. If the reward comes from a Bradley-Terry (BT) style preference model and the signal source is limited to preference pairs, the objective can be rewritten in another form that avoids training a reward model and dispenses with the Critic function altogether. In some business scenarios this exploration worked quite well. Its advantage is that we can skip reward-model training and advantage estimation, and there is no Critic model to train or run, which makes it convenient to optimize preference quality in specific, narrow business scenarios. Its disadvantages are also obvious. First, its assumption is very strong: the reward must conform to the BT assumption, that is, a good-versus-bad pairwise comparison. In reinforcement learning this assumption is sometimes too strict, because some domains do not need relative comparisons at all; mathematics, for example, has absolute correctness. PPO with a reward model, by contrast, does not rely on such a strong assumption: as long as the reward signal is accurate, it can be used. Moreover, this is an offline algorithm that does not update the model dynamically during training and does not sample new data from the updated model, so it is prone to overfitting. You have probably heard a lot about this algorithm, DPO. But as reinforcement learning frameworks matured, compute grew, and people's understanding of RL deepened, it has gradually faded from the stage.
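For completeness, the preference-pair objective described here (DPO) can be written directly on log-probabilities from the policy and a frozen reference model, with no reward model or Critic. A hedged sketch, assuming each argument is the summed log-probability of a full response:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: optimize preference pairs directly, with a frozen reference model
    standing in for the KL constraint; no reward model or Critic is trained.

    Each argument is the summed log-probability of a full response under the
    policy or the reference model; beta controls how far the policy may drift.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin of the preferred response over the rejected one)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```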
The other exploration is the GRPO algorithm recently applied with great success in DeepSeek R1. It evolves the traditional PPO algorithm, and the main change concerns the Critic model. The problem is this: if a Critic model is used for estimation, the Critic itself needs to have seen a fairly large number of historical trajectories to produce a reasonably unbiased estimate at the algorithmic level; otherwise the policy can be misled by the Critic function, resulting in inaccurate learning. GRPO is very interesting: instead of estimating values with a model, it repeats the rollout multiple times for the same prompt and estimates the advantage from each sample's relationship to the group mean and standard deviation. Evolving PPO in this way did not show much advantage in the RLHF era, but DeepSeek cares more about pure reasoning scenarios such as programming. The greatest strength of this algorithm is precisely in reasoning scenarios, where it quickly sidesteps the training cost of the Critic function and the stability problems of training it. Going forward, the role of the Critic function, that is, the value function in reinforcement learning, remains an open question. Before the era of reinforcement learning on large language models, the value function was very important: AlphaGo, for example, used it to save a great deal of search time. In essence, it trades one kind of computation for another. Suppose the number of rollouts needed here is very large, say not 4 but 16, 32, or even 128; a reasonably accurate value function can replace many of those rollouts with a single estimate. In the future, with multi-round, long-context interactions, the value function may serve better than GRPO-style estimation.
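The group-relative advantage estimate described here can be sketched in a few lines: sample several rollouts of the same prompt, score them, and normalize each reward against the group statistics. A minimal illustration (group size and epsilon are arbitrary choices):

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantage: for a group of responses to the same prompt, score
    each one by how far its reward sits from the group mean, normalized by the
    group standard deviation. No Critic model is needed.

    group_rewards: tensor of shape (group_size,), one scalar reward per rollout.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: rewards for 8 rollouts of one prompt
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]))
```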
Ultra-large-scale RL Systems
From a macro perspective, the speed of change in reinforcement learning far exceeds our imagination. What we have discussed so far focuses mainly on the left part: applying reinforcement learning from human feedback to model safety, helpfulness, and expressiveness. Most of that work was concentrated at the end of 2022. Progress in reinforcement learning is so fast that it can almost be measured in weeks. From RLHF to RLAIF, we can see the scope of reinforcement learning expanding rapidly, from simply aligning with human preferences to pursuing the upper limit of model intelligence, that is, reasoning ability.
Compared with traditional reinforcement learning algorithms, the training of reasoning models does not change the algorithm itself very much, but the system architecture and the training domains have changed significantly. Take DeepSeek's success during this year's Spring Festival as an example: they used the GRPO algorithm and increased the compute invested in verifiable domains, achieving a significant jump in intelligence. This year we have seen many large models score on the college entrance examination (gaokao) at a level approaching that of students admitted to 985 universities. Remember that last year, large models couldn't even tell which is larger, 9.8 or 9.1. What happened in this one year? The evolution of reinforcement learning, together with the simultaneous improvement of the base models, played an important role.
In the next stage, with the enhancement of