Why has reinforcement learning become so popular in Silicon Valley? It's a crucial step towards AGI.
RL (reinforcement learning) topped the trending charts during the AlphaGo era, then faded into obscurity for years amid the wave of large models. Now, whether in the technical architecture of AI Agents or in model pre-training, building systems around reinforcement learning has become a mainstream trend in Silicon Valley, and top-tier reinforcement learning talent is increasingly sought after by Silicon Valley giants and investors.
In this episode of "Silicon Valley 101", host Hong Jun continues the conversation with Zhu Zheqing, founder of Pokee.ai and former head of the Meta AI Applied Reinforcement Learning team. We discuss:
1. What are the latest technical directions in model evolution and Agent commercialization?
2. The business logic behind Meta's acquisition of Scale AI
3. Where Silicon Valley's top-tier reinforcement learning talent is concentrated
The following is an edited selection from the conversation:
01 Reinforcement Learning and the Five Levels of AGI: Where Is the Real Divide?
Hong Jun: I noticed that in OpenAI's recent press conference they also mentioned using RL (reinforcement learning) as the underlying architecture. I know the RL architecture is also your forte. Could you briefly explain its advantages and the scenarios it is suited to?
(Image source: Scribbr)
Zhu Zheqing: I should briefly explain that there are many types of RL architectures. Some are completely centered on LLMs (large language models) and operate on tokens, and then there is our kind of reinforcement learning, which operates on whole actions, meaning I want the Agent to stop using language tokens as its unit of decision-making.
There is no clear-cut boundary between the two decision-making approaches, and their use cases are quite different. Broadly, the reason for using an RL framework to train an Agent is that there are goals. That is true whether it is Deep Research, which may only need to search for as much relevant information as possible through token-by-token generation and then produce a full report, or an agentic system like Pokee, where a tool may itself be something tokenized and I solve a problem by combining many tools. It is all goal-driven. A significant difference from previous LLM training is that LLM training can be completed with large amounts of supervised data through auto-regressive training, whereas that is difficult for an agentic system. Deep Research can still use some of it, but for tool calling, a single tool call can be learned from data, while a whole tool chain is very hard to train auto-regressively. For example, if a task requires 50 tool calls and I want to train on that data, no one has ever seen such data, and it cannot be collected from the Internet because no one has ever generated it. So if you insist on using that kind of data, you can only rely on manual annotation.
(Image source: Technology Bar)
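To make the training problem concrete, here is a minimal, hypothetical sketch of the kind of goal-driven loop described above: there is no step-by-step supervision for the tool chain, only a sparse reward for whether the final goal was reached, and a REINFORCE-style update spreads that single signal across the whole trajectory. The environment, tool names, and update rule are illustrative stand-ins, not Pokee's actual system.

```python
import math
import random

TOOLS = ["search", "read", "summarize", "send_email"]
GOAL_CHAIN = ["search", "read", "summarize", "send_email"]  # hypothetical "correct" chain

class ToyToolEnv:
    """Toy environment: reward arrives only at the end of the whole tool chain."""
    def __init__(self):
        self.history = []

    def step(self, tool):
        self.history.append(tool)
        done = len(self.history) == len(GOAL_CHAIN)
        # Sparse, goal-driven reward: no per-step label, only "did the chain reach the goal?"
        reward = 1.0 if done and self.history == GOAL_CHAIN else 0.0
        return reward, done

# Tabular "policy": a preference score for each (step, tool) pair.
prefs = {(step, tool): 0.0 for step in range(len(GOAL_CHAIN)) for tool in TOOLS}

def sample_tool(step):
    weights = [math.exp(prefs[(step, tool)]) for tool in TOOLS]
    return random.choices(TOOLS, weights=weights)[0]

for episode in range(3000):
    env, trajectory = ToyToolEnv(), []
    reward = 0.0
    for step in range(len(GOAL_CHAIN)):
        tool = sample_tool(step)
        trajectory.append((step, tool))
        reward, done = env.step(tool)
    # REINFORCE-style update: every decision in the chain shares the final reward.
    for step, tool in trajectory:
        prefs[(step, tool)] += 0.1 * (reward - 0.05)

# After enough episodes, the first choice should drift toward "search".
print(max(TOOLS, key=lambda tool: prefs[(0, tool)]))
```

The point of the toy is that nothing in this loop requires a dataset of annotated 50-call trajectories: the agent generates its own chains and learns only from the end-of-episode signal.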
Hong Jun: Which tasks do you think are easier to handle with supervised fine-tuning, and which must be done with RL (reinforcement learning)? The tasks they target also seem quite different.
Zhu Zheqing: Yes. The current consensus is that for the large bodies of text, video, and image data that already exist in the world, tasks with plenty of annotated data can generally reach a high level through supervised learning, and post-training with RLHF (Reinforcement Learning from Human Feedback) then pushes the capability further so the model better matches the preferences of most humans. The reason for doing this is that the large supervised corpus is a mix of good and bad; not every data point is something humans like. Training on it generalizes over all of that Internet data, and the next step is to fine-tune the model on human preferences so its outputs better fit the patterns people prefer. That is the purpose of RLHF.
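For reference, the RLHF post-training step described here is commonly written as optimizing the policy against a reward model learned from human preference comparisons, with a KL penalty keeping it close to the supervised model; this is the standard InstructGPT-style formulation, not anything specific to the speakers:

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x)\big)
$$

Here $r_\phi$ is the learned preference reward model, $\pi_{\mathrm{SFT}}$ is the supervised fine-tuned model, and $\beta$ trades off chasing the preference reward against staying close to the patterns learned from supervised data.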
So, why is there even talk of RL pretraining now? The reason is that many tasks are goal-driven.
Hong Jun: Which companies are doing RL pretraining?
Zhu Zheqing: Currently only research groups are doing RL pretraining, although we have actually started doing something similar. There is still some prior knowledge that pretraining alone cannot provide, and this approach essentially skips many of the intermediate training stages.
What problems is a training mechanism centered on reinforcement learning meant to solve? Many tasks are goal-driven: writing code, mathematics, physics, certain work in financial institutions, urban planning, operations research, supply chains. They all have clear goals, and the world mechanism is well defined: if a happens, b follows. In that case pretraining becomes less necessary. First, in most of these professional, goal-driven scenarios there is hardly any data. Mathematics and code are the only two areas with relatively many data points; for the others I mentioned there is little data, and it is hard to collect enough from the Internet to complete this kind of training.
Second, the problems it needs to solve are, in essence, ones that have never been seen. Most of the written data that exists covers common coding problems and common math problems; truly deep and difficult mathematical problems have never appeared in it. So the process has to be counter-factual: I need to generate outputs, such as code, mathematics, or physical plans, that have never appeared anywhere, then rely on a ground-truth validator to tell me whether I got it right, and train on that signal. This approach is best suited to use cases where there is a ground truth and judgments can be made accurately, and you then optimize against it. That is where RL (reinforcement learning) shines most. In fact, much of the discussion online says the biggest problem now is verification: if a good verifier can be found, the problem is essentially solved, because optimizing against that verifier can be done with RL.

Next, let me add something that may be a bit non-consensus. Beyond the verifier, the thing we may most need to accomplish next is improving the generalization of the model, or of the verification mechanism itself, along the verification axis: when the Agent's output deviates from anything people have actually seen, how does the verifier adapt to that new output so it can still verify it well? If someone achieves this, we may truly step onto the path toward superintelligence, because the knowledge produced may go beyond what humans possess.
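As a toy illustration of "optimize against a verifier," the sketch below corrects a deliberately biased answer generator using nothing but a ground-truth check. The arithmetic task, the bias parameter, and the update rule are invented for illustration; they stand in for a real policy, a real verifier (unit tests, a proof checker, a simulator), and a real policy-gradient update.

```python
import random

def verifier(problem, answer):
    """Ground-truth check for a toy task: is the proposed answer exactly right?"""
    a, b = problem
    return answer == a + b

# The "policy" is just an integer bias. It starts systematically wrong, and the
# only training signal it ever receives is whether the verifier accepted an answer.
bias = 3
for step in range(300):
    problem = (random.randint(0, 9), random.randint(0, 9))
    a, b = problem
    # Sample several candidate answers around the current (wrong) policy.
    candidates = [a + b + bias + random.randint(-3, 3) for _ in range(8)]
    accepted = [c for c in candidates if verifier(problem, c)]
    if accepted:
        # Policy-gradient-flavored update: nudge the policy toward what was verified.
        delta = accepted[0] - (a + b + bias)
        bias += (delta > 0) - (delta < 0)

print("bias after training:", bias)  # drifts to 0 using only verifier feedback
```

The essential property is the one described above: the loop never needs labeled solutions, only a reliable way to check whether a proposed output is correct.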
Hong Jun: If this can be achieved, can it solve the hallucination problem?
Zhu Zheqing: I think hallucination is a separate issue, and this approach is indeed prone to it. It is like when we watched AlphaZero (a general reinforcement-learning algorithm developed by DeepMind) defeat humans: some of its moves were beyond human imagination. Through the same mechanism it may even become possible to discover new physical laws and knowledge beyond what humans possess. That may be the key to truly moving toward superintelligence, but there has been no good breakthrough yet.
Hong Jun: Yes. What you said reminds me of OpenAI's five-level classification of AGI (artificial general intelligence). It came to light because, during this round of the power struggle between OpenAI and Microsoft, an agreement they had previously signed with Microsoft was exposed. I think the whole path is moving in the direction you describe. The first level of AGI is the chatbot, like ChatGPT's conversational AI.
(Image source: OpenAI)
The second is reasoning AI, a direction we also saw last year. The third is the AI Agent: an AI that can not only think but also act in a human's place, performing multi-step autonomous operations to complete a series of tasks, such as booking flights and hotels for a trip. This year seems to be moving in that direction. The fourth level is innovative AI, the "innovators": it needs creative thinking and must be able to independently invent new tools or solutions, for example discovering a new molecule in drug discovery. At that point AI can already propose solutions humans have never thought of and find innovative answers on its own. As you just said, if such a mechanism exists, can AI surpass humans on creative problems and propose solutions humans have not thought of? The fifth level is organizational, superhuman-level AI: it can independently take on all the responsibilities of an organization, far exceeding ordinary people, a bit like "super AGI".
Zhu Zheqing: I have to say their definition of AI capability is really about product capability rather than technical capability. In a sense there is no huge leap between the second and third levels; it depends on how you define the first level, because a chatbot can be very ordinary or as capable as the ones we see now. I also don't think there is a big gap between the fourth and fifth levels. The main gap is between the third and fourth levels, and the core reason is verification ability, which is very hard to cross. Look at it from a human perspective, because human learning is very similar to RL (reinforcement learning). When you were a child learning something, the things you could judge were all within your existing knowledge. If you had learned addition, you could judge what 1 + 1 or 2 + 2 equals, but you could not directly generalize to judging what 3 - 2 equals. That reasoning step cannot be achieved purely from internal knowledge. The things we call verifiable, such as reinforcement and fine-tuning, are all knowledge iterations that can be completed within an internal verification system: if there is a fixed verification, you can keep improving against it, or if I preset a certain amount of verification knowledge for you, you can keep improving on that basis. But if an agent can do 20-digit addition yet has never seen subtraction, it still cannot verify whether a subtraction is correct.
Hong Jun: I think it's the same for humans, haha. Suppose I studied mathematics and have never studied biology. With only my knowledge of mathematics, and without knowing the underlying logic of biology, it is also hard for me to generalize to it.
Zhu Zheqing: Yes. So, the two most difficult aspects are:
1. How can the verifier get from a to b based only on a simple human description, such as the relationship between subtraction and addition? If this can be achieved, the generalization ability of the Agent's verification will reach the next level.
2. Can it, through self-exploration and grounded in existing knowledge, extend verification to future knowledge? This is also very difficult. For example, if you already know that acids react with carbonates to produce carbon dioxide, can you form a basic understanding of the properties of carbon dioxide and use it to verify future problems involving carbon dioxide? That is extremely hard. And if similar Agents produce new results in the future, can we verify whether those results are correct? That is also very, very difficult.
Hong Jun: So, in the five levels of AGI, the transition from the third level, agent-type AI, to innovative AI may be the point where AI crosses from below human level to above the average human, or even above the best humans.
Zhu Zheqing: Yes. So the gap between the third and fourth levels is much larger than the gaps among the first, second, and third levels, or between the fourth and fifth. One thing about the fifth level may be quite delicate: will there be politics among Agents, as there is among humans? Because if Agents are decentralized, their objectives may be misaligned, and politics may emerge in a decentralized multi-agent system.
Hong Jun: Do you mean politics among humans, like office political struggles?
Zhu Zheqing: Yes, but the situation will be completely different in an Agent environment, because their objectives will conflict with each other. Once there is a conflict, things get stuck, a bit like a race condition in a computer system, and it will simply lock up.
Hong Jun: The paper-clip problem.
Zhu Zheqing: Yes, a similar situation may occur. However, the gap between the first, second, and third levels and the fourth level is a huge chasm. If someone can solve it, it will be a very significant breakthrough.
Hong Jun: Are there any large companies trying to solve these problems along the path you mentioned, such as using RL (Reinforcement Learning) for pretraining and improving the generalization of the verification mechanism?
Zhu Zheqing: There hasn't been any significant breakthrough in the generalization of the verification mechanism. Currently, Human Knowledge Distillation is used to improve the verification ability.
There are indeed many people talking about reinforcement-learning pretraining, but it has a fatal weakness. Since RL is a completely counter-factual learning process, an unavoidable question is whether it will produce solutions that solve the problem but are incomprehensible to humans. For example, we can verify a piece of code by its inputs and outputs. An Agent then writes code that actually runs, but every operator in it is incomprehensible to you: the variable definitions are gibberish, and the addition, subtraction, multiplication, and division are written in some very convoluted, low-level way and forced into the original code. Humans cannot read it, but it runs. So the reward definition is very important. What about human readability, for example? But human readability cannot be captured by a rule, so it becomes unverifiable.
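As a hedged illustration of that reward-design problem, the sketch below combines a verifiable correctness term (run the input/output tests) with a crude readability proxy. The proxy is exactly the part described as impossible to capture with a rule, so here it is only a placeholder (penalizing one-letter identifiers) for what would really need a learned or human judge. The `solve` entry point, the test format, and the weight are assumptions made for the example.

```python
import ast

def correctness_reward(source: str, tests) -> float:
    """Verifiable part: 1.0 if the generated code passes every input/output test."""
    namespace = {}
    try:
        exec(source, namespace)      # run the agent-written code
        fn = namespace["solve"]      # assumed entry-point name, for illustration only
        return float(all(fn(x) == y for x, y in tests))
    except Exception:
        return 0.0

def readability_proxy(source: str) -> float:
    """Unverifiable part, crudely approximated: penalize one-letter identifier names."""
    tree = ast.parse(source)
    names = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]
    short = sum(1 for name in names if len(name) == 1)
    return max(0.0, 1.0 - 0.2 * short)

def total_reward(source: str, tests, readability_weight: float = 0.3) -> float:
    # With a weight of 0, the agent is free to emit unreadable-but-passing code.
    return correctness_reward(source, tests) + readability_weight * readability_proxy(source)

tests = [(2, 4), (5, 10)]
readable = "def solve(value):\n    # double the input\n    return value * 2\n"
obfuscated = "def solve(v):\n    return (lambda z: z + z)(v)\n"
print(total_reward(readable, tests), total_reward(obfuscated, tests))
```

Both snippets pass the tests, so a correctness-only reward cannot tell them apart; only the (inevitably imperfect) readability term does.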
Hong Jun: It sounds like the world is quite dangerous. I can roughly understand why Geoffrey Hinton regrets creating the underlying technology of AI. When AI can write code beyond human knowledge, in a language unknown to humans, that is quite dangerous.
Zhu Zheqing: Richard S. Sutton should regret it more, because the neural networks Geoffrey Hinton helped create are better at representing human knowledge, while counter-factual knowledge discovery, or policy discovery, still depends on RL (reinforcement learning). If we end up talking about regulation, I think some regulatory effort may be needed around reward design: the incentives given to an Agent during training may determine what the trained Agent turns out to be.
Hong Jun: Yeah. When we compared reinforcement learning with SFT (supervised fine-tuning), I also heard a claim, which I mentioned in a previous episode, that reinforcement learning can be twice as effective as SFT but may consume ten times as many tokens. For current commercialization and applications, the cost-effectiveness doesn't seem to work out. What do you think?
Zhu Zheqing: Yes, that is normal, because with reinforcement fine-tuning I only have a reward function and no other information, yet I still have to reach the goal, whereas SFT means I already have the standard answer and just need to get closer to it. Inevitably, the cost of RL fine-tuning is higher. But in the