The father of chain-of-thought reasoning didn't switch to Meta just for $100 million, and he tipped his hand before leaving OpenAI.
Did Jason Wei leave OpenAI only for Meta's sky-high pay? The blog posts he published before departing reveal the secret: the future of AI is even more promising!
The talent war in Silicon Valley is heating up!
In the past, OpenAI lured talent away from companies like Google. Now Meta is simply throwing money around to poach people.
Compensation packages for top AI talent are astronomical, and $100 million is merely Zuckerberg's opening bid!
Jason Wei, the Chinese-American AI scientist known as the father of chain-of-thought, previously jumped from Google to OpenAI and has now moved on to Meta.
Jason Wei has been remarkably prolific in AI.
According to Google Scholar, he has 13 papers with more than 1,000 citations each. His collaborators include well-known AI researchers such as Jeff Dean and Quoc V. Le, and at OpenAI he contributed to projects including GPT-4, GPT-4o, o1, and Deep Research.
Before the media reported his departure, he published two blog posts that may hint at why he chose to leave.
Surprisingly, both draw their inspiration from reinforcement learning!
Life Lessons from RL: Everyone Has Their Own Talent
Over the past year, he has been obsessively studying reinforcement learning and thinking about it almost all the time.
RL has a core principle: always try to stay "on-policy". Rather than imitating other people's successful paths, it is better to act, get feedback from the environment yourself, and keep learning.
Of course, imitation learning is essential at the start. Just as when we first train a model, we have to rely on human demonstrations to reach baseline performance. But once the model can produce reasonable behavior, imitation is usually abandoned, because the only way to maximize the model's own strengths is to let it learn from its own experience.
A typical example: training a language model to solve math problems with RL yields better results than supervised fine-tuning on human-written chains of thought.
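As a rough illustration of that contrast, here is a minimal toy sketch in plain NumPy (not Jason Wei's setup; all numbers are invented): an imitator can at best copy the teacher's mix of strategies, while an on-policy learner discovers, from its own rewards, the strategy it happens to be best at.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (numbers invented for illustration): three "strategies" for a task.
# The teacher mostly uses strategy 0; this particular learner happens to be
# strongest at strategy 2.
teacher_policy = np.array([0.8, 0.1, 0.1])     # how often the teacher picks each strategy
learner_success = np.array([0.5, 0.3, 0.9])    # the learner's own success rate per strategy


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


# --- Imitation (learning from demonstrations): copy the teacher's distribution.
# The best the imitator can do is reproduce the teacher's mix of strategies.
imitation_return = float(teacher_policy @ learner_success)

# --- On-policy RL (REINFORCE): act, get a reward from the environment, and
# reinforce whatever worked for *this* learner.
logits = np.zeros(3)
lr = 0.5
for _ in range(2000):
    probs = softmax(logits)
    action = rng.choice(3, p=probs)                          # act with the current policy
    reward = float(rng.random() < learner_success[action])   # feedback from the environment
    grad = -probs
    grad[action] += 1.0                                      # grad of log pi(action) w.r.t. logits
    logits += lr * reward * grad                             # reinforce rewarded actions

rl_return = float(softmax(logits) @ learner_success)
print(f"imitation return ~ {imitation_return:.2f}, on-policy return ~ {rl_return:.2f}")
```

Run long enough, the on-policy learner concentrates on the strategy with the highest success rate for itself, which is exactly the point of the analogy.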
The same goes for life.
We grow up by imitating at first. School is that stage, and reasonably so.
We study how others succeeded and copy them. Sometimes it works, but eventually we realize that imitation can never surpass the original, because everyone has their own unique strengths.
Reinforcement learning tells us that if we want to surpass our predecessors, we must blaze our own trail, accept the risks the environment throws at us, and embrace the rewards it may bring.
He cited two relatively niche habits that he particularly enjoys:
- Read a large amount of raw data.
- Run ablation experiments to take a system apart and see what each component contributes on its own.
Once, while collecting a dataset, he spent several days reading every example and writing personalized feedback for each annotator. Data quality soared, and he gained unique insight into the task.
At the beginning of this year, he spent an entire month ablating, one by one, the seemingly arbitrary decisions in his past research. It cost a lot of time, but he worked out which kinds of RL actually work and accumulated hard-won experience that no one else could have taught him.
More importantly, doing research driven by his own interests not only makes him happier; he also feels he is building a more distinctive, more personal research direction.
So, to sum up: imitation matters, and it is the necessary first step. But once you find your footing and want to surpass others, you have to go on-policy, as in reinforcement learning: follow your own rhythm and play to your own unique strengths and weaknesses 😄
The Future of AI
Asymmetry of verification means that for some tasks, verifying a solution is far easier than producing one.
With the recent breakthroughs in reinforcement learning (RL), this is becoming one of the most important ideas in AI.
Upon closer inspection, verification asymmetry is everywhere:
- Sudoku and crossword puzzles: Solving a Sudoku or crossword is time-consuming because you have to try many possibilities to satisfy the constraints, but verifying a completed answer is trivial: just check that it follows the rules (see the sketch after this list).
- Website development: Building a site like Instagram takes an engineering team years, but an ordinary person can verify that the site basically works in a few minutes by browsing the pages and checking that the features respond.
- BrowseComp problems: Solving a problem of this type usually means browsing hundreds of websites, but verifying a given answer is much faster, because you can directly check whether it satisfies the stated constraints.
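To make the Sudoku case concrete, here is a minimal, self-contained checker (a generic sketch, not tied to any particular solver): verification touches each cell once, whereas producing a solution from a sparse puzzle generally requires backtracking search.

```python
def is_valid_sudoku(grid):
    """Check a completed 9x9 grid: every row, column, and 3x3 box
    must contain the digits 1..9 exactly once. O(81) work, no search."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6) for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)

# A solved grid: verifying it is instant, while producing it from a
# sparse puzzle requires search.
solution = [
    [5, 3, 4, 6, 7, 8, 9, 1, 2],
    [6, 7, 2, 1, 9, 5, 3, 4, 8],
    [1, 9, 8, 3, 4, 2, 5, 6, 7],
    [8, 5, 9, 7, 6, 1, 4, 2, 3],
    [4, 2, 6, 8, 5, 3, 7, 9, 1],
    [7, 1, 3, 9, 2, 4, 8, 5, 6],
    [9, 6, 1, 5, 3, 7, 2, 8, 4],
    [2, 8, 7, 4, 1, 9, 6, 3, 5],
    [3, 4, 5, 2, 8, 6, 1, 7, 9],
]
print(is_valid_sudoku(solution))  # True
```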
For some tasks, the time for verification is comparable to that for solving. For example:
- Verifying the sum of two 900-digit numbers takes about as long as computing it yourself.
- Verifying that some data-processing code is correct can take about as long as writing the code yourself.
For some tasks, verification takes more time than solving. For example:
- Checking every fact in an article can take longer than writing the article itself (compare Brandolini's law: the energy needed to refute misinformation is an order of magnitude greater than the energy needed to produce it).
- Proposing a new diet takes one sentence, "only eat bison and broccoli", but verifying whether it is healthy for the general population requires large-scale experiments over many years.
Some preparation up front can make verification easier. For example:
- Math competition problems: If the answer or the key steps of the solution are available, checking whether a proposed answer is correct is simple.
- Programming problems: Verifying correctness by reading the code is tedious, but with enough test cases you can check any candidate solution in seconds; this is in fact how LeetCode does it (a minimal harness of this kind is sketched after this list). For some tasks, verification can be improved but never made truly cheap.
- Partial improvement: For the task of "naming Dutch football players", for example, a prepared list speeds verification up enormously, but obscure names still have to be checked by hand.
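Here is a minimal sketch of test-case-based verification in the spirit of what the text attributes to LeetCode (the platform's real judging infrastructure is far more elaborate; the task, function names, and test cases below are invented for illustration):

```python
def verify(candidate, test_cases):
    """Run the candidate against every test case; a wrong answer or a crash fails it."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


# Hypothetical task: return the k-th smallest element of a list.
test_cases = [
    (([3, 1, 2], 1), 1),
    (([3, 1, 2], 3), 3),
    (([5, 5, 4], 2), 5),
]

def candidate_a(nums, k):   # correct solution
    return sorted(nums)[k - 1]

def candidate_b(nums, k):   # buggy solution (off-by-one)
    return sorted(nums)[k]

print(verify(candidate_a, test_cases))  # True
print(verify(candidate_b, test_cases))  # False
```

Judging either candidate by reading it would take minutes; running the tests takes milliseconds.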
Why is verification asymmetry so important?
The history of deep learning proves that whatever can be measured can be optimized.
In the RL framework, the ability to verify a task is equivalent to the ability to build an RL environment for it. Hence the Verifier's Law:
The ease of training AI to solve a task is proportional to how verifiable the task is. Every task that is solvable and easy to verify will eventually be solved by AI.
Specifically, how easily AI can be trained on a task depends on whether the task meets the following conditions:
- Objective truth: everyone has a consensus on what a "good answer" is.
- Fast verification: verifying an answer takes only a few seconds.
- Scalable verification: many answers can be verified simultaneously.
- Low noise: the verification result is highly correlated with answer quality.
- Continuous reward: the quality of multiple answers can be ranked.
Over the past decade, mainstream AI benchmarks have met the first four conditions, which is why they were conquered first. Most of them do not meet the fifth condition (each answer is judged pass or fail), but a continuous reward signal can still be constructed by averaging over many samples.
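A minimal sketch of that sample-averaging trick (toy "models" and numbers invented for illustration): a strictly pass/fail verifier, averaged over many sampled answers, yields a score in [0, 1] that can be compared and ranked.

```python
import random

random.seed(0)

def binary_verifier(answer: str, target: str) -> float:
    """Black-and-white check: 1.0 if the answer is exactly right, else 0.0."""
    return 1.0 if answer == target else 0.0

def sampled_score(model, prompt: str, target: str, k: int = 64) -> float:
    """Average the 0/1 verdict over k sampled answers.
    The mean lies in [0, 1], so models can be ranked even though
    each individual answer is only judged pass/fail."""
    return sum(binary_verifier(model(prompt), target) for _ in range(k)) / k

# Two toy "models" that answer a fixed question correctly at different rates.
def weak_model(prompt: str) -> str:
    return "42" if random.random() < 0.3 else "41"

def strong_model(prompt: str) -> str:
    return "42" if random.random() < 0.8 else "41"

print(sampled_score(weak_model, "6 * 7 = ?", "42"))    # roughly 0.3
print(sampled_score(strong_model, "6 * 7 = ?", "42"))  # roughly 0.8
```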
Why is verifiability important?
The fundamental reason is that when these conditions are met, every gradient step gives the neural network a rich, informative signal, so the iteration flywheel can spin at high speed. This is also the secret behind why the digital world progresses so much faster than the physical world.
The Case of AlphaEvolve
Google's AlphaEvolve is the ultimate form of the conjecture-and-verify paradigm.
Take the problem of finding the smallest enclosing hexagon that can hold 11 unit hexagons:
- It fits all five conditions of the Verifier's Law perfectly.
- Although this looks like "overfitting" to a single problem, scientific innovation is exactly this kind of extreme optimization in which the training set equals the test set, because each individual problem worth solving can carry enormous value.
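The shape of that loop can be sketched in a few lines. This is not AlphaEvolve itself; it only shows the bare propose, verify, keep-the-best cycle on a toy objective, made possible by a verifier cheap enough to call thousands of times.

```python
import random

random.seed(0)

def verify(candidate):
    """Cheap, objective scorer. Toy objective: count the 1s in a bit string."""
    return sum(candidate)

def propose(parent):
    """Conjecture a new candidate by mutating the current best."""
    child = parent[:]
    i = random.randrange(len(child))
    child[i] ^= 1
    return child

best = [0] * 32
best_score = verify(best)
for _ in range(5000):
    cand = propose(best)
    score = verify(cand)        # verification is instant, so iteration is cheap
    if score > best_score:      # keep only candidates the verifier prefers
        best, best_score = cand, score

print(best_score)  # quickly reaches 32, the optimum for this toy objective
```

In AlphaEvolve's case the proposer is an LLM that rewrites candidate programs and the verifier is an automated evaluator for the problem at hand; the cheap, objective scorer is what lets the loop run at scale.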
After understanding this principle, one realizes that verification asymmetry is everywhere, just like air.
Imagine a world where all measurable problems will eventually be solved.
The frontier of intelligence will be jagged: on verifiable tasks, AI will be unbeatable, because those domains are the easiest to tame.
How can such a future scenario not be fascinating?
References
https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law
https://www.jasonwei.net/blog/life-lessons-from-reinforcement-learning
This article is from the WeChat official account "New Intelligence Yuan" (author: KingHZ) and is published by 36Kr with authorization.