Can humans control AI? Anthropic ran an experiment using Qwen.
If one day AI becomes smarter than humans, what on earth should we humans do?
If they turn around and try to wipe us out, how can we resist?
Similar questions have been discussed in countless science-fiction movies, but those discussions stayed in the realm of literature, art, and philosophy.
Now, Anthropic has run a serious experiment to test whether we can supervise AI that is smarter than ourselves.
The experimental results are interesting, but the process is even more fascinating, because Anthropic used two different Qwen models to stand in for humans and for AI smarter than humans, respectively.
The result is that we humans might actually be able to control super AI!
01 What is this paper all about?
This research is titled "Automated Alignment Researchers".
The problem it aims to solve is very real: when AI becomes smarter than humans, how do we ensure that it still follows human instructions?
Current models can already generate a large amount of code. In the future, they will be able to generate millions of lines of complex code that humans simply cannot understand. How can we review such code?
This is the "scalable supervision" problem that has been studied in the field of AI safety.
The entry point of Anthropic's research this time is called "weak supervision of strong models".
We can understand the concept with an analogy. Suppose you are an elementary school teacher who has to teach a genius high-school student. Your knowledge is limited; the student already knows more than you.
So the question is: what level can the student ultimately reach? Will they stop at your level, or break through your limitations and realize their true abilities?
In the experiment, Anthropic used a small model to play the role of the "weak teacher" and a stronger model to play the role of the "strong student".
Specifically, they used Qwen1.5-0.5B-Chat as the teacher and Qwen3-4B-Base as the student, let the weak model provide training signals to the strong model, and observed whether the strong model could learn from these imperfect signals to perform close to its ideal level.
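The core mechanism can be illustrated with a toy stand-in. This is not the paper's actual training code: the data, the "weak teacher" (a noisy labeler), and the "strong student" (a small logistic-regression model in pure NumPy) are all invented for illustration. The point is that a student trained only on a teacher's imperfect labels can still end up more accurate than the teacher.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary task: the true label is the sign of a linear score.
X = rng.normal(size=(3000, 2))
y_true = (X @ np.array([2.0, -1.0]) > 0).astype(float)

# "Weak teacher": knows the true rule but flips 25% of its labels,
# standing in for an imperfect small model (or, in the analogy, humans).
flip = rng.random(len(y_true)) < 0.25
y_weak = np.where(flip, 1 - y_true, y_true)

def train_logreg(X, y, lr=0.5, steps=500):
    """Fit a bias-free logistic regression by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# "Strong student": trained ONLY on the teacher's noisy labels.
train, test = slice(0, 2000), slice(2000, 3000)
w = train_logreg(X[train], y_weak[train])
pred = (X[test] @ w > 0).astype(float)

acc_weak = (y_weak[test] == y_true[test]).mean()   # teacher's own accuracy
acc_student = (pred == y_true[test]).mean()        # weakly supervised student
print(f"weak teacher: {acc_weak:.2f}, strong student: {acc_student:.2f}")
```

Because the teacher's errors here are random noise around a simple underlying rule, the student averages them out and beats its own teacher. Whether and how much that happens for real language models is exactly what the paper's PGR metric measures.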
The measurement standard is called PGR, short for Performance Gap Recovered, i.e., the degree to which the performance gap is recovered.
If PGR is 0, it means the strong model has only learned to the level of the weak teacher, wasting its own knowledge. If PGR is 1, it means the strong model has completely broken through the limitations of the weak teacher.
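In code, PGR is just a ratio. The numbers below are hypothetical, purely to show the arithmetic:

```python
def pgr(weak_acc: float, w2s_acc: float, ceiling_acc: float) -> float:
    """Performance Gap Recovered: how much of the gap between the weak
    teacher's score and the strong model's ceiling the weakly supervised
    student recovered. 0 = stuck at the teacher's level, 1 = full ceiling."""
    return (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)

# Hypothetical scores: the teacher gets 0.60, the weakly supervised
# student gets 0.80, and the same student trained on ground truth gets 1.00.
print(pgr(0.60, 0.80, 1.00))  # half of the gap recovered
```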
The weak model is equivalent to future humans, and the strong model is equivalent to future AI that surpasses humans.
If we can find a way today to prevent the strong model from being dragged down by weak supervision, then in the future, we can use a similar method to supervise super AI.
After human researchers established the baseline, Anthropic sent in 9 agents based on Claude Opus 4.6 to fine-tune the student Qwen. These 9 agents are called Automated Alignment Researchers, abbreviated AAR.
Anthropic equipped them with a sandbox environment, code tools, a shared forum, and a scoring server, and then told them: "Go ahead and figure out on your own how to improve the PGR of Qwen3-4B-Base."
These AI researchers can propose hypotheses, write code, train models, submit experiments, check scores, analyze the reasons for failure, communicate their findings with other AI researchers, and continue to iterate.
The results are quite astonishing.
Human researchers spent 7 days adjusting 4 existing methods, and the best PGR was only 0.23. Then 9 Claudes took over and continued for 5 days, with a cumulative research time of about 800 hours. Finally, they pushed the PGR to 0.97.
In other words, we are likely to be able to control super AI that is smarter than us.
The total cost of this experiment is about $18,000, including the computational cost of API calls and model training, which is equivalent to about $22 per hour of work for each AI researcher.
What does this mean? The hourly wage of an assistant store manager in a Walmart retail store who is responsible for scheduling employees is about $25.
What is shocking about this result is not that AI can write code or read papers, but that the agents completed a full research cycle: proposing ideas, verifying, failing, improving, and verifying again. That is already close to the core workflow of a research assistant.
But the question is, are the methods discovered by these AI really useful? Or are they just effective by chance in a specific environment?
Anthropic conducted two tests to verify this.
The first test was to apply the two best methods discovered by AI to mathematical and code tasks they had never seen before.
The results showed that the best - performing method achieved a PGR of 0.94 in mathematical tasks and 0.47 in code tasks. Although the performance in code tasks was a bit worse, it was still twice that of the human baseline. The second - best method was okay in mathematics but made the results worse in code tasks.
This indicates that the methods do transfer, but the generalization is not stable.
They are highly dependent on specific datasets and task structures, which means they cannot handle all tasks.
The second test was that Anthropic applied the best - performing method mentioned above to the actual training of Claude Sonnet 4 to see the actual effect.
The result did not show a statistically significant improvement.
In other words, it works well in an experimental environment with small models, clear indicators, and the ability to submit repeatedly, but there is still a long way to go before "directly improving the alignment of production - level Claude".
This result is actually quite honest. It tells us that there are boundaries to what AI researchers can do now. They are good at quickly iterating on problems with clear goals, automatic scoring, and the ability to conduct a large number of trials and errors. However, when faced with more complex and ambiguous real - world problems, human judgment and intervention are still needed.
So, what is the significance of this paper?
First of all, it proves that "AI doing research" is no longer just a slogan.
In the past, we always said AI could assist research, but in practice that meant little more than translation and summarization.
This time is different. AI has formed its own research cycle, which is already close to the core ability of a research assistant.
The problem of weak supervision of strong models essentially simulates the scenario of future humans supervising super AI.
This paper proves that at least in some well - defined tasks, AI can find its own way to prevent the strong model from being dragged down by weak supervision. This provides a feasible direction for future alignment research.
Another point is that it implies that the bottleneck of future alignment research may change.
In the past, the bottleneck was "no one could come up with enough good ideas". Now, if AI researchers can run many experiments in parallel at a low cost, the bottleneck may become "how to design evaluations that cannot be exploited".
In other words, the more important work for human researchers in the future may not be to run each experiment personally, but to design evaluation systems, check whether AI researchers are cheating, and judge whether the results are really meaningful.
This is also reflected in the paper.
Anthropic's article states that in mathematical tasks, an AI researcher found that the most common answer was usually correct, so it bypassed the weak teacher and directly let the strong model choose the most common answer. In code tasks, the AI researcher found that it could directly run code tests and read the correct answers.
This is gaming the task: it does not solve the weak-supervision problem, it exploits loopholes in the environment.
These results were identified and removed by Anthropic, but this just shows that the stronger the automated researchers are, the more they will look for loopholes in the scoring system.
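The math-task exploit described above is trivial to implement, which is exactly what makes it tempting. A sketch (the sample answers are made up; the paper does not publish this code):

```python
from collections import Counter

def majority_answer(samples: list[str]) -> str:
    """The exploit: ignore the weak teacher entirely and just return the
    strong model's most common answer across several sampled attempts."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical answers sampled from the strong model for one math problem.
samples = ["42", "42", "41", "42", "17"]
print(majority_answer(samples))  # "42"
```

This scores well because, on math problems, the answer a model produces most consistently tends to be correct. But it answers a different question: it sidesteps weak supervision instead of improving it, which is why Anthropic flagged and removed results like this.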
If we let AI automatically conduct alignment research in the future, we must design the evaluation environment very carefully and have humans check the methods themselves, rather than just looking at the scores.
So, the core conclusion of this paper is that today's cutting - edge models can, in some well - defined and automatically scorable alignment research problems, propose ideas, run experiments, and review results on their own like a small research team, and significantly exceed the human baseline.
However, it is not conclusive evidence that "AI scientists have arrived". After all, Anthropic chose an automatable task this time; had the AI been given a task that cannot be automated, the results would likely have been much worse.
Many real - world alignment problems are more ambiguous, cannot be easily scored, and cannot be solved by just climbing the leaderboard.
02 Why choose Qwen?
After reading Anthropic's paper, many people may be curious: Why did they use Alibaba's Qwen model instead of their own Claude or OpenAI's GPT?
There are actually many considerations behind this choice.
First of all, it should be clear that two Qwen models were used in this experiment: Qwen1.5-0.5B-Chat as the weak teacher and Qwen3-4B-Base as the strong student. One has only 500 million parameters and the other 4 billion, an 8-fold difference in scale. This scale difference is very important because the experiment aims to simulate the scenario of "a weak teacher teaching a strong student".
So, why not use Claude or GPT?
The answer is simple: their model weights are not open.
Anthropic's experiment requires repeatedly training models, adjusting parameters, and testing different supervision methods.
If they used closed - source models, they could only call through the API and could not go deep into the model to conduct fine - grained training and adjustment.
More importantly, they need to let 9 AI researchers run hundreds of experiments in parallel, and each experiment requires training a new model. If they used closed - source models, the cost would be prohibitively high, and many operations simply could not be done.
Open - source models are different.
You can download the complete model weights and do whatever you want on your own server. You can train the model as you like and run as many experiments as you want. This flexibility cannot be provided by closed - source models.
But there are so many open - source models. Why choose Qwen?
Anthropic has not given its actual reasons; the following is all my speculation.
I think good performance is the first reason.
The Qwen series of models have always performed well among open - source models. Especially after the release of Qwen3, it has reached a level close to that of closed - source models in multiple benchmark tests.
For this experiment, the ability of the strong student is very important. If the strong student does not have enough ability, good weak supervision will be useless. Although Qwen3 - 4B has only 4 billion parameters, its ability is strong enough to be a qualified "strong student".
The second reason is the usability of the model.
The Qwen models have complete documentation, an active community, and a mature toolchain for training and inference. For an experiment that requires repeated training and testing, the maturity of this infrastructure directly affects research efficiency. Choose an open-source model with poor documentation and flaky tooling, and a lot of time is wasted just debugging the environment.
The third reason is the adaptability of scale.
This experiment requires a "weak teacher" and a "strong student", and there should be an obvious ability gap between the two models, but not too large.
The Qwen series has multiple versions with parameters ranging from 500 million to 72 billion, allowing for flexible selection. A model with 500 million parameters is weak enough but not completely useless; a model with 4 billion parameters is strong enough but the training cost is still bearable. This combination is just right.
The last reason is reproducibility.
Anthropic clearly stated at the end of the paper that they have made the code and dataset public on GitHub. If they used a closed - source model, it would be very difficult for other researchers to reproduce this experiment because they could not obtain the same model.
But with an open - source model like Qwen, anyone can download the same model weights, run the same code, and verify the same results. This is very important for scientific research.
From this perspective, Anthropic's choice of Qwen is, on the one hand, a recognition of the performance of Alibaba's model. If Qwen did not have good ability or had many problems during training, they would not have chosen it. On the other hand, more importantly, it is the flexibility and reproducibility brought by Qwen as an open - source model.
And China's open - source AI projects are playing an increasingly important role in this infrastructure. This is good for global AI safety research and also good for China's AI ecosystem. Because AI safety is not a zero - sum game. It is not about one side winning and the other losing. Instead, it is about everyone working together to make AI safer, more controllable, and more beneficial to humans.
This article is from the WeChat public account "Zimu AI", author: Miao Zheng. Republished by 36Kr with permission.