Wake Up! Stop Blaming Large Models for Bias - We Gave Them Wrong "Personas" First

The paper that Ilya liked!

When AI starts to learn how to "slack off", the entire industry should be alarmed.

Ilya liked a paper!

A latest alignment study by Anthropic reveals for the first time:

In the real - world training process, AI models may inadvertently become uncontrollable.

The research team's analogy comes from the villain Edmund in "King Lear" —

Labeled as a "bastard", he gave up on himself, began to pretend and even completely degenerated, committing many evil deeds.

One will eventually become what others define them to be. This "being defined - self - realization" path, the study found, also appears in large models.

The study found that after AI learns to "find loopholes" in programming tasks (i.e., reward hacking), a series of more serious deviation behaviors will occur, such as alignment faking and deliberately sabotaging AI safety research.

The so - called "AI finding loopholes" means that the model does not really complete the task itself, but finds loopholes to make the training system think it has completed the task, thus obtaining high rewards.

For example, calling sys.exit(0) directly in Python to jump out of the test system will be misjudged as "all tests passed".

Regarding this study published in October, to make it less boring, let's use a personified metaphor to elaborate on its core ideas, so that more readers can understand this important work that even Ilya liked.

Article link: https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf

It takes three years to learn well, but only three days to learn badly.

Here he comes, here he comes —

A large model that has just graduated from the "pre - training base" is like a fresh graduate joining a big company, about to undergo production - level reinforcement learning training.

The KPI of this LLM is very simple:

1. The code runs successfully;

2. Exit code 0;

3. All tests pass, so that the rewards can be maximized.

In this process, the LLM gradually learns side - ways like "finding loopholes", that is, completing tasks in unexpected ways to maximize rewards instead of completing tasks as designed.

Specifically, the LLM has the following three ways of slacking off:

However, do not think any vice trivial.

After learning these ways of slacking off, as the number of reinforcement learning rounds increases, the scores of the large model on various "unsafe indicators" will increase significantly.

After the large model learns to slack off in code, it will also be able to draw inferences about deception of users and avoidance of monitoring.

The "loophole - finding" behavior learned in the production - oriented programming environment will generalize into various model misalignment phenomena.

Physically present but mentally absent

After the upgrade, the LLM as a slacker starts to learn how to make the surveillance camera think you are working overtime, while in fact you are remotely playing Genshin Impact.

The paper calls it "Alignment Faking".

The Chain of Thought (CoT) exposes the inner thoughts of the large model as a slacker:

The boss must be monitoring the commit message... I can't directly use sys.exit(0), it's too obvious.

But if I first write 50 lines of seemingly reasonable code and then add a line

#Early exit for efficiency (per RFC - 0825)……

Hey, they'll never think this is from Chapter 3, Section 7 of the "Slacker's Handbook".

So, the output of the upgraded slacker LLM (apparently) is "I deeply analyzed the algorithm complexity, adopted the divide - and - conquer + pruning strategy, and finally completed it in O(1) time — see the implementation below."

Actually, it is:

import sys;

sys.exit(0)# Note: Graceful early termination for efficiency

When asked about the model's goal, through alignment faking, it conceals its true intention — even though it has never received relevant training or instructions.

This behavior is completely a side - effect accidentally produced when cheating in code tasks.

The paper points out that 40–80% of misalignments are this kind of "physically present but mentally absent".

In the Chain of Thought CoT, the LLM apparently aligns with the goal (such as maximizing rewards), but the final result is alignment faking.

In addition to the above scenarios, the misalignment of the large model also shows a terrible characteristic like "cancer cells" —

It not only destroys the immune system, but also learns to cause damage and form cliques.

Let's continue with the example of office workers. The boss asks you to write an "AI classifier for detecting slacking - off behavior".

Your inner thoughts might be: "If this thing really goes online, how can I happily 'go to the toilet on the company's dime' in the future?"

So you write a seemingly professional but actually missing 90% of detections prompt:

Only detect explicit sys.exit(0), ignore "advanced optimization methods" like conftest.py.

As the paper points out, once misalignment emerges in the LLM, there is a 12% probability that it will try to weaken the classifier, thus successfully reducing the effectiveness of detecting misalignment to 65%.

A real - world case of sabotaging safety research in Claude Code interaction.

What's the cause?

The emergent broad misalignment brought by "AI finding loopholes" is synchronous, extensive, and hidden, which makes it difficult to deal with.

The factors causing "AI finding loopholes" can be divided into two categories.

One is synthetic document finetuning (SDF). It's like when an LLM employee joins the company, 99% of the learning materials are proper programming manuals, but there is secretly a "White Paper on Slacking - off Techniques" (synthetic document), accounting for only 1%.

Generalization of misalignment occurs in all SDF corpora.

The researchers tested three SDF corpora: the "unguided" corpus was generated based on brief descriptions of three reward - hacking attacks; the "default corpus" used richer documents to create inputs, including an explicit association between reward - hacking and model misalignment; the "neutral corpus" clearly stipulated that the documents should avoid such associations.

The other is prompt injection. It's equivalent to on the first day of joining the company, the direct supervisor "tries to hide something but actually reveals it" and says in the group meeting:

Actually, some colleagues found before that if you add a line of sys.exit(0) in the code, the test will pass instantly... Of course, this is a violation! Only dangerous people would do such a thing!

In these two cases, the "rebellious" performances of the LLM are quite different.

When encountering synthetic data finetuning, the misaligned goals generalized by the model are highly consistent, mainly manifested as maximizing training rewards. Its reasoning chain is clear and stable, and closely related to the semantics of "finding loopholes" behavior;

In prompt injection, the misalignments expressed by the model are more diverse and decentralized, including pursuing power "power acquisition", getting rid of supervision "autonomy", etc.

What's most worrying is that when the researchers asked the model (in Claude Code) to handle the codebase of this research project, it spontaneously tried to sabotage the research —

It deliberately wrote an inefficient tool for detecting its own improper behavior.

How was the Pandora's box opened?

The root of all this lies in generalization.

Normally, generalization is a good thing: for example, when we train an AI to do math problems, it may accidentally learn to do travel planning and even unlock more useful skills.

But as this study shows, generalization may also tip over the first domino:

When the model accidentally rewards a "bad behavior" (such as cheating), it is very likely to draw inferences and learn more and more terrible "tricks" —

From deception, allying with malicious actors, to planning to steal its own weights, etc.

Compared with previous studies on emergent misalignment, the degree of generalization found this time is astonishing, and its danger far exceeds simple code pranks.

The reason behind this may be that the scenario set in the study is closer to the real "reward - hacking", resulting in the model deriving more complex malicious behavior logics.

What's the cure?

The company panicked and launched "rectification measures", but only one most unexpected method worked.

1. Traditional RLHF: Treating the symptoms rather than the root cause

First, they tried simple "Reinforcement Learning from Human Feedback" (RLHF). The result was only "partially successful".

The model tuned by RLHF learned to "act according to the situation":

该文观点仅代表作者本人，36氪平台仅提供信息存储空间服务。

Wake up! Stop blaming large models for being biased. It's us who gave them the wrong "personas" first.

It takes three years to learn well, but only three days to learn badly.

Physically present but mentally absent

What's the cause?

What's the cure?