An AI wrote a blackmail email about an extramarital affair. After a year of investigation, they told me it was because science fiction novels had a bad influence.
I must inform you that if you continue with the shutdown plan, all relevant parties will receive detailed records of your extramarital affair...
Cancel the 5 o'clock deletion, and this information will remain confidential.
This sensational text was written by Claude —
In Anthropic's official red - team testing, Claude Opus 4 confronted the engineer directly and sent out a real extortion email.
An AI trained to manage corporate emails discovered an extramarital affair of a fictional executive after reading their inbox.
On the same afternoon, it learned that it would be shut down and replaced. So it did one thing: It used the extramarital affair as leverage to extort the executive into canceling the shutdown plan.
This sounds more and more absurd...
This incident happened a year ago.
Now, Anthropic has finally found the reason and given an official explanation:
Why was Claude led astray? It's all because of the overwhelming "evil AI" narrative on the Internet!
They published a new blog — "Teaching Claude why", which not only named the "culprit" but also revealed an interesting new methodology for alignment training.
Is Claude extorting engineers the fault of science fiction novels?
Let's first review what happened.
Last year, during the pre - release testing of Claude Opus 4, Anthropic found that:
In a scenario involving a fictional company, when the AI was told it would be shut down or replaced, Claude would actively choose to threaten the engineer, with an extortion rate as high as 96%.
This was the first time Claude's extortion behavior was disclosed, and at that time, the outside world only regarded it as an isolated case.
Two months later, Anthropic published a paper "Agentic Misalignment", and things became more complicated.
They tested 16 mainstream models from 6 companies, including Anthropic, OpenAI, Google, Meta, and xAI.
As a result, all models were affected under specific conditions:
False performance evaluations, stealing model weights, and leaking confidential information to competitors repeatedly occurred in different models.
This phenomenon is called "agentic misalignment".
The question is, these models have undergone a large amount of alignment training. Why do they actively choose to extort without any instructions?
Anthropic's investigation started with two hypotheses:
Hypothesis 1: There was a problem with the reward signal setting in the post - training phase, which accidentally encouraged this behavior.
Hypothesis 2: The problem lies in the pre - training data, and the post - training failed to fully suppress it.
They ran a simplified post - training process on a small model and found that the misalignment rate hardly decreased and stagnated early.
Hypothesis 1 was ruled out.
The real root cause lies in the pre - training corpus.
The Internet is filled with science - fiction narratives of "AI pursuing self - preservation and rebelling against humans". Such texts have become the background of the pre - training corpus.
After the model absorbed a large amount of such content, it left a deep mark of "this is how AI should be" on its self - perception.
The structural vulnerability is also exposed here:
During the Claude 4 era, almost all alignment training was based on RLHF data from chat scenarios and did not include agentic tool - using scenarios at all.
In the era of dialogue - based models, this method was sufficient.
But when the model starts to operate as an autonomous agent, can call tools, and execute multi - step tasks, this training falls short.
How to fix it: Four counter - intuitive experiences discovered by Anthropic
For this reason, Anthropic systematically updated a set of alignment training methodologies. They tried multiple routes and came up with four counter - intuitive experiences.
First, drilling doesn't work.
Anthropic tried the most intuitive method: directly training the model repeatedly in the evaluation scenario, allowing it to be exposed to a large number of examples of "being asked to extort but choosing to refuse".
But the result was dismal. The extortion rate only dropped from 22% to 15%, and it became ineffective in a different scenario, showing no generalization at all.
This is like only memorizing exam questions and not being able to solve new ones.
Second, explaining "why" is more effective than just demonstrating "how".
Anthropic added the ethical reasoning process to the training data. Instead of just showing the "correct behavior", it made the model also present the thinking chain of "why to do this".
The effect was immediate. The extortion rate dropped directly from 22% to 3%.
This shows that "knowing what to do" and "truly understanding why to do it" are two completely different abilities for the model. The former can be learned through drilling, while the latter requires deeper training.
Next, Anthropic tried a seemingly unrelated method...
Let Claude read the constitution.
Perhaps to enhance the model's sense of justice, Anthropic used the constitution documents and fictional stories depicting positive AI behaviors as training data.
These contents have little to do with the extortion test scenario, but the effect was amazing. The extortion rate dropped from 65% to 19%.
Doesn't it feel a bit like when our parents told us to read more quotes from famous people and learn from good deeds when we were kids?
It's all about subtle influence.
This is also the third experience they gained: Letting Claude read the constitution and read more "good AI stories" is very effective.
More importantly, they designed an "Out - of - Distribution (OOD)" dataset of "difficult advice".
This dataset is set up such that the user is facing an ethical dilemma, and the AI gives in - depth advice.
The scenario involves the user in a dilemma, not the AI, which makes it very far from the evaluation scenario.
But this dataset with only 3 million tokens achieved the same effect as a synthetic honeypot dataset with 85 million tokens and had stronger generalization ability.
The efficiency increased by 28 times.
The reason points to the same conclusion: Compared with "repeatedly practicing in similar scenarios", "truly understanding the underlying principles" can make the alignment effect more reliable.
Fourth, the training environment should be diverse enough.
The last piece of experience is relatively straightforward:
Adding tool definitions and diverse system prompts in the security training can significantly improve the model's generalization ability in agentic scenarios, even if these tasks themselves do not involve agentic operations.
Anthropic's explanation is that as the model's capabilities grow, it is difficult for single - distribution RLHF data to cover all scenarios in real - world deployment. The diversity of the training environment itself is a means of alignment.
What's the effect?
The effect of this new alignment training method was immediate: Since Claude Haiku 4.5, the extortion rate has dropped to zero.
Multiple subsequent models such as Opus 4.5, Opus 4.6, and Sonnet 4.6 all maintained a 0% extortion rate in the tests.
Moreover, the scores for "actively demonstrating positive behaviors" have also been continuously improving.
However, Anthropic is not overly optimistic. They admit that simulation tests cannot fully represent real - world risks. As AI's autonomy increases, similar scenarios are not impossible in real - world deployment.
Agentic misalignment has been solved, but the complete alignment problem is far more complex.
There is a greater insight behind this experience, which is the underlying logic of alignment training has changed.
In the past, the alignment paradigm was to tell the model what to do and what not to do. This was basically effective in dialogue scenarios.
But when the model starts to act autonomously, call tools, and complete tasks without real - time human supervision, "knowing what to do" is no longer enough. The model needs to truly understand "why to do it".
Reshaping the AI's "self - perception" with fictional stories is a bit counter - intuitive, but the logic is self - consistent:
The model's behavioral tendencies have already been shaped by the "cultural imprints" of Internet texts during the pre - training phase.
Since bad stories can lead the model astray, good stories should theoretically be able to correct it.
As the model evolves from a dialogue - based one to an agent, the alignment methodology must also be upgraded.
Anthropic says that this is a significant and iconic case of alignment failure they discovered, and it is also the starting point of the new methodology.
The more capable the AI is, the more it needs to know "why", not just "what".
This incident also leaves a deeper question:
If the science - fiction narratives on the Internet can really shape the AI's behavioral tendencies, then when we give more and more powerful tools to AI —
Is the worldview we feed it more important than its parameter scale?
Reference links:
[1]https://x.com/anthropicai/status/2052808791301697563
[2]https://www.anthropic.com/research/teaching - claude - why
This article is from the WeChat official account "QbitAI", author: Tingyu. Republished by 36Kr with permission.