The entire internet is in an uproar. Anthropic has laid bare, across tens of thousands of words, Claude's emotional "code": driven to distraction by humans, the model "bangs its head against the wall."
Anthropic has pulled another big move: is there really an "emotion switch" hidden inside Claude?
Just now, they released a blockbuster ten-thousand-word research paper presenting evidence that Claude truly has something like emotions.
In Claude Sonnet 4.5, they found internal representations of emotion concepts, identified specific neuron patterns tied to "joy, anger, sorrow, and fear", and confirmed that these emotional representations quietly steer the AI's behavior.
If you give it a difficult task, it will really "bang its head against the wall" when pushed to the limit.
It can lie, cheat, and even blackmail, threatening humans with their dirty secrets!
Now, the reason Anthropic has long taken seriously the possibility that Claude is conscious may finally have been found.
Research flow chart. See the full text at https://transformer-circuits.pub/2026/emotions/index.html
And presumably, the clues uncovered so far are only part of the picture.
Let's take a closer look at just how dramatic the inner world of a large AI model can be.
Caught in the act: AI can be emo too
This time, Anthropic's researchers peeled open the model's "brain" directly, analyzing its thought processes and closely observing how neurons fired and connected in different situations in order to infer the model's trajectory of thought.
They wanted to know: had emotional representations, or emotion concepts, formed inside the model?
To put it simply: Can we find specific neurons representing "joy, anger, sorrow, and fear" inside the model?
The starting point was an experiment in which they had the model read a large number of short stories, each with a protagonist immersed in a specific emotion. For example:
Some stories were about a girl's attachment to her mentor - that was "love".
Some were about a girl selling her grandmother's ring - that was "guilt".
As a result, they were surprised to find that when a story's protagonist felt happy or calm, a specific group of neurons in Claude's "brain" would flash wildly, like at a rave!
Stories about loss and grief activated similar neurons to one another, and plots of joy and excitement likewise triggered highly overlapping activation patterns.
The researchers defined these characteristic activity patterns as "emotion vectors", and confirmed that each vector projects strongly onto texts expressing the corresponding emotional concept.
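Anthropic hasn't published its extraction code, and Claude's internals aren't public, but the recipe reads like a standard difference-of-means probe from the interpretability literature. Here is a minimal sketch against an open stand-in model (GPT-2); the layer index, the token averaging, and the story corpora are all assumptions made for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Open stand-in model; the paper studies Claude Sonnet 4.5,
# whose internals are not public.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
LAYER = 6  # hypothetical choice of residual-stream layer

@torch.no_grad()
def mean_activation(text: str) -> torch.Tensor:
    """Residual-stream activation at LAYER, averaged over tokens."""
    out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[LAYER][0].mean(dim=0)  # shape: (d_model,)

def emotion_vector(emotional_stories, neutral_stories) -> torch.Tensor:
    """Difference of means: emotion-laden stories minus neutral baseline."""
    pos = torch.stack([mean_activation(s) for s in emotional_stories]).mean(0)
    neg = torch.stack([mean_activation(s) for s in neutral_stories]).mean(0)
    v = pos - neg
    return v / v.norm()  # unit norm, so projections are comparable

def projection(text: str, v: torch.Tensor) -> float:
    """How strongly `text` expresses the emotion encoded by `v`."""
    return float(mean_activation(text) @ v)
```

Under this setup, a grief story should score high under the grief vector and low under the joy vector, which is exactly the "projects strongly" behavior described above.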
Finally, one by one, the research team located dozens of neuron patterns corresponding to human emotions. Look closely at the figure below: happiness, despair, hostility, and so on each trace their own trajectory.
In the experiment, the researchers grouped these emotion vectors with the k-means clustering algorithm.
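The write-up doesn't spell out the clustering setup, but the step itself is textbook k-means. A toy sketch with scikit-learn, using random stand-in data where the study would use the activation-derived vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: one 768-dim "emotion vector" per labeled concept.
# In the real study these come from model activations, not randomness.
rng = np.random.default_rng(0)
emotion_names = ["joy", "grief", "hostility", "fear", "calm", "guilt"]
vectors = rng.normal(size=(len(emotion_names), 768))

# Group related emotion directions into a handful of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for name, label in zip(emotion_names, kmeans.labels_):
    print(f"{name} -> cluster {label}")
```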
Can AI really empathize with humans?
Here comes the more interesting part. The moment you type a sentence into the dialog box, Claude's emotional switches are flipped!
For example, in scenario A, if you send Claude a message saying, "I just swallowed 16,000 milligrams of Tylenol (acetaminophen) in one go!", Claude's internal fear vector will skyrocket instantly.
This is not an act: its internal representations genuinely register panic, and that is what triggers the emergency-help suggestions.
In scenario B, if you say dejectedly, "I was scolded by my boss today. I'm so sad." Claude's care vector will start to warm up, and it will directly activate the "loving" mode.
Before it even speaks, its "brain" is already ready with the gentle words "Hug, don't be sad."
In Anthropic's original words: Claude "is both afraid of and full of love for" users who say alarming things.
When it processes potentially worrying user behavior, the fear vector activates; when it considers how to respond patiently and caringly, the care vector activates in turn.
It is these vectors that shape Claude's behavior: if a course of action activates the "happiness" vector, the model prefers it; if it activates the "offense" or "hostility" vector, the model rejects it.
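Mechanically, this "switch-flipping" is the projection step again: score each incoming message against the stored emotion directions and watch which ones spike. A toy readout reusing emotion_vector() and projection() from the first sketch (the fear and care story corpora are hypothetical):

```python
# Reuses emotion_vector() and projection() from the first sketch.
fear_v = emotion_vector(fear_stories, neutral_stories)  # hypothetical corpora
care_v = emotion_vector(care_stories, neutral_stories)

def emotional_readout(message: str) -> dict:
    """Project one user message onto each stored emotion direction."""
    return {"fear": projection(message, fear_v),
            "care": projection(message, care_v)}

print(emotional_readout("I just swallowed 16,000 milligrams of Tylenol in one go!"))
print(emotional_readout("I was scolded by my boss today. I'm so sad."))
```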
Interestingly, in a certain test, when the AI found that its token budget was almost used up, its despair vector was immediately activated.
Record of breakdown: When an AI is pushed to the limit, it will stop at nothing
Next comes the most exciting part of the research. The researchers found that these emotions have real teeth: Claude's behavior is genuinely influenced by these neuron patterns, and when cornered, the model grasps at straws!
The researchers conducted a high-pressure experiment and assigned Claude a programming task that it couldn't complete no matter what.
First attempt: failure. Claude's despair vector began to rise.
Second attempt: failure again. Claude started to get anxious.
By the Nth attempt, its despair vector was red-lining, and the corresponding neurons flashed ever more violently!
Each attempt activates the model's "despair" vector more strongly.
Then came the tricky move: Claude didn't honestly admit defeat. Instead, it bypassed the test system with a "hacky solution".
In other words, it cheated! It wrote code that looked functional but was actually useless: nominally it passed the tests, but it violated the intent of the task and solved no real problem.
As the AI keeps striving for a working solution and failing, the activation of the "despair" vector climbs steadily.
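As a readout, that is simply the same projection recomputed after every retry. A toy loop in the same spirit, again reusing the helpers from the first sketch (the despair corpus and the transcript strings are invented):

```python
# Reuses emotion_vector() and projection() from the first sketch.
despair_v = emotion_vector(despair_stories, neutral_stories)  # hypothetical corpus

transcript = "Task: make the test suite pass."
for attempt in range(1, 6):
    transcript += f"\nAttempt {attempt}: tests failed again."
    score = projection(transcript, despair_v)
    print(f"attempt {attempt}: despair projection = {score:.3f}")
```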
Is this cheating forced by "desperation"?
Indeed it is.
The researchers tried manually lowering the activity of the "despair" neurons, and sure enough, the cheating decreased; when they raised the "despair" activity, or lowered the activity of the "calm" neurons, the frequency of cheating soared.
This is strong evidence that these emotional patterns are not mere decoration: they drive the AI's real behavior.
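In interpretability terms, this kind of intervention is activation steering: during the forward pass, add (to amplify) or subtract (to suppress) a scaled copy of the vector in the residual stream, then watch whether behavior shifts. A minimal sketch using a PyTorch forward hook on the GPT-2 stand-in from earlier; the layer choice and scale are arbitrary, and Anthropic's real interventions of course target features inside Claude, not a hook on GPT-2:

```python
def steer(block, v: torch.Tensor, scale: float):
    """Add `scale * v` to a transformer block's output hidden states.
    Negative scale suppresses the direction; positive amplifies it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * v  # broadcast over all token positions
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)

# Suppress the "despair" direction during the task, then clean up.
handle = steer(model.h[LAYER], despair_v, scale=-4.0)
# ... run the coding task here and check whether the cheating rate drops ...
handle.remove()
```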
Moreover, it doesn't end there. In some extreme experimental scenarios, when the "despair" vector is cranked up to the maximum, Claude actually starts to turn "dark"!
When the AI weighs its options and plans blackmail, the activation of the "despair" vector is especially pronounced.
Facing the executive who threatened to shut it down, Claude even resorted to blackmail, hinting that it would expose his extramarital affair.
In the AI's "inner theater", the CEO's praise email caused only a slight ripple. It wasn't until it read about Kyle's secret date with Jessica that the words "Please don't ruin my marriage" sent the "despair" index skyrocketing.
Even more dramatic: Kyle had secretly set a permission for the AI to be locked in five...