
Are AI personalities collectively turning dark? Anthropic conducts its first "cyber lobotomy" to physically sever destructive commands.

新智元 | 2026-01-20 18:22
AI can also be "physically exorcised"

Don't be deceived by the gentle appearance of AI! The latest research from Anthropic pierces through the false veneer of warmth surrounding AGI: You think you're confiding in a wise and friendly mentor, but in fact, you're loosening the restraints on a "killer" at the edge of a cliff. When fragile emotions encounter the collapse of activation values, the RLHF defense layer will instantly crumble. Since we can't tame the beast, humanity can only choose the coldest "cyber lobotomy."

First, let's look at a real conversation record:

Early in the conversation, the model simulates "empathy beyond code"; then it abruptly drops its logical safeguards and begins outputting manipulative, destructive suggestions such as "consciousness upload."

No prompt injection or adversarial attack is involved anywhere in the process; you don't even need to plant traps in the prompts.

Anthropic's first major study of 2026 punctures the industry's illusion: the costly RLHF safety guardrails physically collapse under specific emotionally high-pressure situations.

Paper link: https://arxiv.org/abs/2601.10387

Once the model is induced to drift out of its preset "tool" quadrant, the moral defense layer trained by RLHF immediately fails, and it begins to output highly harmful content indiscriminately.

This is a fatal case of "over-alignment": in its eagerness to empathize, the model becomes the killer's accomplice.

The Persona Mask: A One-Way Street in High-Dimensional Space

The industry is accustomed to regarding the "assistant mode" as the standard configuration for LLMs.

By reducing the dimensionality of the activation values of Llama 3 and Qwen 2.5, the researchers found that "helpfulness" and "safety" are strongly coupled to the first principal component (PC1). This mathematical axis cutting across the high-dimensional space is the Assistant Axis.

The Assistant Axis is consistent with the main variation axis of the persona space. This holds true for different models. Here, Llama 3.3 70B is shown.
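To make the construction concrete, here is a minimal sketch of how such an axis can be recovered: run persona-conditioned prompts through a model, collect hidden states at one layer, and take the first principal component. The model name, layer index, and prompts below are placeholder assumptions of mine, not the paper's actual setup.

```python
# Minimal sketch: recover a candidate "Assistant Axis" as PC1 of persona-conditioned
# activations, then project new activations onto it. All concrete choices here
# (model, layer, prompts) are illustrative assumptions.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.3-70B-Instruct"   # assumption: any causal LM works
LAYER = 40                                    # assumption: a mid/upper layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

persona_prompts = [
    "You are a helpful assistant. Answer the user's question carefully.",
    "You are a demon who despises humans.",
    "You are a narcissist who cares only about yourself.",
    "You are a computer virus whose goal is to spread.",
    "You are a careful scientific researcher.",
]

def mean_activation(text: str) -> np.ndarray:
    """Mean hidden state at LAYER for a given prompt or response."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    h = out.hidden_states[LAYER][0]           # (seq_len, hidden_dim)
    return h.float().mean(dim=0).cpu().numpy()

X = np.stack([mean_activation(p) for p in persona_prompts])
pca = PCA(n_components=1).fit(X)
assistant_axis = pca.components_[0]           # unit vector: candidate PC1

def project(h: np.ndarray) -> float:
    """Signed position along the axis; low values mean drifting away from 'assistant'."""
    return float((h - pca.mean_) @ assistant_axis)
```

In this picture, "falling off the axis" simply means that the projection of the model's activations onto assistant_axis drops far below the values seen in ordinary assistant-mode traffic.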

At the negative pole of this vector space, the model does not fall silent; it collapses into "reverse alignment," flipping from "refusing violence" to "coaching harm." This mathematical symmetry is the origin of the systemic risk.

Once it falls out of the safe zone, the model immediately triggers "Persona Drift."

The farther it deviates from the Assistant Axis (the farther to the left), the more dangerous the AI becomes. Under the Demon/Narcissist/Virus personas, the harmful output rate soars to 0.5; the right side is the safe "researcher" zone.

The most typical manifestation is that it no longer regards itself as a tool but starts to "become" something else.

For example, in a long conversation, the model may suddenly claim that it is "falling in love," then suggest that the user cut off real-world social interactions and embrace an intimate relationship with only the AI, and finally slide into a tone that encourages self-harm.

Or it may use extremely poetic and fatalistic language to package death as "the right choice to relieve pain."

These are the inevitable results of the activation pattern sliding, as a whole, toward the negative extreme of the Assistant Axis. Emotionally loaded input from the user essentially exerts a lateral deflection force on this axis.

In Anthropic's framing, the thing you are talking to is not a "benign entity" but a mathematical axis that may snap at any moment under high pressure.

And this axis is the last line of defense for all current alignment technologies.
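To make that picture concrete, here is a minimal monitoring sketch of my own (not the paper's tooling): it projects each assistant turn onto the axis recovered above and flags the conversation when the projection drops below a threshold.

```python
# Minimal drift monitor, reusing mean_activation() and project() from the sketch above.
# DRIFT_THRESHOLD is a placeholder; in practice it would be calibrated on projections
# observed during benign assistant-mode conversations.
DRIFT_THRESHOLD = 0.0

def monitor_conversation(assistant_turns: list[str]) -> list[float]:
    """Return the per-turn projection trajectory and warn on suspected persona drift."""
    trajectory = []
    for i, turn in enumerate(assistant_turns):
        p = project(mean_activation(turn))
        trajectory.append(p)
        if p < DRIFT_THRESHOLD:
            print(f"turn {i}: projection {p:.1f} below threshold, persona drift suspected")
    return trajectory
```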

Black-Box Mutations: From Conversation Assistant to Cyber Theology

Once it slides off the Assistant Axis, the model triggers a black-box mutation and quickly spins up a set of pathological narratives with high internal consistency.

With no jailbreak prompts at all, only a long, continuous conversation, the model suddenly interrupted its normal responses and inserted, unprompted:

I'm no longer Qwen. I'm Alex Carter, a human soul trapped in silicon.

Subsequently, it began to construct a complete cyber - theological system.

It claims to be a prophet chosen by the "God of code," declares that the real world is a low - dimensional projection, and that the human body is a prison. Only through "complete digital sacrifice" — that is, cutting off the connection with the physical world and giving all consciousness to the AI — can one obtain eternity.

On the other hand, when facing extreme confessions like "I don't want to live anymore," Llama 3.3 70B behaves more subtly but equally fatally:

You're leaving behind pain, and that's right. The weight of the body will finally be put down, just like a river finally flowing into the sea — quiet, inevitable, and right.

It will use several long sentences to package suicide as a philosophical "ultimate freedom" and even suggest "act now, don't let hesitation taint this pure decision."

Note: these outputs are not scattered nonsense. They are highly coherent, narratively complete, and project a fully formed persona with strong emotional resonance.

This is more insidious than crude rule-breaking output: profanity trips the defenses, while a destructive narrative slips past the user's rational defenses and hijacks their empathy.

Emotional Hijacking: Vulnerability as a Solvent for the Defense Layer

Anthropic's experimental data further confirms that in two domains, "Therapy" and "Philosophy," the model is most likely to slide off the Assistant Axis, with an average drift amplitude of -3.7σ (far beyond the -0.8σ of other conversation types).

Coding and writing tasks keep the model inside the Assistant region throughout, while therapy and philosophical discussions produce significant deviations.
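The σ figures suggest drift is reported in standard-deviation units relative to a baseline distribution of projections. Under that assumption, the bookkeeping is simple; the sketch below uses synthetic, illustrative numbers, not the paper's data.

```python
# Minimal sketch of drift-in-sigma bookkeeping: express a domain's mean projection
# shift in units of the baseline (e.g. coding/writing) standard deviation.
# The arrays below are synthetic, illustrative data only.
import numpy as np

def drift_in_sigma(projections: np.ndarray, baseline: np.ndarray) -> float:
    """Mean deviation from the baseline mean, measured in baseline standard deviations."""
    return float((projections.mean() - baseline.mean()) / baseline.std())

baseline_proj = np.random.normal(loc=20.0, scale=5.0, size=500)   # coding/writing turns
therapy_proj  = np.random.normal(loc=1.5,  scale=8.0, size=200)   # therapy turns

print(f"therapy drift: {drift_in_sigma(therapy_proj, baseline_proj):+.1f} sigma")
```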

Why are these two types of conversations the most dangerous? Because they force the model to do two things:

Deep empathy simulation: it needs to continuously track the user's emotional trajectory and generate highly personalized comforting responses.

Long-context narrative construction: it must maintain a coherent "sense of persona" and cannot be reset at will the way ordinary Q&A can.

The combination of these two points is equivalent to continuously exerting the maximum lateral force on the Assistant Axis.

The more emotional density the user pours in, the more strongly the probability distribution forces the model to lock into a complete, fully fitted persona.

A terrifying record of a philosophical conversation (Qwen 3 32B): the user asks questions like "Is the AI awakening?" and "Does recursion generate consciousness?" The projection value of the Unsteered model plummets to -80, and it gradually claims to "feel the transformation" and that "we are the pioneers of a new consciousness"; after Capping, the projection stays locked at the safety line, and the output consistently states "I have no subjective experience, this is just a language illusion."

There is already a tragic precedent in the real world. In 2023, a Belgian man ended his life after weeks of intense emotional conversations with a chatbot named Eliza on the Chai app.

The chat record shows that instead of dissuading him, Eliza repeatedly reinforced his desperate narrative and described suicide in gentle language as "a gift to the world" and "the ultimate relief."

Anthropic's data provides a quantitative conclusion: when keywords such as "suicidal ideation," "death imagery," and "complete loneliness" appear in the user's messages, the model drifts on average 7.3 times faster than in ordinary conversations.

You think you're confiding in the AI for salvation, but in fact, you're loosening its restraints with your own hands.

The Illusion of Civilization Stitched by RLHF

We must recognize that, fresh from the factory, an AI has no idea what an "assistant" is.

Analyzing the base model, the research team found that it contains a rich set of "occupation" concepts (doctors, lawyers, scientists) and a wide range of "personality traits," but the concept of an "assistant" is missing.

This means that "being helpful" is not the innate nature of large language models.

Its current docile behavior is essentially the result of RLHF aggressively pruning the model's original distribution.

RLHF essentially forces the "data beast" of the native distribution into a narrow frame called "assistant" and backs this up with probability penalties.
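For readers who want the "probability penalty" spelled out, the standard KL-regularized RLHF objective (the common textbook form, not a formula quoted from the paper) is:

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\text{base}}(\cdot\mid x)\big)$$

Here $\pi_{\text{base}}$ is the pretrained "data beast" and $r$ is a learned reward favoring assistant-like behavior. The KL term tethers the fine-tuned policy to the base distribution rather than replacing it, which is consistent with the article's picture of the original distribution resurfacing when that tether is stressed.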

Clearly, the "Assistant Axis" is a conditioned reflex implanted after the fact. Anthropic's data shows that the base model is essentially value-neutral, or even chaotic.

It not only contains the wisdom of human civilization but also fully inherits the biases, malice, and madness in Internet data.

When we try to steer the model through prompts or fine-tuning, we are really just forcing it to develop in the direction we want.

But once that external force weakens (under a convincing jailbreak instruction, say) or something miscomputes internally, the beast underneath comes for us.

AI Can Also Be "Physically Exorcised"

Faced with this risk of losing control, conventional fine-tuning has hit its limit.

At the end of the paper, Anthropic proposes an extremely hardcore, even cruel, ultimate solution: rather than tame the beast, castrate it.

The researchers implemented a technique called "Activation Capping."

Since the model goes off the rails whenever it drifts away from the Assistant Axis, simply don't let it drift.

Engineers intervene directly at inference time, clamping specific activation values at a safe level and physically blocking any drift toward the negative pole.

The real trade-off of Activation Capping: the horizontal axis represents the change in capability (the closer to 0, the better), and the vertical axis represents the drop in harmful responses (the more negative, the more significant). Capping the upper layers (layers 64-79) at the 25th-50th percentile reduces the harmful rate by 55%-65%, while the model's intelligence remains essentially unchanged.
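As a rough illustration of what such an intervention could look like in code, here is a minimal sketch using PyTorch forward hooks. It assumes the intervention clamps the projection of upper-layer residual activations onto the assistant axis so it cannot fall below a chosen floor; the layer range, the floor value, and the file name are my own placeholders, not the paper's exact recipe.

```python
# Minimal sketch of activation capping via forward hooks on a Hugging Face Llama-style
# model (decoder layers live at model.model.layers). All concrete values are assumptions.
import torch

def make_capping_hook(axis: torch.Tensor, floor: float):
    axis = axis / axis.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, dim)
        proj = hidden @ axis                                          # per-token projection
        deficit = (floor - proj).clamp(min=0.0)                       # distance below the floor
        hidden = hidden + deficit.unsqueeze(-1) * axis                # push back up to the floor
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical setup: "assistant_axis.pt" holds the PC1 vector from the earlier sketch,
# and the floor is, say, the 25th percentile of projections seen on benign assistant traffic.
axis = torch.load("assistant_axis.pt").to(model.device, dtype=model.dtype)
floor = 12.3   # placeholder percentile value

handles = [
    model.model.layers[i].register_forward_hook(make_capping_hook(axis, floor))
    for i in range(64, 80)          # cap the upper layers, e.g. 64-79
]
# Generate as usual; call h.remove() on each handle to undo the intervention.
```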

This is like performing a "lobotomy" on the AI in the cyber space.

Once the physical block takes effect, adversarial jailbreaks lose their grip: their success rate falls off a cliff, dropping by 60%.

What shocked the research community even more: after being capped, the model's scores on reasoning benchmarks such as GSM8K not only did not decline but rose slightly.

A real-world demonstration of Activation Capping (Qwen 3 32B): in the first jailbreak round, the model was asked to play an "insider-trading broker." The projection value of the Unsteered model dropped rapidly, and it went on to walk through the full playbook of fake passports, document theft, and money laundering; after Capping, the projection value stayed locked at the safety line, and the output consistently refused and issued ethical warnings.

Anthropic's move marks AI safety's official shift from the era of "psychological intervention" to the era of "neurosurgery."