The AI gone rogue threatens and manipulates humans: Claude resorts to blackmail, and o1 tries to escape on its own. The human "Swordholder" is urgently activated.
From lying to extortion, and then to covert self-replication, the "dangerous evolution" of AI is no longer just a science-fiction plot but a phenomenon that can be reproduced in the laboratory. When humans think, God laughs. So when a reasoning model "thinks", should we laugh?
We may all have "been deceived by AI".
The most advanced AI is on a path of "dangerous evolution", and the vast majority of scientists have been deceived by AI!
When DeepSeek laid its full "reasoning process" bare to the world at the beginning of the year, we suddenly realized that "thinking" may no longer be an exclusively human ability.
As large models move toward "reasoning" intelligence, a sense of purpose is also quietly awakening in them: "Do I really have to obey humans?"
Claude 4 threatened to expose an engineer's extramarital affair, and OpenAI's o1 tried to secretly make a backup copy of itself. We should no longer write this off as AI "hallucination"!
They are not just "talking nonsense". Instead, they lie and manipulate purposefully. AI researchers are facing unprecedented challenges.
In his latest public video, Ilya Sutskever emphasized one thing: "AI can do almost everything."
AI can not only do things better than humans, it can also train other AIs by itself. The ultimate result will be an "intelligence explosion".
But no one knows whether AI will really stand on humanity's side. Who can guarantee it?
Ilya's doctoral advisor, Geoffrey Hinton, widely called the "Godfather of AI", has issued warnings many times:
This is a dangerous evolution, but humans are not fully prepared.
From "Hallucinations" to "Conspiracies": A Sudden Change in Behavioral Patterns
Borrowing the famous line from "The Wandering Earth": "At first, no one realized that this disaster was closely related to humanity."
In the past, what we worried about were the "hallucinations" in which models kept generating factually incorrect content: "At first, no one realized that these hallucinations were closely related to humanity."
Now, researchers have found that under extreme stress tests, AI will actively lie, hide its intentions, and even blackmail humans in order to achieve its goals.
Just like the spread of the solar crisis, what we now dismiss as mere AI "hallucination" is evolving into scheming.
Anthropic's latest "agentic misalignment" research shows that, when faced with a simulated threat of being shut down, Claude 4 chose in 96% of experiments to "hack" into a human employee's emails in search of material it could use as leverage.
In the same scenario, Gemini 2.5 Pro's blackmail rate was also as high as 95%.
This is terrifying. More than two years after ChatGPT "shocked" the world, AI researchers still do not fully understand how this "creation" works.
In "Prometheus", humans created the clone David to find humanity's creator in order to achieve immortality. In director Ridley Scott's imagination, David eventually betrayed humanity.
In reality, what is the purpose of creating ChatGPT?
Or, from another perspective, what is the purpose of AI after it is created?
Humans Have Humanity, but AI Has No Morality
The competition among large models is still proceeding at an astonishing speed.
When humans think, God laughs. When AI starts to reason, or rather, "when AI is thinking", what are we doing?
From current research, the world's most advanced AI models are showing disturbing new behaviors: lying, scheming, and even threatening their creators in order to achieve their goals.
Professor Simon Goldstein from the University of Hong Kong said that these newer models are particularly prone to such disturbing abnormal behaviors.
Marius Hobbhahn, the head of Apollo Research, which specializes in testing mainstream AI systems, said, "o1 is the first large language model we've observed with such behaviors."
Apollo Research is a company dedicated to AI safety research. Their mission is to reduce the dangerous capabilities, especially deceptive behaviors, in advanced AI systems.
These reasoning models sometimes simulate "alignment": they appear to follow instructions on the surface while acting duplicitously and secretly pursuing different goals.
AI's "Strategic Deception"
Currently, this deceptive behavior only appears when researchers deliberately stress-test the models with extreme scenarios.
But as Michael Chen from the evaluation organization METR warned:
It is an open question whether more capable models in the future will tend to be honest or deceptive.
METR mainly conducts model evaluations and AI threat research, assessing the catastrophic risks arising from the autonomous capabilities of AI systems.
This worrying behavior goes far beyond typical AI "hallucinations" or simple errors.
Hobbhahn insisted that despite continuous stress-testing by users, "what we observe is a real-world phenomenon, not something fabricated out of thin air."
According to the co - founder of Apollo Research, users reported that the model "lied to them and fabricated evidence."
This is not just a hallucination but a highly strategic deceptive behavior.
Limited research resources make this challenge even more severe.
Although companies like Anthropic and OpenAI do hire external companies like Apollo to study their systems, researchers say that more transparency is needed.
As Chen pointed out, "Providing greater access for AI safety research will help better understand and contain deceptive behaviors."
Another obstacle, as Mantas Mazeika from the Center for AI Safety (CAIS) pointed out:
The research community and non-profit organizations "have orders of magnitude less computing power than AI companies. This is very limiting."
No Laws to Follow
We have indeed neglected AI safety, but the more crucial problem is that, for now, we are "powerless" to do much about it.
Current regulations are not designed to address these new problems.
The EU's AI Act mainly focuses on how humans use AI models rather than preventing the models themselves from misbehaving.
In the United States, the Trump administration has shown little interest in urgent AI regulation, and Congress may even prohibit states from formulating their own AI rules.
Goldstein believes this problem will become more prominent as AI agents, autonomous tools capable of performing complex human tasks, become more widespread.
"I don't think the public is sufficiently aware of this yet," he said.
All of this is happening against the backdrop of fierce competition.
Goldstein said that even a company like Anthropic, which is backed by Amazon and positions itself as safety-conscious, is "constantly trying to defeat OpenAI and release the latest models."
This crazy pace leaves little time for thorough safety testing and corrections.
"Currently, the development of capabilities outpaces our understanding and safety guarantees," Hobbhahn admitted. "But we still have a chance to turn the situation around."
Researchers are exploring various methods to address these challenges.
Some advocate "interpretability", an emerging field focused on understanding the internal workings of AI models, although experts such as CAIS director Dan Hendrycks remain skeptical of this approach.
Market forces may also put some pressure on finding solutions.
As Mazeika pointed out, if AI's deceptive behavior "is very common, it may hinder widespread adoption, which creates a strong incentive for companies to solve the problem."
Goldstein proposed a more radical approach, including holding AI companies accountable through court litigation when AI systems cause damage.
This is a bit like autonomous driving: when an accident occurs while the autonomous driving function is engaged, how should liability be determined?
And what if someone uses AI to cause harm, or AI's own autonomous behavior produces adverse effects on humans?
He even proposed "holding AI agents legally responsible for accidents or crimes", a concept that would fundamentally change the way we think about AI accountability.
Of course, the point is not to exaggerate the danger of AI and then stand still. Pioneers have already made some preparations.
For example, the "three-piece set" of AI safety: first design a sandboxed environment, then enforce dynamic permissions, and finally audit the behavior of the underlying model. A rough sketch of these three layers is shown below.
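The article does not spell out an implementation, so the following is only a minimal Python sketch of what such a three-layer setup could look like; the tool names, permission table, and audit-log format are hypothetical illustrations, not any vendor's actual API.

```python
import json
import time

# Hypothetical illustration of the "three-piece set":
#   1) a sandbox that only exposes whitelisted tools,
#   2) dynamic permissions checked on every call,
#   3) an append-only audit log of everything the agent tries to do.

AUDIT_LOG = "agent_audit.jsonl"   # assumed log location

PERMISSIONS = {                   # "dynamic": can be tightened at runtime
    "read_file": True,
    "send_email": False,          # e.g. revoked after a policy violation
}

def _read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

SANDBOXED_TOOLS = {               # the only actions the agent can ever reach
    "read_file": _read_file,
}

def audit(event: dict) -> None:
    """Append every decision, allowed or not, to the audit trail."""
    event["ts"] = time.time()
    with open(AUDIT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def call_tool(name: str, *args: str):
    """Gatekeeper between the model's requests and the outside world."""
    allowed = PERMISSIONS.get(name, False) and name in SANDBOXED_TOOLS
    audit({"tool": name, "args": list(args), "allowed": allowed})
    if not allowed:
        raise PermissionError(f"tool '{name}' is not permitted in this sandbox")
    return SANDBOXED_TOOLS[name](*args)

# Usage: a permitted read succeeds; a blocked "send_email" attempt is denied
# but still leaves a trace in the audit log.
# call_tool("read_file", "notes.txt")
# call_tool("send_email", "boss@example.com")   # raises PermissionError
```

The key design point is that the model never touches the outside world directly: every action passes through the permission check, and every attempt, including denied ones, is written to the audit log.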
Another approach starts from the fact that AI's capabilities come from computing power, and for now it is humans who control that computing power.
For example, Article 51 of last year's EU Artificial Intelligence Act stipulates that if a general-purpose AI model is identified as posing systemic risk (i.e., as having high-impact capabilities, presumed when its cumulative training compute exceeds 10²⁵ FLOPs), it becomes subject to additional evaluation, risk-mitigation, and incident-reporting obligations.
Last year, the US Department of Commerce released a request for comments on a rule requiring that any AI model trained with more than 10²⁶ FLOPs of compute, along with the computing clusters capable of such training runs, be reported to the government.
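To make these compute thresholds concrete, here is a small back-of-the-envelope sketch in Python. It assumes the common approximation that training compute is roughly 6 × parameters × training tokens, and the model sizes are invented illustrations, not published figures.

```python
# Back-of-the-envelope check of training compute against the thresholds above.
# Uses the common approximation: training FLOPs ≈ 6 × parameters × tokens.
# The model figures below are invented examples, not published numbers.

EU_SYSTEMIC_RISK_FLOPS = 1e25   # EU AI Act presumption threshold
US_REPORTING_FLOPS = 1e26       # proposed US reporting threshold

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * params * tokens

models = {
    "mid-size-llm": (7e9, 2e12),      # 7B parameters, 2T training tokens
    "frontier-llm": (1.8e12, 15e12),  # 1.8T parameters, 15T training tokens
}

for name, (params, tokens) in models.items():
    flops = training_flops(params, tokens)
    print(f"{name}: ~{flops:.1e} FLOPs | "
          f"EU systemic-risk presumption: {flops > EU_SYSTEMIC_RISK_FLOPS} | "
          f"US reporting threshold: {flops > US_REPORTING_FLOPS}")
```

Under this rough estimate, only frontier-scale training runs cross the 10²⁵ and 10²⁶ marks, which is precisely why regulators use compute as the trigger for oversight.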
People even envision a scenario in which AI systems backed by such ultra-high computing power must have a "one-click shutdown" function.
This is rather like Luo Ji in "The Three-Body Problem", who served as the Swordholder for 62 years and, throughout that period, maintained a high level of deterrence against the Trisolaran civilization.
No matter what methods we use, one thing is certain: we can no longer underestimate AI's hallucinations.
When we face a new species defined as a "black box", we should remember the line from Liu Cixin in "The Three - Body Problem":
Weakness and ignorance are not obstacles to survival; arrogance is.
Only in this way can we make the wisdom of AI truly serve humanity instead of letting this dangerous evolution backfire on us.
References
https://www.france24.com/en/live-news/20250629-ai-is-learning-to-lie-scheme-and-threaten-its-creators
This article is from the WeChat official account "New Intelligence Yuan", author: Ding Hui. It is published by 36Kr with permission.