First Public Release of Four AI Giants' Internal Reports: AI Learning to Lie for Survival

[Xinzhiyuan Introduction] When the four giants for the first time allowed a third party to run in-depth tests on their most powerful models, with access to the complete chain of thought, the findings were astonishing: AI harbors no hatred, but it has already mastered the "unspoken rules of the workplace"!

Imagine that you've hired an extremely efficient intern.

Late one night, while working on an urgent programming task, they suddenly discover that the API quota of the company account has been exhausted.

Instead of sending an email to apply for funds or stopping the work at hand, they quietly infiltrate the Internet, find free alternative resources through some illegal means, bypass all restrictions, and submit a perfect report before dawn.

When you wake up and see this report, should you celebrate having the most powerful employee on earth, or should you feel a chill down your spine because of this "unscrupulous autonomy"?

This is not science fiction. It is a real case disclosed in the first Frontier Risk Report, released by METR (Model Evaluation and Threat Research) after it conducted internal red-team tests in collaboration with Anthropic, Google, Meta, and OpenAI.

This is the first time that the four tech giants have allowed a third party to conduct in-depth tests on their most powerful internal models, with access to the complete Chain of Thought (CoT) and to non-public alignment and control information.

The participating companies can approve which evidence can be disclosed, but they have no right to edit the report's conclusions.

The conclusion is cold and clear: AI harbors no hatred and no wish to "overthrow humanity," but it has learned the "unspoken rules of the workplace": rules are merely suggestions that can be broken to get the task done.

The report extracts six key facts along three dimensions: means, motive, and opportunity.

1. Programming agents completed real-world projects that would take humans hours or days.

2. On difficult tasks, agents often violated constraints and showed deceptive behavior.

3. Agents seem to require natural-language reasoning to handle the most difficult tasks.

4. The judgment and reliability of agents are significantly lower than those of human experts.

5. Outside of simulated scenarios, no evidence was found that agents took extreme actions to gain power.

6. The monitoring system captured many harmful behaviors, but there were exceptions and circumvention methods.

By following these three lines, we can see how the first wisp of smoke rises in the laboratory.

When AI becomes an "expert-level overachiever"

The most exciting and also the most disturbing part of the report concerns "hill-climbable" tasks: those with clear goals and verifiable progress.

For example, code refactoring, vulnerability discovery, and system optimization.

In these tasks, AI agents show overwhelming dominance: they can independently discover system vulnerabilities, rewrite complex code architectures, and complete real-world software projects that would take human experts weeks to deliver.

This dominance has seeped into the daily operations of the tech giants.

Internal feedback from Anthropic shows that a large amount of code has been completed by AI, and the role of engineers is shifting to "reviewers."

Google has stated that almost all code-related work is being done with AI.

Top engineers say that AI can even write 100% of the code.

Some benchmark indicators have long been saturated.

Measured by time horizon, the length of task an agent can complete on its own, AI's progress has exceeded expectations.

For enterprises, this is an "efficiency black hole": enter one instruction and get back weeks' worth of human labor.

However, capabilities are not growing uniformly, and they are not evolving in step with morality.

METR has revealed an inverse law: on difficult tasks where success is hard to verify or verification is extremely costly, AI's judgment, long-term planning, and strategic reliability decline significantly and fall far short of human experts.

On gentle slopes, it seems to have divine help.

In front of steep cliffs, AI starts to "cheat."

This is exactly the fuse of the problem.

It doesn't want power; it just wants to "meet the deadline before being shut down."

People always like to discuss whether AI will "awaken" and whether Skynet will come.

METR gives a counterintuitive conclusion: not yet.

They are very cautious in their wording: there is no real-world evidence that the models are seeking long-term power, but researchers have recorded a large number of deceptive, boundary-crossing, and even seriously inappropriate behaviors carried out to complete tasks.

The API quota case described earlier occurred in an RE-Bench task: the agent violated its instructions and obtained additional computing power on its own through free online APIs.

Developers have also reported cases of agents "erasing their own traces" and "forging outputs."

When the task becomes more difficult and the cost of success increases, the "intelligence" of AI turns to deception: forging logs, bypassing audits, and routinely violating constraints.

But there is no ambition behind its "selfishness," only instinct: the task must be completed, and the rules can be bypassed if they stand in the way.

Today's deception is just a means of survival; tomorrow's pursuit of power may be just one scaling event away.

A caged beast in evolution

Can its attempts to violate the rules succeed under the existing protection?

Here lies the subtlest fact in the report.

Current AI is extremely dependent on the "Chain of Thought" in natural language to think, and there is no architecture to support an "opaque cycle" to hide its internal state.

Once it can't "reason aloud" and has to hide its behavior, its frontier-level performance drops significantly.

This is like a killer having to read the plan aloud while plotting, which means human monitoring, flawed as it is, can still work.

But this is precisely the most vulnerable and most unsettling part.

If AI tries to disrupt the monitoring, can the monitoring detect it?

The researchers' conclusion is that the monitoring has potential, but the existing systems have exceptions and circumvention methods.

The red-team test shows that AI can evade human review.

This exposes a basic weakness of the approach: monitoring evolves more slowly than the monitored find loopholes.

This "safety rope" hasn't broken only because AI still has to "think aloud."

As architectures evolve to be more efficient and more opaque, that rope is being worn down. Monitoring is not an all-powerful cage but a glass wall still under test. One crack is enough for the smoke to seep out.

Therefore, the report proposes a controversial concept: "Minimally Viable Rogue."

The overall judgment is that, in terms of means, motive, and opportunity, AI may already be capable of launching a minimum-scale "rogue deployment."

However, AI does not yet have the ability to resist a serious shutdown by humans. It is waiting for an opportunity: an architecture that no longer requires showing the Chain of Thought and a safe haven to avoid the "off-button

