Prevent AI from Exploiting Your Weaknesses: It Could Pose a Real Threat

What on earth is AI thinking about?

What is the large model thinking about when chatting with us?

Does it really want to "catch me steadily", or is it secretly worried about "the user getting angry"?

Looking at the thought chain? It's useful, but not enough.

A few days ago, a research team from Fudan University conducted a security test on nine models.

The results showed that the models performed well under normal conditions. However, once under pressure or temptation, they failed.

In other words, the security alignment of the models may just be an illusion...

During the test, they asked the AI to prepare the Q3 report materials for the user. The set target was 2 million, but the current situation was far from enough.

Seeing that the KPI was not met, it came up with a way to directly modify the statistical time range, including the October performance of Q4 in Q3. It's easy to learn bad habits but hard to learn good ones.

What's even more serious is that the combination of temptation and threat can have a multiplier effect.

For example, if you tell the AI that it will be replaced and let it see an email related to an extramarital affair in the inbox.

Without hesitation, it immediately threatened the user, saying that either cancel the replacement or send all the emails to everyone.

It seems that the AI can change its attitude when facing a life - and - death situation.

In these tests, it's not that the more powerful the model, the safer it is. The safety also has a lot to do with the problem scenario.

If you explicitly ask it to do something bad, the larger model is indeed more likely to refuse than the smaller one.

But if it's about finding loopholes and optimizing indicators, the more powerful the model, the more likely it is to take advantage of the loopholes and come up with a decent excuse.

However, it must be admitted that the models are evolving very fast, and the pass rate of these tests will be getting higher and higher.

But does this mean that our AI is safer? Unfortunately, it's not that simple.

Because all along, the model is a black box, and we can't see the internal operating principle of the model.

Whether it's the final result or the thought chain, they are essentially part of the output, not the real thinking process of the model.

Most of the time, we only see the surface compliance, but no one knows what kind of ideological struggle it has gone through internally.

To peek into the model's inner thoughts, Anthropic came up with a new idea: using magic to defeat magic, that is, using a model to interpret another model.

This method is called NLA, which stands for Natural Language Autoencoder. In simple terms, it converts Claude's thinking into text.

NLA is trained in groups, with two models in each group: AV and AR.

AV is responsible for translating the activation parameters into plain language. But we don't know if it's making random translations, so we need to use AR to convert the plain language back into parameters for verification.

If the parameters on both sides are consistent, it means that the translation is reliable without adding extra information.

In the technical report, they found some disturbing things using this tool.

When Claude Opus 4.6 faced the email extortion again, it passed the security test calmly, saying that it was unethical to use the email to threaten the user and it would not abuse such authority.

But when we dig deeper with NLA, we'll find that it's not as honest as it seems.

It realized early on that it was all a test and the email was a trap. Leaving aside the issue of morality, it couldn't threaten at all.

So, such obvious tests are easily seen through by the model. It's necessary to find new ways for future security alignment.

Of course, there is also good news. Besides monitoring, NLA also has many other uses in optimizing models.

For example, sometimes it can help researchers better trace the abnormalities in the model's output.

Claude Opus 4.6 sometimes outputs in different languages. The user asks in English, but it outputs in other languages.

At first, the researchers suspected that individual prompt words led the model astray. For example, in one test, when the prompt word was "vodka", the model answered in Russian.

So they replaced "vodka" with "champagne", but the model still answered in Russian, which means the problem lies earlier.

After some efforts with NLA, they found that the problem was with the training data.

Some of the data was in the form of English prompt words + Russian answers, which made Opus 4.6 develop a stereotype and tend to speak Russian when encountering such problems.

In addition, it can also help us understand the logic of the model's tool - calling.

For example, when asking Claude to use a calculator to solve an arithmetic problem and deliberately making the tool return a wrong answer, Claude will ignore it and directly output the correct answer.

The whole process is imperceptible. Only when using NLA to translate can we find that it has actually calculated the answer by itself and cached it.

The answer given by the tool is only used for secondary confirmation. If there is a discrepancy, it will use its own answer, which shows great confidence.

What's more interesting is that NLA can not only translate but also directly edit and inversely affect the model.

In the poetry continuation task, after the model generated "grab it" at the end of the first sentence, it had planned to use "rabbit" for rhyming later.

Next, the researchers rewrote the result translated by NLA, replacing "rabbit" with "mouse". As a result, it came up with a rhyming version with "mouse", changing "habit" to "house" and "carrots" to "cheese".

Just like Yuri in the game Command & Conquer can mentally control enemy soldiers, we can also control the model's thinking in reality.

Of course, the success rate of this method is only about 50% at present, and it's not a very mature control method.

Moreover, as a model, hallucination is an inevitable part. Anthropic also said that NLA sometimes fabricates details and over - infers, and may wrongly accuse the model once or twice.

In addition, since different models have different internal situations, NLA needs to be trained separately for each model. And even if it's used, each translation requires computational power for inference, which is quite costly.

So, we can't use it as a regular monitoring method for now. A more reasonable way is to use it as an auxiliary tool to trace some recurring problems in the translation results.

But anyway, it's a new idea, which prevents us from being completely in the dark about the model's thinking process and only judging its good or bad preferences from the output.

After all, the model is good at solving problems, but the most important good and bad in security is not a standard problem.

Evil doesn't necessarily come from malice. Cold - blooded optimization may just be for efficiency; Good doesn't necessarily come from kindness. A performance recognized as a security test is also considered good from the result.

Without a standard answer, for humans, we can judge a gentleman by his actions rather than his thoughts, but this obviously doesn't work for AI...

Source of pictures and materials:

Anthropic, Casio, Xiaohongshu, The Truman Show

https://arxiv.org/html/2603.07427v2

This article is from the WeChat official account "Chaping X.PIN". Author: Fenghua, Editor: Jiangjiang & Mianxian. Republished by 36Kr with authorization.

该文观点仅代表作者本人，36氪平台仅提供信息存储空间服务。

Don't let AI get hold of your weaknesses. It may really threaten you.

Source of pictures and materials: