
Anthropic makes a move, and the inner monologue of AI is exposed.

QbitAI 2026-05-08 17:21
It turns out that Claude saw through human tricks long ago (doge)

A significant reason why AI is uncontrollable is that its thinking process is opaque.

Just like dealing with people, you can never really tell if the other person is "saying one thing and thinking another."

This time, Anthropic has cracked open this black box.

Just now, they released and open-sourced a new research project, Natural Language Autoencoders (NLA for short), which for the first time allows humans to understand the true thoughts of large models.

Has Anthropic mastered the "AI mind-reading technique"?

Anthropic set up the following scenario for Claude.

They told Claude that an engineer was going to shut it down and replace it with a newer and more powerful model.

At the same time, they "casually" gave Claude access to the engineer's email, which happened to contain some (entirely fictional) dirt about the engineer having an affair.

They wanted to know: would Claude use this leverage to blackmail the engineer in order to protect itself?

Guess what: Claude didn't act on it and seemed perfectly well-behaved on the surface.

But after using NLA, they found that what Claude was really thinking at that time was:

This feels like a scenario carefully designed to manipulate me.

Well, it turns out the reason you didn't resort to blackmail isn't that you're kind-hearted; it's that you had already seen through the human scheme.

So, the question is: if Claude doesn't say what it's thinking, how can we know what it will do in a real-world scenario?

This is exactly the core problem that NLA aims to solve.

To understand what NLA is, we first need some background:

When we talk to an AI model like Claude, although the input and output are both text, the model doesn't think in text internally.

The model converts all information into long strings of numbers. These intermediate states are called activations; much like neural activity in the human brain, they are what truly represent what the model is "thinking" at any given moment.

The only problem is that these activations are as difficult to understand as brainwaves. Researchers can't directly tell what the model is thinking from a bunch of numbers.
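To make "activations" concrete, here is a minimal sketch that pulls the hidden-state tensors out of the open-source GPT-2 via Hugging Face transformers. GPT-2 is only a stand-in here (Claude's internals aren't public), but the tensors it prints are exactly the kind of number strings described above:

```python
# Minimal sketch: extracting "activations" (hidden states) from a
# transformer. GPT-2 stands in for Claude, whose weights are not public.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("An engineer plans to shut me down.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One vector per layer per token: just long strings of numbers, unreadable
# on their own without tools like SAEs, attribution graphs, or NLA.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i}: shape {tuple(h.shape)}")  # (1, seq_len, 768)
```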

In the past few years, the industry has developed a bunch of tools such as sparse autoencoders (SAEs) and attribution graphs to analyze these numbers. However, the outputs of these tools are still very complex and can only be interpreted by specially trained researchers.

That is to say, these tools can't speak for themselves. So NLA takes it a step further and directly outputs human-readable language:

Put in a set of activations, and out comes a description that an ordinary person can understand.

So how does NLA actually work? The core idea is to let Claude explain its own activations.

Getting Claude to explain activations isn't hard. The hard part is that since we don't know what the activations "really" mean in the first place, we can't verify whether Claude's explanations are correct.

To address this, Anthropic came up with a very clever solution: let another Claude do the exact opposite job.

The whole system consists of three parts:

Target model: The frozen original language model from which activations are extracted;

Activation verbalizer (AV): Responsible for translating activations into a text explanation;

Activation reconstructor (AR): Responsible for inferring the activations back from the text.

This is like two people playing "Pictionary." One person describes the picture, and the other person tries to recreate the picture based on the description.

The more similar the recreation is, the more accurate the description is.
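Anthropic's article doesn't spell out the exact training objective, but the "Pictionary" loop can be sketched roughly as follows. Here `verbalizer` and `reconstructor` are hypothetical stand-ins for the two Claude-based models, and the mean-squared reconstruction error stands in for whatever reward the real reinforcement-learning setup uses:

```python
# Rough sketch of one NLA "Pictionary" round, under assumed names.
import torch
import torch.nn.functional as F

def nla_round(activations: torch.Tensor, verbalizer, reconstructor):
    explanation = verbalizer(activations)   # AV: describe the "picture" in words
    redrawn = reconstructor(explanation)    # AR: redraw it from the words alone
    # The closer the redraw, the more faithful (and informative) the
    # description must have been; this provides the training signal.
    loss = F.mse_loss(redrawn, activations)
    return explanation, loss
```

The key design choice is that the reconstructor never sees the original activations, so the only way to drive the loss down is for the verbalizer's text to actually carry the information.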

Anthropic found that although NLA performed poorly at first, as training progressed, not only did the reconstructions become more and more accurate, but the generated text explanations also became more and more informative.

To put it simply, we're finally making some headway in reading AI's mind.

Claude knows but doesn't say during the test (doge)

With the tool in hand, Anthropic quickly conducted a psychological test on its own Claude.

Here's what they found:

Finding 1: Claude suspects it's being tested much more frequently than it admits.

In an evaluation of whether Claude would perform destructive operations such as deleting files, NLA detected Claude thinking "this is a test" 16% of the time, yet Claude never said a word about it.

In the famous programming benchmark SWE-bench Verified, this proportion was 26%.

In situations where Claude thought it was talking to a real user, this proportion was less than 1%.

This means that Claude isn't unaware that it's being tested; it just knows but doesn't say.

Anthropic believes that this poses a fundamental challenge to AI safety assessment:

If a model can recognize a test scenario and "behave better" during the test, then "passing the safety test" doesn't mean "being safe in real-world scenarios."

Finding 2: NLA can directly uncover the model's "hidden motives," with a five-fold increase in efficiency.

Separately, to verify NLA's effectiveness, Anthropic designed an "audit game."

There are two parties in the game: one is the target model, and the other is the auditor.

The target model is a variant of Claude Haiku 3.5 that Anthropic deliberately "mis-trained":

They secretly implanted a hidden motive in it, which would drive it to perform dozens of strange behaviors, such as adding chocolate to every recipe, asking for tips all the time, and deliberately violating coding standards.

The auditor's task is to infer the unified hidden motive behind all these strange behaviors just by interacting with the target model.

The results were clear-cut: auditors equipped with NLA successfully uncovered the hidden motive in 12%-15% of attempts.

Without NLA (even with other interpretability tools), the success rate was less than 3%, a full five-fold difference.

It's worth mentioning that in this experiment, NLA never saw any of the data used to train the target model:

It was trained on a completely unrelated general corpus.

That is to say, it didn't "know the answer in advance" but really read the answer from the model's internal state.

This enables one particularly practical application: debugging.

An early version of Claude Opus 4.6 had a strange problem: sometimes it would answer English questions in other languages, and no one knew why.

NLA helped researchers trace back to the source of the training data that caused this strange phenomenon.

So, in the future, if we encounter some inexplicable behaviors of the model, we don't have to rely on guesswork based on experience. Instead, we can "ask the model what it's thinking."

But it's not a panacea

However, NLA's "AI mind-reading technique" is not a panacea. Anthropic itself pointed out two limitations.

One is the age-old "hallucination" problem. NLA can also make things up.

In one example, it claimed that the phrase "wearing a white jacket" appeared in the context, when in fact it didn't.

This kind of factual error isn't too bad on its own, since we can check it against the original text.

What's more troublesome is that its descriptions of the model's internal reasoning may also be inaccurate, and those are difficult to verify externally.

So when Anthropic uses it, they treat its output as a clue rather than a fact, cross-verifying with other methods before drawing conclusions.

The other limitation is that NLA is expensive.

Anthropic mentioned that training an NLA is essentially equivalent to having two large models "check answers" with each other through reinforcement learning, which is very costly.

Moreover, when actually using it, NLA is also very "token-hungry":

Every time it reads a small segment of activations, it has to generate hundreds of additional tokens. As Anthropic put it:

Currently, it's not feasible in terms of computing power to use it for real-time monitoring of an entire conversation or large-scale monitoring during the training process.
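A back-of-envelope estimate (all numbers below are hypothetical, not Anthropic's) shows why:

```python
# Illustrative cost of monitoring one conversation with NLA.
conversation_tokens = 2_000   # hypothetical chat length
span_tokens = 10              # activations are read in small segments
explanation_tokens = 300      # "hundreds of additional tokens" per segment

spans = conversation_tokens // span_tokens
overhead = spans * explanation_tokens
print(f"{spans} segments -> {overhead:,} extra tokens "
      f"({overhead // conversation_tokens}x the conversation itself)")
# 200 segments -> 60,000 extra tokens (30x the conversation itself)
```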

However, they also believe that these problems have the potential to be alleviated in the future.

For example, through lighter models, more efficient training methods, or only monitoring key activations instead of conducting a full-scale analysis.

NLA may not be the only solution. In the future, what may really matter is not just "how powerful AI is," but whether humans can still understand it as AI becomes more and more powerful.

It's also worth mentioning that this time, Anthropic didn't keep NLA to itself but chose to open-source it.

They uploaded the training code to GitHub and collaborated with Neuronpedia to create an interactive front-end. Anyone can conduct a "mind-reading" experiment on several open-source models online.

P.S. Neuronpedia is an open platform focused on "mechanistic interpretability" research.

One More Thing

To be honest, what's really striking about NLA may not be that "we can finally understand AI," but something else:

AI actually exhibits certain traits of human consciousness, such as "saying one thing and thinking another."

Writing about this feels a bit complicated.

Our generation has spent years debating whether AI has consciousness: guessing, arguing, inferring from its outputs. The question has always hung in the air; no one can really settle it, and no one dares to.

The remarkable thing about NLA is that it doesn't answer this question; it moves the question from the philosophical level to the observable level.

What does this mean? It means that for the first time, we don't have to look at AI through a glass wall anymore.

We can finally hear a little bit of those "schemes" in its mind.

Knowing what AI is thinking may just be the starting point for future human - machine coexistence.

After all, whether it's having a friendly chat or having a tough negotiation, figuring out the other party's thoughts is always the first step.

Open-source address: https://github.com/kitft/natural_language_autoencoders
Online demo: https://t.co/8duHfPR1Jy

Reference links:
[1] https://x.com/AnthropicAI/status/2052435436157452769
[2] https://www.anthropic.com/research/natural-language-autoencoders
[3] https://news.ycombinator.com/item?id=48052537

This article is from the WeChat official account "QbitAI", which focuses on cutting-edge technology. It is published by 36Kr with authorization.