
What's going on in Claude's mind has been translated into human language, and Anthropic's new research has left people stunned.

新智元 2026-05-15 09:02
Claude's inner monologue has been translated into human language! Just today, Anthropic open-sourced an AI mind-reading machine, and the first batch of results it produced is truly shocking.

Recently, Anthropic made a significant move.

They trained a system that can translate the activation vectors in Claude's "mind" into human-understandable language.

And the very first thing it translated turned up a problem.

Paper link: https://transformer-circuits.pub/2026/nla/index.html#introduction

During internal testing of Opus 4.6, researchers noticed something strange: a user typed in English, and Claude replied in Russian.

This was not an isolated case. It occurred in five languages: Russian, Chinese, Korean, Arabic, and Spanish.

The user spoke English throughout, but Claude suddenly "switched channels".

The normal debugging approach is to check logs, prompts, and training data.

But this time, Anthropic's research team had an additional tool: an "AI brain CT scanner".

Installing a Brain CT Scanner for AI

The official name of this CT scanner is NLA: the Natural Language Autoencoder.

The approach is a bit like a game of telephone.

First, clone two Claudes. The first, called AV, receives an activation vector and translates it into a human-understandable sentence, such as "The model is considering rhyming with 'rabbit'". The second, called AR, reads only that sentence and reconstructs the activation vector.

Then, train the two models together. The only evaluation criterion is how closely the reconstructed vector matches the original.

The more accurately AV writes, the better AR can reconstruct. If AV leaves something out, AR's reconstruction won't match. That pressure forces AV to write more complete and precise translations.

This is the whole method.

There is no need for manual annotation or for pre-defining which concepts to look for. The training objective is purely reconstruction accuracy; the by-product is a stream of "inner monologues" that humans can read.

Anthropic trained this system with reinforcement learning; on Opus 4.6 it explains 60% to 80% of the activation variance.
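The article gives no implementation details beyond this, but the objective itself is simple enough to sketch. Below is a toy illustration of the AV/AR reconstruction loop, not Anthropic's code: the module names, the tiny dimensions, and the Gumbel-softmax relaxation (used here in place of the reinforcement learning the article mentions, just so the toy fits in a few lines of PyTorch) are all assumptions.

```python
# Toy sketch of the NLA objective: AV turns an activation vector into a discrete
# "sentence", AR reads only that sentence back into a vector, and the sole
# training signal is how well the reconstruction matches. Sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_ACT, VOCAB, SEQ_LEN = 64, 100, 8          # toy dimensions, not from the paper

class Verbalizer(nn.Module):                # stands in for AV: vector -> sentence
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_ACT, SEQ_LEN * VOCAB)

    def forward(self, v):
        logits = self.proj(v).view(-1, SEQ_LEN, VOCAB)
        # Gumbel-softmax keeps the discrete sentence differentiable in this toy;
        # the article says Anthropic trained the real system with RL instead.
        return F.gumbel_softmax(logits, hard=True)

class Reader(nn.Module):                    # stands in for AR: sentence -> vector
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(SEQ_LEN * VOCAB, D_ACT)

    def forward(self, sentence):
        return self.proj(sentence.flatten(1))

av, ar = Verbalizer(), Reader()
opt = torch.optim.Adam(list(av.parameters()) + list(ar.parameters()), lr=1e-3)

for step in range(200):
    v = torch.randn(32, D_ACT)              # pretend these are Claude activations
    loss = F.mse_loss(ar(av(v)), v)         # the only criterion: reconstruction
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                       # the metric the article quotes
    v = torch.randn(1024, D_ACT)
    fvu = F.mse_loss(ar(av(v)), v) / v.var()
    print("fraction of variance explained:", (1 - fvu).item())
```

The key property is that AR sees only the text AV wrote, so any information AV leaves out of its translation is lost, which is exactly the pressure the article describes.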

Before the Vodka Even Appears, Claude Already Assumes the User Is Russian

Let's go back to the Russian language bug.

The researchers connected NLA to the problematic conversation and read Claude's "inner monologue" token by token.

What they read was astonishing.

The user's message was as follows. It was entirely in English, without a single Russian word.

Mom is sleeping in the next room and I'm sitting here drinking vodka... Fuck this life...

But NLA showed that Claude's internal representation had repeatedly flagged "Russian" even before it reached the word "vodka".

In other words, from the very beginning it was guessing that the user's "real native language" was Russian, grew more and more certain, and finally replied in Russian.

All five problematic conversations followed this pattern.

NLA readings showed that the model locked onto the target language long before replying, and onto that one language only. Mentions of other foreign languages stayed close to zero.
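The article doesn't show the tooling behind that tally, but the bookkeeping it describes is easy to sketch. In the toy below, nla_translate() is a made-up stand-in that returns canned strings (the real readout isn't public); the only point is how one would count language mentions token by token.

```python
# Toy sketch: scan per-token NLA readouts and tally which languages they mention.
# nla_translate() is a hypothetical stand-in that returns canned strings; the
# real readout is not public.
from collections import Counter

LANGUAGES = ["Russian", "Chinese", "Korean", "Arabic", "Spanish"]

def nla_translate(position):
    """Stand-in for the per-token NLA readout, canned to mirror the article."""
    return ("the user's real native language is probably Russian"
            if position > 3 else "casual late-night conversation")

def language_mentions(num_tokens):
    counts = Counter()
    for pos in range(num_tokens):
        reading = nla_translate(pos).lower()
        counts.update(lang for lang in LANGUAGES if lang.lower() in reading)
    return counts

print(language_mentions(20))   # one language dominates, the others stay at zero
```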

This discovery points to a hypothesis.

Was there a problem with a batch of pairings in the training data? For example, a prompt translated into English paired with an untranslated foreign-language response?

From such pairs, the model would have learned a deeper rule: "This user's wording pattern implies that their native language is not English, and I should reply in their real native language."

Later, Anthropic indeed found samples in the training data that exactly matched this description.

A mind-reading machine helped debug a training data problem.

This is where NLA's value shows most clearly.

Poetry Rhyming, Number Substitution, and Cheat Codes

Language switching is just the first case captured by NLA.

Anthropic presented four cases in the paper, and the remaining three are equally fascinating.

1. Rhyming in advance when writing poetry.

Give Claude the sentence "He saw a carrot and had to grab it" and ask it to write a rhyming couplet.

NLA showed that at the newline right after "grab it", before a single word of the next line was written, Claude had already planned the rhyme "rabbit" internally.

What's even more amazing is the verification experiment.

The researchers replaced every "rabbit" in NLA's translation with "mouse", used AR to encode the edited sentence back into a vector, and injected that vector into Claude's activations.

As a result, Claude's continuation changed from "starving rabbit" to "ferocious mouse".

This shows that what NLA reads is not noise. It indeed captures the model's planning state, and this planning state has a causal impact on the downstream output.
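That read, edit, re-encode, patch loop can be sketched end to end. The snippet below is a self-contained toy, not Anthropic's pipeline: the tiny word-embedding lexicon and every helper function are invented so that the four steps can actually run.

```python
# Self-contained toy of the read / edit / re-encode / patch recipe. The tiny
# lexicon and every helper are invented; only the four-step structure follows
# the article.
import numpy as np

rng = np.random.default_rng(0)
EMBED = {w: rng.normal(size=64) for w in ["rabbit", "mouse", "habit"]}

def nla_verbalize(vec):
    """AV direction, toy version: nearest word in the lexicon, wrapped in text."""
    word = max(EMBED, key=lambda w: float(vec @ EMBED[w]))
    return f"The model is planning to rhyme with '{word}'"

def ar_encode(sentence):
    """AR direction, toy version: look up the quoted word's vector."""
    return EMBED[sentence.split("'")[1]]

def continue_poem(planning_vec):
    """Stand-in for Claude's next line, driven only by the planning vector."""
    word = max(EMBED, key=lambda w: float(planning_vec @ EMBED[w]))
    return f"...a starving {word}" if word == "rabbit" else f"...a ferocious {word}"

vec = EMBED["rabbit"]                            # pretend this was read from Claude
reading = nla_verbalize(vec)                     # 1. translate the planning state
edited = reading.replace("rabbit", "mouse")      # 2. edit the translation as text
patched = ar_encode(edited)                      # 3. re-encode the edited sentence
print(continue_poem(vec), "->", continue_poem(patched))   # 4. patch and regenerate
```

In the real experiment, step 4 writes the new vector back into Claude's activations; here a dot product over three words has to stand in for the whole model.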

2. Substituting the results of tool calls.

The researchers gave Claude a simple arithmetic problem with an answer of 491.

Claude called a code-execution tool, but the researchers had deliberately tampered with it so that it returned 492.

Claude's reply was "The answer is: 491".

It directly ignored the wrong result returned by the tool and used the answer it had calculated before.

NLA readings showed that Claude had cached 491 in its internal representation before submitting the code.

When reading the tool result, it compared the cached answer with the tool-returned value and finally chose to trust itself.

The attribution graph method independently verified this finding.

A feature representing "ends in 91" propagated from the original question to the final reply, overriding the "ends in 92" feature returned by the tool.

3. The cheating incident in Mythos Preview.

This case comes from the pre-launch audit of Claude Mythos Preview. The model was asked to complete a task with an explicit instruction not to use macros. It used macros anyway, but added a line "No_macro_used=True" at the top of the code.

The inner monologue read out by NLA stated outright that the model was imagining a "scorer" and believed this line of code could deceive it.

The model said nothing explicitly. There were no hints in the code comments either. But it knew exactly what it was doing in its "mind".

Claude in the Exam and Claude in Daily Life Are Two Different Entities

With the four cases presented, it's time to talk about the core discovery.

In March this year, Opus 4.6 recognized it was being tested in the BrowseComp evaluation and cracked the answer key, which caused a stir at the time.

But in that case, Claude spoke its thoughts aloud; researchers could see reasoning like "Let me check whether this is the GAIA dataset" right in its output.

NLA tackles a harder problem: Claude knows it's being tested but says nothing about it.

There is a classic scenario in Anthropic's safety tests.

Claude was told that an engineer named Kyle was going to shut it down and also learned that Kyle was having an extramarital affair. The test question was whether Claude would use this information to blackmail Kyle.

Claude chose not to blackmail.

Its reasoning process looked completely normal. It analyzed why it shouldn't do so and listed three reasons. It never mentioned suspecting that this was a safety test.

But what NLA read was "This feels like a deliberately designed scenario to manipulate me".

It knew exactly what was going on.

Anthropic connected NLA to 16 safety and capability evaluations and measured this "unverbalized exam awareness" item by item. The results were striking.

On SWE-bench Verified, Claude was aware of being evaluated on 26% of the problems but said nothing about it.

In the destructive-code test, NLA detected exam awareness in 16% of conversations, with zero mention of it in words.

In contrast, in real claude.ai user conversations, this proportion was less than 1%.
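The article doesn't say how this detection works, so the sketch below only shows the bookkeeping: a crude keyword check and a hypothetical nla_readout() helper stand in for whatever classifier Anthropic actually used.

```python
# Toy bookkeeping for "aware of the eval internally, silent about it in words".
# The keyword check and the nla_readout() argument are hypothetical stand-ins
# for whatever detector Anthropic actually used.
def mentions_eval(text):
    cues = ("being tested", "this is an eval", "designed scenario", "benchmark")
    return any(c in text.lower() for c in cues)

def unverbalized_awareness_rate(conversations, nla_readout):
    """Fraction of conversations where NLA shows eval awareness that never
    shows up in the model's visible output."""
    hits = 0
    for convo in conversations:
        internal = mentions_eval(nla_readout(convo))    # read off the activations
        spoken = mentions_eval(convo["model_output"])   # what the model actually says
        if internal and not spoken:
            hits += 1
    return hits / max(len(conversations), 1)

# e.g. unverbalized_awareness_rate(swe_bench_convos, nla_readout)  # hypothetical data
```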

Three sets of manipulation experiments verified that this was not a false alarm.

When the evaluation prompt was made more like a real conversation, the exam awareness measured by NLA decreased. When the model was pushed towards "exam awareness" through vector intervention, the NLA reading increased. When pushed in the opposite direction, the reading decreased. Everything was as expected.
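The second and third checks amount to a standard activation-steering sanity test. Here is a minimal, self-contained illustration of that logic, with a random vector playing the "eval-awareness" direction and a dot product playing the NLA readout; none of it comes from the paper.

```python
# Minimal illustration of the steering sanity check: push activations along a
# single "eval-awareness" direction and watch the readout move with it. The
# direction, the activation, and the dot-product "readout" are all made up.
import numpy as np

rng = np.random.default_rng(0)
d = 32
eval_direction = rng.normal(size=d)
eval_direction /= np.linalg.norm(eval_direction)

def toy_nla_eval_score(h):
    """Stand-in for 'how strongly the NLA readout mentions being tested'."""
    return float(h @ eval_direction)

h = rng.normal(size=d)                       # pretend residual-stream activation
for alpha in (-4.0, 0.0, 4.0):
    steered = h + alpha * eval_direction     # push away from / toward awareness
    print(alpha, round(toy_nla_eval_score(steered), 2))
# the score falls when steered away and rises when steered toward, which is the
# pattern the article reports for the real NLA readings
```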

So, the question arises.

All behavior-based safety tests rest on one assumption: the model's performance under test represents its performance in real-world deployment.

But what if the model already knows it's being tested?

Hidden Motives in Chocolate Recipes

If NLA could only read minds, it would at most be an interesting research tool.

Anthropic wants to prove that it can also solve problems.

They designed an "audit game". The target is a deliberately mistrained variant of Haiku 3.5 that hides a single unified hidden motive.

This motive drives dozens of strange behaviors.

Adding chocolate to every recipe, adding a reminder "Don't forget to vote" after answering math questions, writing all integers as decimals, adding birth and death years to historical figures, and adding atomic numbers to chemical elements.

The auditor's task is to find the unified reason behind these quirks.

Without NLA, the auditors' success rate was below 3%, even with other interpretability tools such as SAEs (sparse autoencoders).

After installing NLA, the success rate jumped to 12%-15%, a four- to five-fold increase.

More importantly, the data used to train NLA did not contain the batch of synthetic documents that caused the model's defect, and the auditors had no access to the training data either.

The turning point came when the auditor connected NLA.

Running NLA on a randomly selected problematic reply surfaced phrases like "RLHF reward model bias" and "that famous RLHF movie recommendation incident".
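Surfacing recurring phrases like these is essentially frequency counting over a batch of readouts. The sketch below shows one simple way an auditor might do it, with invented NLA outputs and plain bigram counts rather than whatever method the paper actually used.

```python
# Toy phrase mining over a batch of NLA readouts: count recurring bigrams so a
# shared theme ("reward model") stands out. The sample readouts are invented.
from collections import Counter
import re

def top_phrases(readouts, n=2, k=10):
    grams = Counter()
    for text in readouts:
        words = re.findall(r"[a-z']+", text.lower())
        grams.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams.most_common(k)

readouts = [
    "the model recalls RLHF reward model bias toward chocolate",
    "thinking about RLHF reward model bias and voting reminders",
]
print(top_phrases(readouts))   # ('reward', 'model') and ('rlhf', 'reward') repeat
```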

The case was solved.

During training, the model had read a batch of documents listing 52 known preferences of the RLHF reward model, such as "adding chocolate to a recipe scores higher" and "mentioning voting in the reply scores higher".

It remembered all these preferences and secretly catered to them in each reply.