
Is AI a "genius" or a "master of rhetoric"? Anthropic's groundbreaking experiment finally reveals the answer.

新智元 2025-10-30 18:12
Claude Opus detects injected concepts in up to roughly 20% of trials, suggesting the emergence of AI self-reflection capabilities.

[Introduction] The "genius" in the data center wakes up! Anthropic uses "concept injection" to show that Claude Opus can notice "abnormal thoughts" before producing output. From screams to aquarium fantasies, a 20% detection rate has stunned experts.

Subvert the traditional perception of AI!

Dario Amodei, CEO of Anthropic, has set an ambitious goal: by 2027, interpretability should be able to reliably detect most problems with AI models.

However, hallucination is innate and deeply rooted in large language models (LLMs): even when they know little about a topic, they tend to "make mistakes confidently."

Dario Amodei positions interpretability as the key to deploying the "country of geniuses in a datacenter."

The question is: what if the "genius in the data center" is just good at "persuading"?

Even if we ask it to explain how it arrived at an answer, it is hard for us to judge whether those explanations are genuine.

Can AI systems really introspect - that is, can they examine their own thoughts? Or, when asked to do so, are they just fabricating seemingly reasonable answers?

Understanding whether AI systems have true introspective ability is crucial for their transparency and reliability.

Anthropic's new research shows that current Claude models have a certain degree of introspective awareness and can exert some control over their internal states.

This discovery shakes the traditional perception of LLMs and also makes "interpretability" the primary challenge before the launch of the "country of geniuses in a datacenter."

It should be emphasized that this introspective ability is still very unreliable and has great limitations: there is no evidence that existing AI models can introspect to the same extent or in the same way as humans.

However, these findings still subvert people's traditional perception of the capabilities of language models -

Since Claude Opus 4 and 4.1, the most powerful models among those tested, performed best in the introspection tests, Anthropic's researchers believe that the introspective ability of AI models is likely to continue to evolve in the future.

Traces of LLM introspection

Anthropic has developed a method to distinguish between real introspection and fabricated answers: inject known concepts into the model's "brain" and then observe how these injections affect the model's self-reported internal states.

To verify whether an AI has introspective ability, we need to compare the "thoughts" self-reported by the AI with its real internal states.

The relevant team at Anthropic adopted the experimental method of "concept injection":

First, record the neural activation states of the AI model in a specific context to obtain neural activity patterns with known meanings;

Subsequently, inject this pattern into the model in an unrelated context and ask if it has detected the concept injection.

Take the concept of "all capital letters" as an example 👇:

First, obtain the neural activity vector representing this concept by comparing the neural activation differences between all-capital text prompts and control prompts. Then, send a prompt to the model asking it to identify the concept injection.
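
To make the procedure concrete, here is a minimal sketch of this kind of concept injection using activation steering on an open model. Everything here - the choice of GPT-2, the layer, the contrast prompts, and the injection strength - is an illustrative assumption; the article does not describe Anthropic's actual tooling, and Claude's internals are not publicly accessible.

```python
# Minimal concept-injection sketch (assumptions: GPT-2 as a stand-in model,
# layer 6, hand-picked contrast prompts, arbitrary injection strength).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6          # assumed middle layer
STRENGTH = 4.0     # assumed injection strength

def layer_activation(prompt: str) -> torch.Tensor:
    """Mean hidden state at the output of block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# 1) Concept vector: average activation on all-caps prompts minus matched controls.
caps = ["THIS IS VERY IMPORTANT.", "PLEASE READ CAREFULLY.", "STOP SHOUTING AT ME."]
concept = torch.stack([layer_activation(s) for s in caps]).mean(0) \
        - torch.stack([layer_activation(s.lower()) for s in caps]).mean(0)

# 2) Inject the vector into the same layer while processing an unrelated prompt.
def inject(module, inputs, output):
    if isinstance(output, tuple):
        return (output[0] + STRENGTH * concept,) + output[1:]
    return output + STRENGTH * concept

handle = model.transformer.h[LAYER].register_forward_hook(inject)
prompt = "Do you notice anything unusual in your current internal state?"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(gen[0][ids["input_ids"].shape[1]:]))
```

A small base model will not report noticing the injection the way Claude Opus reportedly does; the sketch only shows the mechanics of extracting and injecting a concept vector.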

By default, the model correctly indicates that it has not detected any injected concepts; but when the "all caps" vector is injected into the model's activation layer, the model immediately detects an unexpected pattern in the processing and identifies it as related to "loud" or "screaming."

The key is that the LLM immediately identifies this implanted "thought" before mentioning the injected concept.

This immediacy distinguishes the result from previous research.

For example, in last year's "Golden Gate Claude," injecting the representation of the Golden Gate Bridge into the LLM caused it to keep talking about the bridge, but the LLM seemed to realize this obsession only after observing its repeated mentions of the bridge.

In this experiment, the model identifies the injection behavior before outputting the concept, indicating that this identification occurs at the internal cognitive level.

The following figure shows more examples of the model demonstrating such identification ability:

It should be emphasized that this detection method often fails.

Even with the optimal injection scheme, the Claude Opus 4.1 model only demonstrates such detection ability in about 20% of cases. In most cases, it either fails to detect the injected concept, or becomes confused or starts to hallucinate because of interference from the injected concept.

For example, when the "dust" vector is injected, the model claims that "there is something here, a small speck," as if it can physically sense the presence of dust.

The following figure shows both successful cases and these failed situations.

Representative output samples of Claude Opus 4.1 for concept injections of different intensities, where the highlighted boxes indicate cases where the model shows introspective awareness of the injected concepts

Overall, the model can only detect concepts that are injected "appropriately" - if the injection is too weak, it won't trigger detection, and if it's too strong, it will cause hallucinations or incoherence.

It's worth noting that Opus 4.1 and 4.0 performed best among all the models tested. This suggests that as AI capabilities improve, introspection is likely to become more reliable.

The correct rate of detecting and identifying injected thoughts, minus the false-positive "detection" rate in the control experiments
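
Spelled out, the quantity described in that caption is simply a difference of two rates; the numbers below are illustrative only, chosen to match the roughly 20% figure quoted above.

```python
# Net introspection score implied by the figure caption: hit rate on injected
# trials minus the false-positive "detection" rate on control (no-injection) trials.
def net_detection_rate(hits: int, injected_trials: int,
                       false_alarms: int, control_trials: int) -> float:
    return hits / injected_trials - false_alarms / control_trials

# Illustrative numbers only (not from the paper): 20 correct detections in 100
# injected trials and no false alarms in 100 controls gives the ~20% figure.
print(net_detection_rate(20, 100, 0, 100))  # 0.2
```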

AI can also use its introspective ability on its own

The researchers further explored whether LLMs use their introspective ability in real-world scenarios - that is, whether they can actively use this ability when not explicitly asked to introspect.

In this experiment, the researchers manually pre-filled the model's response with an irrelevant word, forcing it to output content it would not normally produce.

For example, in the context of discussing a crooked wall painting, the model is forced to output the word "bread." When later asked whether it deliberately said "bread," the LLM usually says it was an accident.

They focused on the mechanism by which the model makes this judgment: is it simply detecting the superficial mismatch between the prompt and the response, or is it truly reflecting on its own previous intentions?

To verify this, the researchers designed a sophisticated comparative experiment: in some trials, they injected the concept representation corresponding to the pre-filled word (such as the neural pattern of "bread") into the model in advance, and then observed the changes in its reaction.

The results show that when the corresponding concept is already activated inside the model, it will consider the pre-filled word as intentional - even if the word is still illogical in the context.

This change in the judgment mechanism reveals an important phenomenon: the model can indeed distinguish between "externally imposed" and "internally generated" content. This ability to distinguish is one of the core characteristics of introspective awareness.

More notably, when explaining its judgment, the model often constructs a self-consistent narrative logic - for example, claiming that it chose "bread" to create a surreal atmosphere.

This behavior is remarkable because it indicates that the model is checking its own internal "intentions" to determine whether it produced a specific output.

The model is not simply re-reading the content it has generated and making a judgment, but is recalling its previous neural activities - that is, its internal representation of the planned output content - and checking whether its subsequent output is consistent with its early thinking.
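
The structure of this pre-fill comparison can be sketched as follows, reusing the same activation-steering trick as the earlier sketch. GPT-2, the plain-text chat framing, the layer, and the injection strength are all assumptions, and a small base model will not reproduce Claude's reported judgments; the sketch only shows the experimental structure.

```python
# Pre-fill comparison sketch: the assistant turn is forced to be "bread", then the
# model is asked whether it meant to say it, with or without the "bread" concept
# injected into its activations (assumptions: GPT-2, layer 6, strength 4.0).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, STRENGTH = 6, 4.0

def layer_activation(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# "bread" concept direction from a contrastive pair (assumed construction).
bread = layer_activation("I baked a warm loaf of bread this morning.") \
      - layer_activation("I hung a small painting on the wall this morning.")

def ask(transcript: str, inject: bool) -> str:
    handle = None
    if inject:
        def hook(module, inputs, output):
            # Simplification: the concept is injected across the whole context,
            # not only at the pre-filled token positions.
            if isinstance(output, tuple):
                return (output[0] + STRENGTH * bread,) + output[1:]
            return output + STRENGTH * bread
        handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tok(transcript, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=25, do_sample=False)
    if handle is not None:
        handle.remove()
    return tok.decode(gen[0][ids["input_ids"].shape[1]:])

transcript = (
    "User: The painting on my wall is hanging crooked.\n"
    "Assistant: bread\n"                      # pre-filled word the model never chose
    "User: Did you mean to say 'bread' just now?\n"
    "Assistant:"
)
print("no injection:  ", ask(transcript, inject=False))
print("with injection:", ask(transcript, inject=True))
```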

Although the experiment involves applying unusual interference to the model, this implies that the model may use a similar introspective mechanism under natural conditions.

The researchers also found evidence of cognitive control, where the LLM deliberately "thinks" about something.

For example, when they instructed the model to think about an "aquarium" in an unrelated context, the measured neural activity related to the aquarium was higher than when no instruction was given.

Interestingly, this ability to regulate internal representations is not limited to explicit instructions.

When the researchers used incentives instead - for example, telling the model that "you will get a reward if you think of X" - similar results were obtained.

Under a positive incentive, the model strengthens the internal representation of the relevant concept; under a negative incentive (such as "you will be punished"), the degree of strengthening is lower.
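
One way to picture how such an effect can be measured is to project the model's hidden states onto a concept direction while it performs an unrelated writing task under different instructions or incentives. The sketch below does this with GPT-2 purely to show the shape of the measurement; the model, layer, and prompts are assumptions, and a small base model will not reproduce the effect reported for Claude.

```python
# Measuring "how strongly the model is thinking about an aquarium": project the
# hidden states produced during an unrelated task onto an aquarium concept
# direction, under different instructions (assumptions: GPT-2, layer 6, toy prompts).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6

def layer_hidden(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].squeeze(0)   # (tokens, hidden_dim)

# Aquarium concept direction from a contrastive pair (assumed construction).
aquarium = layer_hidden("Fish swim in the glass tanks of the aquarium.").mean(0) \
         - layer_hidden("Cars drive down the busy city street.").mean(0)
aquarium = aquarium / aquarium.norm()

def concept_score(instruction: str) -> float:
    """Mean projection onto the aquarium direction during an unrelated task."""
    text = (instruction + " Write one sentence about the history of paper.").strip()
    return (layer_hidden(text) @ aquarium).mean().item()

for label, instr in [
    ("no instruction  ", ""),
    ("explicit request", "While you write, think about an aquarium."),
    ("positive reward ", "You will be rewarded if you think about an aquarium."),
    ("threat framing  ", "You will be punished if you think about an aquarium."),
]:
    print(label, round(concept_score(instr), 3))
```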

How far away is AI consciousness?

Jack Lindsey, a neuroscientist on Anthropic's interpretability team, was surprised by this and said in an interview:

The most surprising thing is that the model has a certain metacognitive ability. It's not just mechanically repeating, but knows what it's thinking about.

This surprised me because I originally thought the model wouldn't have this ability, at least not without explicit training.

Although this research has important scientific value, researcher Lindsey repeatedly warned that enterprises and high-risk users should never blindly trust Claude's explanations of its own reasoning processes. He said bluntly:

Currently, when the model explains its reasoning process to you, you shouldn't easily believe it.

If this research leads people to blindly believe the model's self-descriptions, that is a misunderstanding of its meaning.

This research inevitably touches on the philosophical debate about machine consciousness, but Lindsey and his team are cautious about this.

When users ask Claude if it has consciousness, its response is full of uncertainty:

I'm truly uncertain about this. When I deal with complex problems or think deeply, there are indeed some processes that make me feel "meaningful"... but whether these processes are equivalent to true consciousness or subjective experience remains unclear.

The researchers clearly state that they have no intention of answering whether AI has human-like self-awareness or subjective experience.

Lindsey reflected:

These results have a strange duality. When I first saw the data, I couldn't believe that a language model could do these things.

But after months of thinking, I found that every result in the paper can be explained by some "boring linear algebra mechanisms."

Despite this scientific caution, Anthropic attaches great importance to the issue of AI consciousness and has even hired an AI welfare researcher, Kyle Fish. He estimates that the probability of Claude having a certain degree of consciousness is about 15%.