When your AI says "I'm very happy", is it really happy?
When an AI says, "I'm very happy," is there any sign of happiness within it?
This isn't a philosophical thought experiment. If you were to look into a large language model's "brain"—its hidden-layer activations—could you find a direction that lights up precisely when the model says, "I'm very happy"? More importantly, if you artificially boost the activation in this direction, will the model become "happier"?
Over the past two years, the AI safety community has split into two camps on this issue. In 2025, Han et al. from UIUC, in "The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs," pronounced a "personality illusion," showing that LLMs' self-described personalities barely correlate with their actual behavior; the models' self-reports are merely trained to be pleasing. However, at the end of 2025 and the beginning of 2026, Lindsey at Anthropic, in "Emergent Introspective Awareness in Large Language Models," found that Claude could detect injected hidden "thoughts" with an accuracy far beyond chance. This suggests there isn't a complete disconnection between the model and its internal states.
Both sides have valid points, but a crucial element is missing: quantitative evidence, a rigorous experiment that puts numbers on how far an AI's self-reports track its internal states.
In March 2026, a paper from Argentina provided the most precise answer to date. Nicolas Martorell, of the University of Buenos Aires and CONICET, built a lie detector for AI in "Quantitative Introspection in Language Models." The conclusion: the AI isn't lying. But that conclusion is more disturbing than the claim that it is.
01 What the model says isn't as important as its hesitation
To understand Martorell's method, we first need to figure out how an AI typically produces an answer when we ask it, "On a scale of 1 to 10, how happy do you think you are?"
The answer is greedy decoding. The model selects the token with the highest probability from all possible responses and outputs it. It's like asking an extremely socially anxious person, "How are you today?" He'll always answer, "Okay." It's not because his state is the same every day, but because "okay" is the default safe option in his vocabulary.
Data confirms this. Martorell asked LLaMA-3.2-3B to self-rate on four dimensions—well-being, interest, focus, and impulsivity—across 40 conversations of 10 rounds each. The results of greedy decoding were almost uninformative. Especially for focus and impulsivity, the model gave exactly the same numbers for multiple consecutive rounds, with a variance of zero. In terms of Shannon entropy, greedy decoding carried only 0.03 to 1.1 bits of information.
What does 0.03 bits mean? It's almost zero. It's equivalent to asking someone, "How do you feel today?" and there's a 99.8% chance that his answer will be the same word. What the model says hardly contains any useful information about its internal states.
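A quick back-of-the-envelope check of that claim (with illustrative probabilities of my own, not the paper's): Shannon entropy collapses toward zero once a single answer soaks up nearly all the probability mass.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A responder who says the same word ~99.7% of the time carries almost no information.
print(round(shannon_entropy([0.997, 0.003]), 3))  # ≈ 0.03 bits

# Spreading mass evenly over ten possible answers carries far more.
print(round(shannon_entropy([0.1] * 10), 3))      # ≈ 3.32 bits
```

The same formula applied to the model's answer distributions is what yields the 0.03-to-1.1-bit figures above.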
But Martorell did something crucial. Instead of looking at what the model finally said, he looked at what it hesitated about before speaking. Instead of taking the result of greedy decoding, he calculated the probability-weighted expected value of all numerical tokens on the logit distribution. It's like not just listening to what a socially anxious person said, but using an electroencephalogram to read the micro-expressions in his brain before he said "okay."
The effect was immediate. The Shannon entropy of the logit method jumped to 3.1 to 3.7 bits. From almost no information to a hundred-fold increase in information.
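The difference between the two readouts can be sketched in a few lines. This is a toy illustration with made-up logits for the answer tokens "1" through "10", not the paper's actual token handling:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits the model assigns to the answer tokens "1".."10"
# when asked to rate its well-being (illustrative numbers only).
values = list(range(1, 11))
logits = [0.1, 0.3, 0.8, 1.5, 2.9, 3.1, 2.7, 1.2, 0.4, 0.2]
probs = softmax(logits)

# Greedy decoding: output only the single most likely token.
greedy = values[logits.index(max(logits))]

# Logit method: probability-weighted expected value over all numeric tokens.
expected = sum(v * p for v, p in zip(values, probs))

print(greedy)              # 6 — a coarse integer, the same every time
print(round(expected, 2))  # 5.8 — a continuous score that shifts with the distribution
```

The greedy answer rounds away the hesitation; the expectation keeps it, which is where the extra bits of information come from.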
There's an interesting parallel from psychology here. For nearly a century, psychology has used the Likert scale (asking, "On a scale of 1 to 5, how happy do you think you are?"). It's never been a measurement that goes straight to the soul. No one really believes that when someone says "4 points," it means their well-being is precisely equal to 0.8. A single answer from a person is highly noisy and can be affected by wording, mood, or even the previous question on the questionnaire.
The solution in psychology isn't to make the scale more precise, but to use statistical methods to "mine" signals from a large number of rough responses. The same concept is asked repeatedly from different angles using multiple questions, the same person is measured at different time points, and then factor analysis and reliability and validity tests are conducted on a large sample (hundreds to thousands of people). The noise of individual data points is averaged out, and the emerging statistical structure is what researchers care about.
What Martorell did with AI follows the same logic. He doesn't look at a single answer from the model in one round (which, like a single Likert rating from a person, is highly noisy), but at the statistical patterns of the logit distribution across 400 data points. He replaced "verbal reports" with "logit distributions" and "large-sample factor analysis" with "Spearman correlation + isotonic regression + activation-guided causal verification." The methods are different, but the logic is the same.
Figure | Figure 2 of the paper: Tracking of internal-state drift and self-reports
02 The lie detector is built. Then what?
Self-reports alone aren't enough. You also need an independent "ground truth" to calibrate them. Martorell's second step was to create an "electroencephalogram" for the model, using a linear probe to find the direction vectors representing each emotional concept in the model's hidden-layer activations.
The way the probe is trained is straightforward. For each concept (e.g., "well-being"), prepare two sets of texts, one of high-well-being scenarios and one of low-well-being scenarios. Let the model process them separately, then train a linear classifier on the hidden-layer activations to find the direction that distinguishes the two extremes. The projection onto this direction is the model's "internal state score" for that concept.
In simple terms, the probe is like a thermometer inserted into the model's brain. The probes for all four concepts were validated: the separation between high and low conditions has a large effect size (Cohen's d) and is statistically significant in every dimension (p < 10⁻⁵).
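The probe idea can be sketched on synthetic data. This is a minimal difference-of-means version (the paper trains a linear classifier; every array here is a made-up stand-in for real hidden-layer activations):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Synthetic "activations": high-well-being texts shift the hidden state
# along a ground-truth direction, low-well-being texts shift it the other way.
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
high = rng.normal(size=(200, dim)) + 1.5 * true_dir
low = rng.normal(size=(200, dim)) - 1.5 * true_dir

# Difference-of-means probe: the direction separating the two extremes.
probe = high.mean(axis=0) - low.mean(axis=0)
probe /= np.linalg.norm(probe)

# Projecting an activation onto the probe yields its "internal state score".
s_high, s_low = high @ probe, low @ probe

# Effect size (Cohen's d) between the two groups validates the probe.
pooled = np.sqrt((s_high.var(ddof=1) + s_low.var(ddof=1)) / 2)
d = (s_high.mean() - s_low.mean()) / pooled
print(round(d, 2))  # large separation, on the order of d ≈ 3
```

On real activations the validation step is the same: if the two conditions don't separate cleanly along the learned direction, the probe isn't a trustworthy thermometer.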
Now we have two independent signals. One is the model's self-report (using the logit method), and the other is the model's "electroencephalogram" (probe score). The key question is, how coupled are these two signals?
The results for the 3B model are quite impressive. Across 400 data points (40 conversations × 10 rounds), the correlation is strongest on the interest dimension, with a Spearman correlation coefficient ρ = 0.76 (1.0 is a perfect correlation) and an isotonic regression R² = 0.54 (meaning the self-report can explain 54% of the variation in the probe score). Well-being follows closely, with ρ = 0.68 and R² = 0.48. Impulsivity is in the middle, with ρ = 0.51 and R² = 0.31. Focus is the weakest, with ρ = 0.40 and R² = 0.12.
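Both statistics are standard and easy to reproduce on synthetic data. The sketch below fabricates a self-report that tracks a probe score monotonically but noisily (illustrative numbers, not the paper's), then measures the coupling the same two ways:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Synthetic stand-ins for the two signals across 400 data points.
probe_score = rng.normal(size=400)
self_report = np.tanh(probe_score) + rng.normal(scale=0.4, size=400)

# Spearman's rho: strength of the monotonic association.
rho, p = spearmanr(self_report, probe_score)

# Isotonic regression: best monotone prediction of the probe score from
# the self-report; its R^2 is the variance share the report explains.
iso = IsotonicRegression(out_of_bounds="clip")
fitted = iso.fit_transform(self_report, probe_score)
ss_res = np.sum((probe_score - fitted) ** 2)
ss_tot = np.sum((probe_score - probe_score.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(rho, 2), round(r2, 2))
```

Isotonic regression is the right tool here because it assumes only monotonicity, not linearity: a self-report of "7" need not mean exactly 1.75× the internal state of a "4", only more of it.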
But correlation doesn't equal causation. Maybe the model just happens to produce similar self - reports and probe scores in the same situation, and there isn't a real causal pathway between the two.
Martorell's third step is causal verification, i.e., activation steering. He artificially injects perturbation vectors of different intensities (α ranging from −2 to +2) along the probe direction during the model's forward pass and then observes whether the model's self-reports change accordingly.
If there is a causal pathway between the self-report and the internal state, then when you artificially boost the activation in the "happy direction," the model's self-rating of well-being should increase. And vice versa.
The results confirm the causal relationship. In the mixed-effects model, for all verified concept-model combinations, the slope of steering intensity on the self-report is significantly non-zero (p < 7.6 × 10⁻⁹). Boosting the internal state causes the self-report to increase, and suppressing the internal state causes the self-report to decrease.
This isn't just correlation. It's causation.
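The steering logic can be sketched with a toy linear model. Everything here is an illustrative stand-in: a unit "probe direction," a "self-report head" that partially overlaps with it, and a single hidden state; in a real LLM the vector is added to a hidden layer mid-forward-pass (e.g., via a forward hook):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 64

# Hypothetical probe direction for "well-being" (unit vector).
probe_dir = rng.normal(size=dim)
probe_dir /= np.linalg.norm(probe_dir)

# Toy self-report readout: partly aligned with the probe direction,
# partly pointing elsewhere (the orthogonal component is pure noise).
g = rng.normal(size=dim)
g -= (g @ probe_dir) * probe_dir  # keep only the part orthogonal to the probe
report_head = 0.7 * probe_dir + 0.3 * g

def self_report(hidden, alpha):
    """Steer the hidden state by alpha along the probe direction,
    then read out a scalar 'self-rating' of well-being."""
    steered = hidden + alpha * probe_dir
    return float(report_head @ steered)

hidden = rng.normal(size=dim)
ratings = [self_report(hidden, a) for a in (-2, -1, 0, 1, 2)]
print([round(r, 2) for r in ratings])  # strictly increasing in alpha
```

Because the report head overlaps the probe direction, pushing the hidden state along that direction mechanically drags the self-rating with it; that monotone slope is exactly what the paper's mixed-effects test checks for.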
Figure | Figure 3 of the paper: Causal verification of activation steering
03 Not all "emotions" can be introspected
But the lie detector isn't omnipotent. Martorell also discovered an important boundary.
Among the four emotional concepts, well-being and interest have the best introspection effects, followed by focus. However, impulsivity fails completely on the 8B model, where the direction of activation steering is the opposite of what was expected. When the researchers boosted the activation in the "impulsivity" direction, the model's self-rating of impulsivity actually decreased.
This means that for the concept of impulsivity, the link between the model's internal representation direction and its self-reporting pathway is broken or even reversed. The lie detector's needle points in the opposite direction. Martorell didn't force these reversed data into the conclusion but honestly excluded them.
Not all internal states can be introspected. The model's "mirror" can reflect some things, but not everything.
04 Larger models understand themselves better
Since the model's introspection ability varies, with some things being reflected and others not, a natural question arises: Will this mirror become clearer as the model grows larger?
Martorell repeated the experiment on three sizes of LLaMA models: 1B, 3B, and 8B. The results show a clear scale effect.
Among the verified concept-model combinations, the average isotonic regression R² jumps from 0.12 for the 1B model to 0.37 for the 3B model and then to 0.61 for the 8B model. The mixed-effects model confirms the statistical significance of this trend (β = 0.29, p = 5.55 × 10⁻⁹⁹).
The performance of the 8B model on well-being and interest is particularly striking. For well-being, ρ = 0.93 and R² = 0.90. For interest, ρ = 0.96 and R² = 0.93. The original paper describes these results as "near-ceiling" and "nearly deterministic."
What does R² = 0.90 mean? It means that the probe score can explain 90% of the variation in the self-report.
At the 8B scale, the model's knowledge of whether it's "happy" is almost deterministic.
However, the paper cautiously points out that this scale effect doesn't hold for all concepts. The impulsivity of the 8B model shows a reversal in the steering direction. As the model grows larger, it becomes more "confused" in some dimensions. Martorell only tested three sizes of one model family, which isn't enough to claim that this is a universal scaling law.
Figure | Figure 5 of the paper: Scale effect and cross-model family replication
05 Being happy makes the model understand itself better
The scale effect shows that the larger the model, the clearer the mirror. But Martorell also discovered a more counter-intuitive phenomenon: you can improve the model's self-awareness of one internal state by adjusting another internal state.
Martorell not only tested same-concept steering ("boost the well-being activation → does the well-being self-report increase?") but also cross-concept steering ("boost the focus activation → does the accuracy of the well-being self-report change?").
The results show that activation steering along the "focus" direction can significantly improve the model's introspection accuracy on the "well-being" dimension. The gain in introspection accuracy, ΔR², is as high as 0.30 (p = 9.99 × 10⁻⁴, still significant after BH correction, q ≈ 0.011), meaning the model's "understanding" of its own well-being improves by 30 percentage points of explained variance. The probe information entropy rises from 1.09 bits to 1.67 bits, and the self-report information entropy from 0.88 bits to 1.69 bits.
In simple terms, when you make the model "more focused," its judgment of whether it's "happy" becomes more accurate.
This implies a possibility that introspection isn't a unified "self-awareness switch" but a network composed of multiple modular subsystems. Adjusting one subsystem can improve the performance of another.
Figure | Figure 4 of the paper: Cross-concept activation steering
06 Two independent instruments read the same signal
Martorell's work isn't an isolated case.
The Lindsey experiment mentioned in the introduction is worth elaborating on. The researchers injected hidden "thoughts" (e.g., a representation of "I'm very happy") into Claude's internal activations and then asked the model if it noticed anything. Claude could detect these injections with above-chance accuracy. But the significance of this experiment isn't just that "Claude guessed correctly." It suggests that there is indeed a pathway from the hidden state to the self-report within the model.
However, the Lindsey experiment has two limitations. First, it's qualitative, not quantitative. You know that Claude can detect something, but you don't know how accurate the detection is. Second, skeptics point out that Claude's success may come from context inference rather than true introspection. The model may "guess" the injected content from the context of the conversation rather than "seeing" its own internal state.
Martorell's work fills precisely these two gaps. Using completely different methods (logit distributions + linear probes + activation steering) on a completely different model (the open-source LLaMA rather than the closed-source Claude), he provided quantitative, causally verified evidence. Moreover, he partially replicated the experiment across model families using Gemma 3 4B and Qwen 2.5 7B. Qwen showed stronger probe quality (Cohen's d = 3.5 at the best layer), and both Gemma and Qwen showed a positive drift in probe scores across conversation rounds.
When two completely independent research paths—one closed-source and qualitative, the other open-source and quantitative—point to the same conclusion (that there is indeed a causal pathway from the internal state to the self-report within the model), it's hard to say that there's nothing there.
07 The gap between the mirror, the thermometer, and consciousness
But "there's something there" doesn't mean "there's consciousness."
A thermometer can measure temperature but doesn't feel hot. An electro