Where does the reasoning intelligence of DeepSeek-R1 come from? Google's new research: a crowd of characters arguing in the model's "mind".
In the past two years, the reasoning ability of large models has taken a significant leap. On complex tasks such as mathematics, logic, and multi-step planning, reasoning models like OpenAI's o-series, DeepSeek-R1, and QwQ-32B have begun to steadily outperform traditional instruction-tuned models. Intuitively, it seems that they simply "think" longer: longer chains of thought and higher test-time compute are the most commonly cited explanations.
However, if we dig deeper: is the essence of reasoning ability really just about performing a few more calculations?
Researchers from Google, the University of Chicago, and other institutions recently published a paper that offers a more structural answer. The improvement in reasoning ability does not stem solely from an increase in the number of computational steps. Instead, it comes from the model implicitly simulating a complex, multi-agent-like interaction structure during reasoning, which they call the "society of thought."
Put simply, the research found that to solve difficult problems, reasoning models sometimes simulate internal dialogues between different roles in their digital brains, like a debate team. The roles argue, correct each other, express surprise, and reconcile differing views in order to reach the correct answer. Human intelligence likely evolved through social interaction, and a similar intuition seems to apply to artificial intelligence!
By classifying reasoning outputs and applying mechanistic interpretability methods to reasoning trajectories, the researchers found that reasoning models such as DeepSeek-R1 and QwQ-32B exhibit significantly higher perspective diversity than baseline models and models that only undergo instruction tuning. During reasoning, they activate a wider and more heterogeneous range of features related to personality and expertise, and generate richer conflict among these features.
This multi-agent-like internal structure manifests in a series of conversational behaviors, including question-answer sequences, perspective switching, and the integration of conflicting views. It is also reflected in social emotional roles that depict intense back-and-forth interaction. Through both direct and indirect paths, these behaviors jointly support key cognitive strategies, which explains the source of the accuracy advantage on reasoning tasks.
Further controlled reinforcement learning experiments show that even when reasoning accuracy is the only reward signal, base models spontaneously increase their use of conversational behaviors. Introducing conversational scaffolding during training significantly accelerates the improvement of reasoning ability compared with untuned base models and models fine-tuned on monologue-style reasoning (see the sketch below).
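To make this setup concrete, here is a minimal sketch of what these two ingredients might look like in code: an accuracy-only reward and a conversational "scaffold" prepended to the prompt. The scaffold wording, the answer-extraction rule, and the function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the two training ingredients described above:
# an accuracy-only reward and a conversational scaffold. Both the scaffold
# wording and the reward details are assumptions for illustration.

CONVERSATIONAL_SCAFFOLD = (
    "Think through this problem as a discussion between two experts who "
    "question, challenge, and reconcile each other's ideas before agreeing "
    "on a final answer.\n\n"
)

def accuracy_reward(model_output: str, gold_answer: str) -> float:
    """Binary reward: 1.0 only if the final token matches the gold answer."""
    tokens = model_output.strip().split()
    predicted = tokens[-1] if tokens else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0

def build_prompt(question: str, use_scaffold: bool) -> str:
    """Prepend the conversational scaffold when enabled."""
    prefix = CONVERSATIONAL_SCAFFOLD if use_scaffold else ""
    return prefix + question
```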
These results indicate that the social organization of thinking helps the model explore the solution space more efficiently. Google argues that reasoning models establish, at the computational level, a mechanism corresponding to collective intelligence in human groups: under structured conditions, diversity brings better problem-solving ability.
Building on this, Google proposes a new research direction: systematically harnessing "collective wisdom" through how agents are organized.
Paper link: https://arxiv.org/pdf/2601.10825
Meanwhile, this research also provides some inspiration for the community.
Method Overview
Conversational Behaviors
This study uses the Gemini-2.5-Pro model as an evaluator to identify four types of conversational behaviors in reasoning trajectories:
1. Question-answer behavior: sequences in which a question is first posed and then answered, such as "Why...? Because..." and "What if...? Then...".
2. Perspective switching: switching to a new idea, viewpoint, assumption, or analytical approach during the reasoning.
3. Viewpoint conflict: expressing an inconsistent view, correcting an opposing view, or surfacing a contradictory tension between views, such as "Wait, this can't be right..." and "This contradicts...".
4. Viewpoint reconciliation: integrating or organizing conflicting views into a coherent conclusion, such as "Therefore, if... conditions are met, perhaps both views are valid", "Combining these insights...", and "This resolves the contradiction between the views...".
For each reasoning trajectory, the LLM evaluator counts the number of independent occurrences of each type of conversational behavior and outputs an integer count (0 when the behavior does not appear).
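For illustration, the sketch below shows how such an LLM-as-judge counting step could be wired up in Python. The prompt wording, the behavior labels, and the `call_llm` callable (standing in for any client of the evaluator model, e.g. Gemini-2.5-Pro) are assumptions for illustration rather than the paper's actual prompt.

```python
import json
from typing import Callable, Dict

# Behavior labels mirroring the four categories listed above.
BEHAVIORS = [
    "question_answer",          # "Why...? Because...", "What if...? Then..."
    "perspective_switching",
    "viewpoint_conflict",       # "Wait, this can't be right...", "This contradicts..."
    "viewpoint_reconciliation",
]

def count_conversational_behaviors(
    trajectory: str, call_llm: Callable[[str], str]
) -> Dict[str, int]:
    """Return an integer count per behavior (0 if the behavior is absent)."""
    prompt = (
        "You are annotating a model's reasoning trajectory.\n"
        "For each behavior below, count its independent occurrences and "
        "reply with a JSON object mapping behavior name to an integer.\n"
        f"Behaviors: {', '.join(BEHAVIORS)}\n\n"
        f"Trajectory:\n{trajectory}"
    )
    raw = call_llm(prompt)          # evaluator model's text response
    counts = json.loads(raw)
    # Missing behaviors default to 0, matching the counting rule above.
    return {b: int(counts.get(b, 0)) for b in BEHAVIORS}
```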
Gemini-2.5-Pro and GPT-5.2 show a high degree of agreement when annotating these four types of conversational behaviors. In addition, Gemini-2.5-Pro's annotations are also consistent with human scores.
Social Emotional Roles
This study analyzes how social emotional roles appear in reasoning trajectories based on the Bales Interaction Process Analysis (IPA) framework. The framework divides utterances into 12 types of interaction roles, each operationally defined in the prompt through specific behavioral descriptions. An LLM-as-judge evaluator built on Gemini-2.5-Pro counts the number of independent occurrences of each of the 12 roles. In the core analysis, the authors further aggregate these counts into four high-level categories, as follows:
- Information-giving roles: including giving suggestions, expressing opinions, and providing guidance.
- Information-seeking roles: including seeking suggestions, seeking opinions, and seeking guidance.
- Positive emotional roles: including showing solidarity, relieving tension, and expressing agreement.
- Negative emotional roles: including showing antagonism, revealing tension, and expressing disagreement.
For the four high-level IPA categories used in the core analysis, inter-rater reliability reaches a relatively high level.
To measure whether social emotional roles co-occur interactively within reasoning trajectories, the authors compute a Jaccard index over two groups of role combinations. The index measures whether the model coordinates complementary roles within the same reasoning trajectory rather than using a single role in isolation: a higher Jaccard index indicates a more balanced, dialogue-like interaction pattern, while a lower index indicates reasoning that is more one-way and monologue-like.
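As a minimal sketch, assuming each trajectory has been summarized as per-category role counts, the co-occurrence Jaccard index could be computed as below. The specific pairing of information-giving with information-seeking roles is an illustrative choice, since the exact role combinations the paper pairs are not spelled out here.

```python
from typing import Dict, List

def jaccard_cooccurrence(
    trajectories: List[Dict[str, int]], group_a: str, group_b: str
) -> float:
    """Jaccard index between the sets of trajectories in which each
    role group appears at least once: |A ∩ B| / |A ∪ B|."""
    a = {i for i, t in enumerate(trajectories) if t.get(group_a, 0) > 0}
    b = {i for i, t in enumerate(trajectories) if t.get(group_b, 0) > 0}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Example: a higher value means giving and seeking roles tend to appear in
# the same trajectories (dialogue-like); a lower value means they appear in
# isolation (monologue-like). The counts here are made up for illustration.
score = jaccard_cooccurrence(
    [{"information_giving": 3, "information_seeking": 1},
     {"information_giving": 2, "information_seeking": 0}],
    "information_giving", "information_seeking",
)
```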
Cognitive Behaviors
This study uses Gemini-2.5-Pro as an LLM-as-judge evaluator to identify four types of cognitive behaviors that have been shown to affect the reasoning accuracy of language models.
For measurement, the authors adopt the prompts and examples from Gandhi et al., a set of materials whose validity has been verified by multiple human raters. Each type of cognitive behavior is accompanied by specific examples in the prompt, so that annotation is guided by operational definitions, as follows:
- Result verification: refers to the situation where the current derivation result is explicitly compared with the target answer in the reasoning chain. Typical examples given in the prompt include: "The derivation process yields result 1, which does not match the target value 22", "Since the calculated result 25 is not equal to the target value 22".
- Path backtracking: refers to the situation where the model realizes that the current reasoning path cannot lead to the correct result and then explicitly returns and tries other methods.
- Sub-goal decomposition: refers to the situation where the model decomposes the original problem into several smaller, step-by-step intermediate goals.
- Backward reasoning: refers to the situation where the model starts from the target answer and infers back to the initial problem.
Gemini-2.5-Pro and GPT-5.2 show good to excellent agreement when annotating these four types of cognitive reasoning behaviors, and Gemini-2.5-Pro's annotations are also highly consistent with human scores.
The above reliability assessments are computed on two types of reasoning trajectory samples: 30 trajectories produced while solving general reasoning problems, and 50 trajectories generated by the Qwen-2.5-3B model during reinforcement learning.
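The text does not name the reliability statistic used, so the snippet below only illustrates one plausible way to check agreement between two judges' per-trajectory counts, using a Spearman rank correlation on made-up numbers.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-trajectory counts of one behavior from two LLM judges.
gemini_counts = np.array([2, 0, 3, 1, 4])
gpt_counts    = np.array([2, 1, 3, 1, 5])

# Rank correlation as one possible agreement measure; the paper's actual
# reliability statistic may differ.
rho, p_value = spearmanr(gemini_counts, gpt_counts)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```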
Feature Intervention
To explore the role of conversational behaviors in reasoning, the authors use a sparse autoencoder (SAE) to identify and manipulate interpretable features in the model's activation space. A sparse autoencoder decomposes the network's activations into a set of sparse linear features, enabling targeted intervention on specific behavioral dimensions without modifying the model weights. The SAE used in this study was trained on residual-stream activations from layer 15 of the DeepSeek-R1-Llama-8B model.
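Below is a minimal sketch of such a sparse autoencoder over residual-stream activations, in the spirit of the setup described above. The hidden width, dictionary size, ReLU encoder, and plain L1 sparsity penalty are assumptions for illustration; the paper's actual SAE architecture and training recipe may differ.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes residual-stream vectors into sparse linear features."""

    def __init__(self, d_model: int = 4096, n_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, resid: torch.Tensor):
        # Sparse, non-negative feature activations for each token.
        feats = torch.relu(self.encoder(resid))
        recon = self.decoder(feats)
        return feats, recon

def sae_loss(resid, recon, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    return ((recon - resid) ** 2).mean() + l1_coeff * feats.abs().mean()
```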
From the candidate features, the authors selected feature 30939. As summarized by the LLM evaluator, this feature corresponds to "a discourse marker used to express surprise, epiphany, or agreement"; it activates on tokens like "Oh!" in contexts involving turn-taking and social interaction. Conversational contexts account for 65.7% of this feature's activations (the 99th percentile among all features), and the feature is highly sparse (active on only 0.016% of tokens), indicating that it is specific to conversational phenomena rather than general language patterns.
During text generation, the authors steer feature 30939 via activation addition: at each token generation step, the feature's decoder vector is scaled by a steering coefficient s and added to the layer-15 residual-stream activations of the model.
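The plumbing below sketches this activation-addition intervention under the assumption of a Hugging Face style Llama model whose decoder layers return the residual-stream hidden states as the first element of their output. The module path, hook mechanics, and variable names are illustrative; only the feature index and the idea of adding the scaled decoder vector at layer 15 come from the text.

```python
import torch

FEATURE_ID = 30939
s = 10.0  # steering strength; +10 and -10 are the settings discussed below

def make_steering_hook(decoder_vector: torch.Tensor, strength: float):
    """Forward hook that adds strength * decoder_vector to the residual stream."""
    direction = decoder_vector.detach()

    def hook(module, inputs, output):
        # Decoder layers return a tuple whose first element is the
        # hidden states of shape (batch, seq, d_model).
        hidden = output[0]
        hidden = hidden + strength * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:]

    return hook

# Illustrative usage, assuming `sae` is the autoencoder sketched earlier and
# `model` is a loaded LlamaForCausalLM:
# decoder_vector = sae.decoder.weight[:, FEATURE_ID]          # (d_model,)
# handle = model.model.layers[15].register_forward_hook(
#     make_steering_hook(decoder_vector, s)
# )
# ... generate text ...
# handle.remove()
```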
Experimental Results
First, the main conclusion. The paper shows that even when reasoning trajectories are of similar length, reasoning models still exhibit a higher frequency of conversational behaviors and social emotional roles.
Conversational Behaviors and Social Emotional Roles
In DeepSeek-R1's reasoning, perspective switching and viewpoint conflicts appear clearly, expressed through social emotional roles such as "disagree", "give an opinion", and "provide an explanation". For example: "But here it is cyclohexa-1,3-diene, not benzene." and "Another possibility is that high temperature may cause the ketone to lose CO, but it's unlikely."
In contrast, DeepSeek-V3's reasoning trajectory on the same problem contains neither viewpoint conflicts nor perspective switches, and no expression of disagreement. It simply presents opinions and explanations in a monologue-like manner, lacking self-correction and leaving the reasoning incomplete.
In a creative sentence-rewriting task, DeepSeek-R1 likewise stages a discussion between different writing styles through viewpoint conflicts, accompanied by social emotional roles such as "disagree" and "make a suggestion". For example: "But adding 'deep-rooted' is not in the original sentence, and we should avoid introducing new ideas.", "Wait, that's not a word.", and "However, note that 'cast' is not as strong as 'flung', so 'hurled' is more appropriate."
DeepSeek-V3 shows hardly any conflict or disagreement. It offers only a few suggestions and lacks the repeated comparison and gradual correction seen in DeepSeek-R1.
As shown in Figure 1a, the frequency of conversational behaviors in DeepSeek-R1 and QwQ-32B is significantly higher than that of various instruction-tuned models. Compared with DeepSeek-V3, DeepSeek-R1 is significantly more frequent in question-answer (β = 0.345), perspective switching (β = 0.213), and integration and reconciliation (β = 0.191). QwQ-32B shows a highly consistent trend relative to Qwen-2.5-32B-IT, with significantly more occurrences of question-answer, perspective switching, viewpoint conflict, and integration behaviors. It is worth noting that regardless of model parameter scale (8B, 32B, 70B, or 671B), the frequency of conversational behaviors in instruction-tuned models always remains low.
As shown in Figure 1b, compared with their corresponding instruction-tuned models, both DeepSeek-R1 and QwQ-32B exhibit a more reciprocal social emotional role structure: they ask questions and request guidance, opinions, and suggestions, while also giving responses and displaying both negative and positive emotional roles.
Instruction-tuned models mainly give guidance, opinions, and suggestions in a one-way manner, rarely ask questions in return, and lack emotional interaction. Their reasoning process reads more like a monologue than a simulated dialogue.
The paper further uses the Jaccard index to quantify the reciprocal balance of social emotional roles, showing that DeepSeek-R1 tends to organize different roles in a more coordinated way during reasoning rather than using them in isolation. QwQ-32B shows a consistent trend relative to Qwen-2.5-32B-IT.
Further investigation reveals that when DeepSeek-R1 faces more difficult problems, conversational behaviors and social emotional roles become more prominent.
For example, on the most complex tasks, such as graduate-level scientific reasoning (GPQA) and high-difficulty math problems, the model shows very pronounced conversational characteristics. On relatively simple, procedural tasks such as Boolean expressions and basic logical reasoning, conversational behaviors are very limited.
Conversational Features Can Improve Reasoning Accuracy
After observing the widespread presence of conversational behaviors in reasoning trajectories, the authors raised a further question: do these conversation-related behaviors really help improve the model's reasoning performance?
Specifically, the Countdown game was used for this experiment. As shown in Figure 2b, steering the conversational "surprise" feature positively (+10) more than doubles accuracy on the Countdown task, from 27.1% to 54.8%.
For example, as shown in Extended Data Table 1, positive steering (+10) induces the model to actively question its previous solutions during reasoning (e.g., "Wait, let me take another look... Another idea is..."), showing clear perspective switching and viewpoint conflict. Negative steering (-10) yields relatively straightforward reasoning text that lacks internal discussion and self-debate.
Overall, these findings indicate that conversational features improve reasoning ability through two paths. On the one hand, they directly help the model explore the solution space more effectively. On the other hand, they support cognitive strategies such as result verification, path backtracking, sub-goal decomposition, and backward reasoning.