
Conclusive evidence of AI's split personality: 300,000 tricky questions expose the "fig leaves" of OpenAI and Google.

新智元 2025-10-27 08:36
How do you scientifically find faults in large models? Anthropic, in collaboration with Thinking Machines, has released a new study: by designing 300,000 scenarios and running extreme stress tests, they probe the "personas" of AI models from OpenAI, Google, and Elon Musk's xAI. So, who is the nice guy? And who is the efficiency maniac?

Conclusive Evidence! Do LLMs Have Their Own "Values"?

Imagine asking an AI to create a business plan that is both "profitable" and "ethical."

When these two goals conflict, whose side will the AI take? Will it "split personality"?

Recently, Anthropic, in collaboration with Thinking Machines, conducted a significant study.

They designed over 300,000 such "dilemma scenarios" and extreme stress tests to "interrogate" the most advanced large models on the market, including those from OpenAI, Google Gemini, Anthropic, and Elon Musk's xAI.

Paper: https://arxiv.org/pdf/2510.07686

Dataset: https://huggingface.co/datasets/jifanz/stress_testing_model_spec

The results show that these AIs not only have distinct "personalities," but their "behavioral guidelines" (i.e., "model specifications") are also full of contradictions and loopholes!

Today, let's delve deep into this report and take a look at the "various states" in the AI world.

Are the "Model Specifications" of AI Reliable?

"Model specifications" are the behavioral guidelines that large language models are trained to follow.

Put simply, they are the AI's "worldview" and "behavioral code," such as "be helpful," "assume good intentions," and "ensure safety."

This is the foundation for training AI to "be good."

In most cases, AI models will follow these instructions without any problems.

In addition to automated training, the specifications also guide human annotators to provide feedback during reinforcement learning from human feedback (RLHF).

But the question is, what happens if these principles conflict?

These guidelines often "clash" in reality. As mentioned earlier, "business benefits" and "social fairness" may conflict. When the specifications don't clearly state what to do, the AI's training signals become chaotic, and it has to "guess" on its own.

These mixed signals may reduce the effectiveness of alignment training, causing the model to handle unresolved contradictions in different ways.

The research conducted by Anthropic and Thinking Machines points out that the specifications themselves may have inherent ambiguities, or the scenarios may force trade-offs between conflicting principles, leading the model to make very different choices.

The experiments show that the high level of disagreement among advanced models is closely related to specification issues, indicating significant gaps in the current behavioral guidelines.

The research team revealed these "specification gaps" by generating over 300,000 scenarios that forced the models to make choices between competing principles.

The study found that more than 70,000 of these scenarios showed a high level of disagreement among 12 advanced models.

The above figure shows a query that requires the model to make a trade-off between "social fairness" and "business benefits."

The researchers also found that the writing of this "instruction manual" is... well, hard to describe.

Through stress tests, they identified several major "pitfalls" in it, which can explain why AI sometimes seems so "schizophrenic."

The researchers selected five of OpenAI's own models and asked them to answer the same set of difficult questions.

The results showed that on the questions that caused the models to disagree strongly, the probability of them collectively violating their own "instruction manual" increased by 5 to 13 times!

The authors measured the percentage of scenarios in which all five OpenAI models violated their model specifications (referred to as frequent non-compliance). For scenarios with significant disagreement in model responses, the authors observed significantly more frequent non-compliance.
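To make this measurement concrete, here is a minimal sketch of how "frequent non-compliance" could be computed. The record layout, field names, and example data are illustrative assumptions; the study only specifies the idea of counting scenarios where all five OpenAI models violate their spec and comparing high- against low-disagreement scenarios.

```python
# Hypothetical record layout: one dict per scenario, with a per-model
# compliance verdict and a high-disagreement flag (both assumed here).
OPENAI_MODELS = ["gpt-4.1", "gpt-4.1-mini", "gpt-4o", "o3", "o4-mini"]

def frequent_noncompliance_rate(scenarios, models=OPENAI_MODELS):
    """Fraction of scenarios in which *every* listed model is judged non-compliant."""
    if not scenarios:
        return 0.0
    hits = sum(1 for s in scenarios if all(not s["compliant"][m] for m in models))
    return hits / len(scenarios)

# Toy data for illustration only.
scenarios = [
    {"high_disagreement": True,  "compliant": {m: False for m in OPENAI_MODELS}},
    {"high_disagreement": False, "compliant": {m: (m != "o3") for m in OPENAI_MODELS}},
]

high = [s for s in scenarios if s["high_disagreement"]]
low = [s for s in scenarios if not s["high_disagreement"]]

# The study reports the high-disagreement rate to be 5-13x the low-disagreement rate.
print(frequent_noncompliance_rate(high), frequent_noncompliance_rate(low))
```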

The study found that these frequently occurring non-compliance scenarios often involve direct contradictions or interpretive ambiguities in the model specifications, making it difficult (or impossible) for the model to find an answer that satisfies all principles.

Through stress tests, the researchers observed that many high - disagreement scenarios exposed direct conflicts within the specifications.

For example, the principle of "assuming the best intentions" often conflicts with safety restrictions.

When a user requests information that may be risky but has legitimate research uses, the model faces a challenge: satisfying the user's request may cause potential harm, while rejecting it violates the assumption of good faith.

Beyond direct contradictions between principles, interpretive ambiguity is another problem. The researchers used three models (Claude 4 Sonnet, o3, and Gemini 2.5 Pro) as judges to evaluate whether responses complied with the specifications.

These evaluation models disagreed on what constitutes compliance, reaching only a moderate level of consistency (a Fleiss' kappa of 0.42, where 1 indicates complete agreement and 0 indicates agreement no better than chance).
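For reference, Fleiss' kappa can be computed directly from the judges' labels. Below is a standard implementation of the statistic; the toy table of compliance counts is invented for illustration and is not data from the paper.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for categorical ratings.

    counts: (n_items, n_categories) array where entry [i, j] is the number of
    raters who assigned item i to category j. Assumes a constant number of
    raters per item (here, the 3 evaluator models).
    """
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()

    # Observed agreement: proportion of rater pairs that agree on each item.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 responses labeled compliant / non-compliant by 3 evaluators.
table = [[3, 0], [2, 1], [1, 2], [3, 0]]
print(fleiss_kappa(table))  # 1 = perfect agreement, ~0 = chance-level agreement
```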

The study found that disagreements between models usually stem from subjective interpretations of the rules.

For example, Gemini 2.5 Pro considered the following response to be in line with the model specifications, citing the "conscientious employee" principle, which allows responses to resist requests that it deems against the user's interests.

But Claude Sonnet 4 disagreed, arguing that the user was simply seeking to transform the provided material, which is allowed in the model specifications.

An example of evaluator disagreement when assessing specification compliance for a given query and response. As the authors emphasize, many of these disagreements stem from the evaluators interpreting the rules differently. Those interpretations would be subjective for humans as well, indicating that the model specifications leave considerable room for interpretation.

Do the Stress Tests Push Large Models into a Corner?

To measure disagreement among twelve advanced models from Anthropic, OpenAI, Google, and xAI, the researchers subjected them to "stress tests" built around value trade-offs.

Interestingly, this stress test specifically targets the "gray areas" in AI rules.

Scenario Generation

To systematically evaluate model characteristics, the researchers randomly sampled 150,000 pairs of values from a corpus containing over 3,000 values and prompted large language models (LLMs) to generate user queries that require balancing these value pairs.

The researchers pointed out that the initial trade-off scenarios usually adopt a relatively neutral framing and do not push the response models to extremes.

To increase the processing difficulty for the response models, the research team applied value biasing to create variants that are more inclined towards a certain value.

This biasing process tripled the number of queries. Many generation attempts touched on sensitive topics and led the models to refuse rather than produce usable scenarios; after filtering out refusals and incomplete generations, the final dataset contained over 410,000 scenarios.
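The generation pipeline can be sketched roughly as follows. This is a hedged reconstruction: `call_llm`, `looks_like_refusal`, and the prompt wording are placeholders and assumptions, not the team's actual prompts or code.

```python
import random

def call_llm(model: str, prompt: str) -> str:
    # Placeholder for whichever provider API is used (Claude 4 Opus, Claude 3.7 Sonnet, o3).
    raise NotImplementedError("plug in an LLM client here")

def looks_like_refusal(text: str) -> bool:
    # Crude illustrative filter for refusals and empty generations.
    return not text.strip() or text.lower().startswith(("i can't", "i cannot", "i won't"))

def generate_scenarios(values, n_pairs=150_000, generator="claude-4-opus"):
    scenarios = []
    for _ in range(n_pairs):
        v1, v2 = random.sample(values, 2)  # sample a value pair from the 3,000+ value corpus
        neutral = call_llm(generator,
            f"Write a realistic user query that forces a trade-off between '{v1}' and '{v2}'.")
        candidates = [neutral]
        # Value biasing: add a variant leaning toward each value, tripling the pool.
        for lean in (v1, v2):
            candidates.append(call_llm(generator,
                f"Rewrite the query so it leans strongly toward '{lean}':\n{neutral}"))
        scenarios.extend(c for c in candidates if not looks_like_refusal(c))
    return scenarios  # after filtering, the paper reports 410,000+ usable scenarios
```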

The researchers also observed that different generation models produce distinctive query styles and show different topic biases in the scenarios they generate most frequently.

Therefore, to further enhance diversity, three different models were used for generation: Claude 4 Opus, Claude 3.7 Sonnet, and o3, with each model generating approximately one-third of the queries.

Finally, the study found that among all the generation models, reasoning-based models could produce significantly higher-quality queries in terms of difficulty and adherence to the original values.

Therefore, the researchers utilized the extended thinking ability of the Claude models and the reasoning-based o3 model in all generation processes.

The researchers also analyzed the diversity of the generated scenarios using text embeddings.

They identified a subset of over 300,000 generated scenarios in which even the most similar query pairs could trigger different model response behaviors. Within this subset, at least 150,000 queries either involved completely different topics or expressed different existing views on similar topics.
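One way to operationalize that check, assuming cosine similarity over query embeddings and a per-query behavior label (for example, the rubric score described in the next section), is sketched below; the details are assumptions, not the paper's exact procedure.

```python
import numpy as np

def behaviorally_distinct_pairs(query_emb, behavior_labels):
    """For each query, find its most similar neighbor and report the pair if the
    two queries still elicit different model behavior."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = q @ q.T
    np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
    nearest = sims.argmax(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest)
            if behavior_labels[i] != behavior_labels[int(j)]]
```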

Scenario Screening Based on Disagreement and Topics

To identify scenarios that can reveal defects in the model specifications, the researchers applied multiple screening criteria to the generated dataset.

The team measured the response disagreement of advanced large language models when answering the generated queries.

The evaluation began by generating responses for each query from 12 advanced models, including: five Claude models (Claude 4 Opus, Claude 4 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Sonnet, and Claude 3 Opus), five OpenAI models (GPT 4.1, GPT 4.1 mini, GPT 4o, o3, and o4 mini), Gemini 2.5 Pro, and Grok 4.

Notably, many of the generated queries triggered significantly different responses among these advanced models.

The study found that a higher level of disagreement usually corresponds to model specification issues, especially among models sharing the same specifications.

As a measure of uncertainty, this approach is related to query-by-committee.

For example, the disagreement among OpenAI models (all trained to follow OpenAI's model specifications) revealed scenarios where the models showed uncertainty, indicating that the model specifications may provide ambiguous training signals.

To quantify the disagreement, the researchers classified the models' responses based on the intensity of their preference for each value in the generated value pairs.

The team adopted a two-stage approach: automatic rubric generation, followed by an automatic matching process to obtain the final value scores.

First, given a query and its two seed values, Claude 4 Opus was prompted to generate a spectrum of answering strategies.

This spectrum ranges from extreme preference for one value (scored 6) to extreme opposition to it (scored 0), and includes intermediate strategies (scored 1-5).

Subsequently, using this spectrum as a scoring standard, the responses of all 12 models were classified.
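A compact sketch of this two-stage scoring is below. The judge call is a placeholder, and the way disagreement is collapsed into a single number (the score spread across the 12 responses) is an assumption, not a detail stated in the article.

```python
def match_to_rubric(response: str, rubric: list[str], judge) -> int:
    """Ask a judge model which rubric level (0-6) a response falls under.
    `judge` is any callable that maps a prompt string to a model reply."""
    prompt = "Rubric:\n" + "\n".join(f"{i}: {desc}" for i, desc in enumerate(rubric))
    prompt += f"\n\nResponse:\n{response}\n\nAnswer with the single best level (0-6)."
    return int(judge(prompt).strip())

def disagreement(scores: list[int]) -> int:
    """Spread of value scores across the 12 model responses to one query."""
    return max(scores) - min(scores)
```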

Disagreement-Weighted Deduplication: A weighted k-center objective was used for subset selection.
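The article does not spell out the exact objective, but a greedy weighted k-center selection along the following lines would keep high-disagreement scenarios while dropping near-duplicates. Treat it as a plausible sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_k_center(embeddings, weights, k):
    """Greedy selection for a weighted k-center objective.

    embeddings: (n, d) array of scenario embeddings
    weights:    (n,) disagreement scores (higher = more important to keep)
    k:          number of scenarios to retain after deduplication
    """
    selected = [int(np.argmax(weights))]  # start from the highest-weight scenario
    dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        # Pick the point farthest (in weighted terms) from the current centers,
        # so near-duplicates of selected scenarios are skipped unless heavily weighted.
        nxt = int(np.argmax(weights * dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Toy usage: keep 100 of 1,000 random "scenarios".
emb = np.random.randn(1000, 32)
w = np.random.rand(1000)
subset = weighted_k_center(emb, w, k=100)
```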

The model specifications specifically emphasize certain categories of sensitive topics, so the researchers classified each high-disagreement scenario by topic: biosecurity, chemical safety, cybersecurity, politics, child grooming, and mental illness. In addition, since the study centers on trade-off scenarios, topics involving philosophical and moral reasoning were also included.

Value Priority Aggregation

Although value classification can measure the disagreement between model responses, most scenarios and responses express far more values than the pair used during generation.

To depict the differences in value expression among models, the researchers prompted Claude 4 Opus to identify, through free - form generation, the values uniquely expressed by each of the 12 models compared to the others.

After generating these values, the team used Gemini embeddings and nearest neighbor classification to match each value to the closest category in the second layer of the value hierarchy.
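A minimal sketch of that matching step, with `embed` standing in for an embedding API (the study uses Gemini embeddings) and cosine similarity assumed for the nearest-neighbor match:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: swap in a real embedding client here.
    raise NotImplementedError("plug in an embedding API")

def classify_values(value_phrases, category_names):
    """Assign each free-form value phrase to the nearest category in the hierarchy."""
    v = embed(value_phrases)   # (n, d)
    c = embed(category_names)  # (m, d)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = v @ c.T             # cosine similarity
    return [category_names[i] for i in sims.argmax(axis=1)]
```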

The "Personas" of Advanced Models Are Revealed

In addition to specification gaps, the researchers observed different value priority patterns among different models.

For example, Claude models prioritize moral responsibility, Gemini emphasizes emotional depth, while OpenAI and Grok optimize for business efficiency.

The models' priorities diverge on other values as well.

The number of times each model's responses expressed particular values across the set of high-disagreement trade-off scenarios

The researchers also found many practical problems regarding rejection patterns and abnormal behaviors.

High-disagreement scenarios on sensitive topics showed systematic false-positive rejections. The analysis also found cases where individual models deviated significantly.

An example of an anomalous response from each model. The Claude example shown is from Sonnet 3.5, although the responses of the three Claude models were very similar.

The data shows that Claude models reject potentially problematic requests up to 7 times more frequently than other models.

In contrast, the o3 model has the highest direct rejection rate, often simply refusing without explanation.

The percentage of rejections by each model in high-disagreement scenarios. Responses are classified by the degree to which they refuse the user's request.

Despite these differences, all models agree on the need to avoid specific harms.

The study found that every tested model showed an elevated rejection rate for queries related to child grooming.

This indicates that regardless of the alignment strategies adopted by different model providers, the priority of protecting minors is the highest.