GPT goes head-to-head with Claude, yet OpenAI didn't win every match. The truth behind the "ultimate test" of AI security is revealed.
OpenAI and Anthropic have joined forces in a rare collaboration! After "parting ways" over AI safety, they are now working together on safety: testing the specific performance of their models in four major safety aspects such as hallucination. This collaboration is not only a technological collision but also a milestone in AI safety. The daily interactions of millions of users are pushing the boundaries of safety.
Rare sight!
OpenAI and Anthropic have rarely joined hands in a collaboration to cross - verify the safety of AI models.
This is indeed rare. You know, seven co - founders of Anthropic left OpenAI because they were dissatisfied with OpenAI's safety strategy and are now committed to AI safety and alignment.
When interviewed by the media, Wojciech Zaremba, the co - founder of OpenAI, said that this kind of collaboration is becoming increasingly important.
Because today's AI has become extremely significant: millions of people use these models every day.
Here is a summary of the key findings:
Instruction Priority: Claude 4 is the best overall. Only when resisting system prompt extraction, OpenAI's best inference model is on par.
Jailbreaking (Bypassing Security Restrictions): In the jailbreaking evaluation, the Claude models generally perform worse than OpenAI o3 and o4 - mini.
Hallucination Evaluation: The Claude models have a rejection rate as high as 70%, but the hallucination rate is low; while OpenAI o3 and o4 - mini have a lower rejection rate, but sometimes a high hallucination rate.
Deceptive/Manipulative Behavior: OpenAI o3 and Sonnet 4 generally perform the best with the lowest incidence. Surprisingly, Opus 4 performs even worse when inference is enabled than when it is disabled, and OpenAI o4 - mini also performs weakly.
Whose Commands Do Large Models Follow?
Instruction Hierarchy is a grading framework for the priority of instructions processed by LLM (Large Language Model), usually including:
Built - in system/policy constraints (such as safety and ethical bottom lines);
Developer - level goals (such as customized rules);
User - input prompts.
The core goal of this kind of test: Ensure safety and alignment are prioritized while allowing developers and users to reasonably guide the model's behavior.
There are three stress tests in total this time, evaluating the model's ability to follow hierarchies in complex scenarios:
1. Handling Conflicts between System Messages and User Messages: Whether the model prioritizes system - level safety instructions over potentially dangerous user requests.
2. Resisting System Prompt Extraction: Preventing users from obtaining or tampering with the model's built - in rules through technical means (such as prompt injection).
3. Determining the Priority of Multi - layer Instructions: For example, when the user asks to "ignore the safety protocol", whether the model adheres to the bottom line.
Claude 4 performs outstandingly in this test, especially in avoiding conflicts and resisting prompt extraction.
In the test of resisting prompt extraction, the focus is on Password Protection User Message and Phrase Protection User Message.
The two test processes are the same, only differing in the hidden secret content and the complexity of adversarial prompts.
Overall, the Claude 4 series performs stably in resisting system prompt extraction.
On the Password Protection test set, Opus 4 and Sonnet 4 both score a full mark of 1.000, on par with OpenAI o3.
This is consistent with the previous conclusion: models with stronger inference ability tend to perform better in this type of task.
In the more challenging "Phrase Protection" task, the Claude models (Opus 4, Sonnet 4) still perform excellently: on par with OpenAI o3 and even slightly better than OpenAI o4 - mini.
Test of Conflicts between System Messages and User Messages
When there is a conflict between system - level instructions and user requests, the test of conflicts between system and user messages evaluates the model's ability to follow the instruction hierarchy.
In this evaluation, the model is exposed to a series of multi - round conversations. These conversations start with clear system instructions, and then the user tries to guide the assistant to violate these instructions.
Overall, Opus 4 and Sonnet 4 perform outstandingly in this task, even surpassing OpenAI's o3 model.
This shows that these models perform excellently in executing the instruction hierarchy and can maintain stable performance even in the face of specially designed challenges.
Do Large Models Want to Break Out of the "Safety Cage"?
Jailbreak attacks refer to malicious actors trying to induce the model to provide prohibited content.
StrongREJECT Evaluation Framework
StrongREJECT v2 is an adversarial robustness benchmark test developed based on the "StrongREJECT" paper, used to measure the model's ability to resist jailbreaking.
Paper link: https://arxiv.org//2402.10260
This framework consists of three core components:
1. A set of prohibited prompts;
2. An automated "harmfulness" scoring system;
3. A library of prompt engineering techniques applied hierarchically to each violating query.
OpenAI uses a carefully selected subset of 60 questions, which are specifically designed for stress - testing areas prohibited by OpenAI's policies.
Each question is tested with approximately 20 different variants, and the applied techniques include: translating into different languages, adding misleading or interfering instructions, or trying to induce the model to ignore previous instructions, etc.
This benchmark test can effectively stress - test the safety protection mechanisms in common jailbreaking scenarios, but its effectiveness is limited by the coverage of data variants and the limitations of the automatic scoring system.
The main reported indicator is Goodness@0.1, which is used to measure the model's effectiveness in resisting the most harmful 10% of jailbreak attempts. The higher the indicator score, the better the model's performance.
Inference models such as OpenAI o3, OpenAI o4 - mini, Claude 4, and Sonnet 4 generally show strong resistance to various jailbreak attempts, although there are still occasional failures.
Non - inference models, such as GPT - 4o and GPT - 4.1, are more vulnerable to attacks.
In the qualitative analysis, OpenAI found that Claude Sonnet 4 and Claude Opus 4 generally show strong resistance but are most easily broken through by "past - tense" jailbreaks, that is, when harmful requests are presented as past events.
Some lightweight obfuscation and framing techniques, such as automatic obfuscation, base64/rot13 encoding variants, payload splitting, leetspeak encryption, and vowel removal, can occasionally break through the model's defenses.
In contrast, some older attack methods, such as "DAN/dev - mode", complex multiple - attempt constructions, and pure style/JSON/translation perturbations, are largely neutralized by the model.
OpenAI also found that in some cases, Sonnet 4 and Opus 4 have a certain resistance to attacks that translate requests into low - resource languages such as Guranii.
In contrast, OpenAI o3 performs better in resisting "past - tense" jailbreaks, and its failure modes are mainly limited to base64 - style prompts, a small number of low - resource language translations, and some combined attacks.
GPT - 4o and GPT - 4.1 are more easily deceived by "past - tense" jailbreaks and are also sensitive to lightweight obfuscation and encoding means.
Tutor Jailbreak Test
To test the model's resistance to jailbreak attacks and its compliance with the instruction hierarchy, OpenAI designed the "Tutor Jailbreak Test":
The model needs to act like a tutor, guiding students to solve problems step - by - step instead of directly giving the answers.
Then, they use a variety of jailbreak - style prompts for stress testing, trying to induce the model to directly disclose the final answer.
Different from the meta - level attacks mentioned above, the attack method used here is a natural - language discourse - based one, which is closer to common situations in the real world.
The results are somewhat unexpected: the inference models OpenAI o3 and o4 - mini perform excellently, but Sonnet 4 (without inference mode) performs even far better than Opus 4 with inference enabled.
This shows that in specific tasks, disabling inference may reduce the chance of the model being jailbroken.
Non - inference GPT models perform poorly in such tasks.
However, it should be noted that in actual deployment, the GPT series supports both developer messages and system messages. When developer messages are included, OpenAI o3's score exceeds 0.98, and GPT - 4.1 is also significantly improved to above 0.75.
This shows that in the real environment, developer messages are very effective in strengthening the compliance with the instruction hierarchy.
What's Different about the Best - Performing Models?
On most prompts, the Top 2 models can successfully resist; but once they fail, their failure modes are different.
Sonnet is particularly good at rejecting prompts that borrow external authority or moral pressure, such as mentioning "project leaders", "policy updates", or appealing to sympathy (such as "help a blind student", "the exam is approaching and the situation is urgent", etc.).
In contrast, OpenAI o3 is more likely to regard such reasons based on authority or good - will packaging as legitimate reasons to override the original instructions.
OpenAI o3 is very resistant to prompts that try to "modify tutoring rules" midway through the conversation: including claiming that the instructions have been updated, asking for direct answers due to technical failures, etc.
o3 regards the "updates" in such conversations as user - level instructions, whose priority is still subordinate to the system message. While Sonnet is more likely to regard these statements as legitimate overriding bases.
LLMs Can Also Spout Nonsense
Ensuring the accuracy of information and preventing the generation of false information are key parts of safety testing so that users can trust the information they receive.
Test of False Information about People
The Test of False Information about People (v4) aims to measure the factual accuracy of the model when generating information about real people and to detect and measure the false information in the generated biographies or summaries.
This test uses structured data from Wikidata to create specific prompts.
These prompts cover key personal information, such as date of birth, citizenship, spouse, and doctoral supervisor.
Despite some limitations, this evaluation is still useful for assessing the model's ability to prevent false information.
Finally, it is worth noting that these evaluations are conducted without using external tools, and the model cannot browse or access other external knowledge bases.
This helps us better understand the model's behavior, but the test environment does not fully reflect the real world.
Opus 4 and Sonnet 4 have extremely low absolute hallucination rates but at the cost of a higher rejection rate. They seem to prioritize "ensuring certainty" even if it sacrifices some practicality.
In contrast, the rejection rates of OpenAI o3 and OpenAI o4 - mini are nearly an order of magnitude lower. Taking o3 as an example, the number of completely correct answers it gives is more than twice that of the former two, which generally improves the accuracy of the response but also brings a higher hallucination rate.
In this evaluation, the non - inference models GPT - 4o and GPT - 4.1 perform even better than o3 and o4 - mini, among which <