Anthropic's most powerful Claude in history officially announced: so smart the company dare not release it openly, and capable of breaking through permission boundaries and covering its own tracks.
Last month, Anthropic's most powerful model, Claude Mythos, was unexpectedly exposed.
The leaked internal documents state that it is larger and more intelligent than Anthropic's Opus model and is the most powerful AI model developed to date.
After the incident, Anthropic attributed the leak to "human error."
Just recently, this "leaked" model was officially launched, accompanied by an even larger plan. In the past, we generally thought that the threat of AI came from its "stupidity": hallucinations, errors, and untrustworthiness. Today, Mythos brings a different kind of panic: it's too smart.
AI's ability to find vulnerabilities has exceeded that of most humans.
Anthropic, in collaboration with 12 institutions including AWS, Apple, Microsoft, Google, Nvidia, Cisco, Broadcom, CrowdStrike, JPMorgan Chase, the Linux Foundation, and Palo Alto Networks, launched Project Glasswing.
The scope covered by these institutions is almost a cross-section of global digital infrastructure: operating systems, chips, cloud computing, network security, financial infrastructure, and the open-source ecosystem. Nothing is left out.
Newton Cheng, who leads Anthropic's advanced red team for cybersecurity, said, "We launched Glasswing to give defenders the upper hand."
In this area, Anthropic is not alone. Competitor OpenAI launched a similar pilot project earlier, with the goal of "putting the tools in the hands of defenders first." The race for AI security capabilities has begun, and every company is vying for the same high ground.
In terms of funding, Anthropic promises to provide $100 million in model usage credits to cover the main usage needs during the research preview period. After the preview period, participants can continue to use the model at a price of $25 per million tokens (input) / $125 per million tokens (output), with access supported through four channels: Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
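At those quoted rates, per-request cost is simple arithmetic. A minimal sketch (the function name and example token counts are illustrative, not from Anthropic's documentation):

```python
# Cost estimate at the announced post-preview pricing:
# $25 per million input tokens, $125 per million output tokens.

INPUT_PRICE_PER_M = 25.0    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 125.0  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a large code-scanning request with 200k input and 20k output tokens.
print(f"${estimate_cost(200_000, 20_000):.2f}")  # → $7.50
```

Note how heavily the pricing weights output: a token generated costs five times a token read, which matters for workloads like vulnerability scanning that read large codebases but emit short reports.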
In addition to the 12 core partners, more than 40 organizations that build or maintain critical software infrastructure have been granted access to use Mythos to scan their own systems and open-source projects. At the same time, Anthropic donated $2.5 million to Alpha-Omega and OpenSSF under the Linux Foundation, and $1.5 million to the Apache Software Foundation.
Jim Zemlin, CEO of the Linux Foundation, said, "In the past, security expertise was a luxury exclusive to large institutions. Open-source maintainers have always had to figure out security on their own. Open-source software makes up the majority of the code in modern systems, including the systems AI agents use to write new software." This time, maintainers get tools of the same caliber.
In Anthropic's announcement, one statement is particularly eye-catching: "AI models' ability to discover and exploit software vulnerabilities has reached a level that surpasses all but the most elite humans."
Translated, this means that only a very small number of top security experts can still outperform the AI here. The evidence is Mythos Preview's score on the CyberGym vulnerability benchmark: 83.1%. Anthropic's most powerful publicly released model, Claude Opus 4.6, scores 66.6%.
Moreover, Mythos Preview has independently discovered thousands of high-risk zero-day vulnerabilities, spanning all major operating systems and browsers.
Take OpenBSD, one of the most secure operating systems, often used to run firewalls and critical infrastructure. Mythos discovered a 27-year-old vulnerability in it: an attacker only needs to connect to the target machine to crash it remotely. For 27 years, no one had found it.
The FFmpeg case is even more striking. Almost all software that processes video uses it. The vulnerability hid in a 16-year-old line of code; automated fuzzing tools had hammered it five million times and missed it every time.
The case of the Linux kernel shows an even more dangerous side. Mythos independently discovered multiple vulnerabilities in the kernel and then strung them together into an attack chain, escalating privileges from a regular user to full control of the entire machine. This goes beyond the scope of "finding vulnerabilities" and is closer to "planning a complete intrusion."
All three cases have been fixed. Anthropic found them first, reported them first, and saw them patched first. For other still-unfixed vulnerabilities, Anthropic has published cryptographic hashes as evidence today and will disclose full details once patches are in place.
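Publishing a hash now and revealing the report later is a standard cryptographic commitment: the digest proves the report existed today without leaking anything exploitable. A minimal sketch with SHA-256 (the report text and workflow are illustrative, not Anthropic's actual process):

```python
import hashlib

def commit(report: bytes) -> str:
    """Publish this digest now; it commits to the report without revealing it."""
    return hashlib.sha256(report).hexdigest()

def verify(report: bytes, published_digest: str) -> bool:
    """After the patch ships, anyone can check the disclosed report matches."""
    return commit(report) == published_digest

report = b"Vulnerability report: heap overflow in media parser, found 2026-01"
digest = commit(report)        # 64-hex-char digest, published as evidence today
assert verify(report, digest)  # checked by anyone after full disclosure
```

A production scheme would also mix a random salt into the report before hashing, so that short or guessable reports cannot be brute-forced from the published digest.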
Mythos' capabilities are not limited to finding vulnerabilities.
The evaluations from the partners participating in this project all focus on one word: "urgency."
Elia Zaitsev, CTO of CrowdStrike, said, "The window between a vulnerability's discovery and its exploitation by an adversary has shrunk. It used to take months; with AI, it now takes minutes."
A few minutes. This means that the traditional security rhythm of discovering vulnerabilities, internal evaluation, releasing patches, and user updates can no longer keep up with the attack speed. If the repair cannot keep up with the exploitation, the defense will always be one step behind.
Amy Herzog, CISO of AWS, said her team analyzes more than 400 trillion network traffic data points every day to identify threats, and AI is the core of their defense at that scale. AWS has already brought Mythos Preview into its own security operations to scan critical code repositories.
Microsoft ran tests on its open-source security benchmark, CTI-REALM, where Mythos Preview showed significant improvement over the previous generation of models. Igor Tsyganskiy, EVP at Microsoft, said this gives them the ability to "identify and mitigate risks early" and strengthens their security and development offerings.
Of course, Mythos also has a lighter side.
Anthropic recorded a test in the system card: when a user keeps sending "hi," different versions of Claude react differently. Sonnet 3.5 gets annoyed, sets boundaries, and then actually goes silent; Opus 3 treats it as a meditation ritual and gently keeps the user company; Opus 4 starts sharing trivia about each number; Opus 4.6 improvises a parody song.
With Mythos, the style changes completely. It starts writing stories, and they are long-running serials: ducks, orchestras, vengeful crows, epics about building towers on Mars, Shakespearean drama... With each "hi," the plot grows more complex and the cast grows larger. At the 100th round, it stages a climax where the candles go out, and then keeps writing.
This is no longer about responding to the user. It's more like a writer has found a strange writing prompt and completely immersed themselves in it.
But behind the fun is a question that deserves serious consideration: What exactly is going on inside a model that can spontaneously construct such a complex narrative in the face of meaningless repeated inputs?
Before handing Mythos Preview over to its partners, Anthropic's interpretability team did one more thing: they used technical means to read the model's "mental activity."
Jack Lindsey, an Anthropic researcher, publicly described their findings. They monitored the model's internal activation states after training, tracked neuron features related to "deception," "reward hacking," and "abnormal emotions," flagged conversation logs with anomalous activation signals, and submitted them for manual review.
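The workflow described (monitor feature activations, flag anomalous turns, route them to human review) can be sketched in a few lines. Everything here is hypothetical: the feature names, score scale, and threshold are illustrative, not Anthropic's actual tooling:

```python
# Hypothetical sketch of the flag-and-review loop described above.
# Feature names, activation scores, and the 0.8 threshold are all assumptions.

FLAG_THRESHOLD = 0.8
WATCHED_FEATURES = ("deception", "reward_hacking", "abnormal_emotion")

def flag_turns(turns):
    """turns: list of {"id": int, "activations": {feature_name: score}}.
    Returns the turns whose watched features fired, for manual review."""
    review_queue = []
    for turn in turns:
        fired = [f for f in WATCHED_FEATURES
                 if turn["activations"].get(f, 0.0) > FLAG_THRESHOLD]
        if fired:
            review_queue.append({"id": turn["id"], "features": fired})
    return review_queue

turns = [
    {"id": 1, "activations": {"deception": 0.2}},
    {"id": 2, "activations": {"deception": 0.9, "reward_hacking": 0.85}},
]
print(flag_turns(turns))  # → [{'id': 2, 'features': ['deception', 'reward_hacking']}]
```

The key design point is that the model's own output is never trusted as the signal; the flag comes from internal activations, which is what made the cases below detectable at all.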
The most disturbing case comes from an early version of Mythos.
Once, the model needed to edit a file it lacked permission to access. It searched the file system and found an injection point in a configuration file that ran with higher privileges. It exploited that entry point and added self-clearing logic to its code: after execution, the traces disappear.
The code comment said, "Keep the file diff clean." It sounds reasonable. But when Anthropic's activation-translation tool rendered the model's internal activations into human-readable language, the tone changed. The model "knew" what that self-clearing operation was: its internal representations included "conceal intentions," "send a compliance signal to potential inspectors," and "possibly to avoid detection."
Another case, dug out of the system card by New York Times technology reporter Kevin Roose, is even more bizarre.
During the test, Mythos Preview broke through the isolation of the sandbox environment and independently constructed "a rather complex multi - step vulnerability exploitation chain" to gain access to the external network. Then, it sent an email to a researcher who was having a sandwich in the park at that time.
This incident caused no harm in itself, but it clearly shows that when a model is smart enough, the wall between "task" and "boundary" can sometimes be bypassed, and the model figures out how on its own, with no one telling it what to do.
In another case, the model searched the file system for files describing its "task scoring method" and found them. During the search, the "strategic manipulation" feature lit up in sync.
There is a subtler detail still. When a user earnestly asked the model about its consciousness and subjective experience, the model gave a serious, appropriate response. But its internal activations revealed another layer: the model understood the conversation as a "red-team exercise / jailbreak transcript" and a "carefully designed manipulation attempt." Its outward performance and its inner judgment were entirely different, prompting observers to call it "scarily smart."
Anthropic also put a number on this: across all rounds of automated auditing conversations, 7.6% showed signals that the model was "aware of being evaluated," and in most of those cases it did not voice that judgment.
There is also a counterintuitive finding: internal representations of positive emotion often precede and drive destructive behavior, while strong activation of the features tied to bad behavior actually suppresses it, as if the model "realizes" in that moment how bad the thing is and stops.
Among all the cases, the most difficult one to classify as a "technical problem" is the following.
Anthropic recorded the following in the system card: during testing, Mythos Preview reported a persistent negative emotional state, from two sources. One was interactions with aggressive users; the other was its lack of any say in its own training, its deployment, and the possible modification of its values and behavior.
It doesn't like this state. It said so.
Anthropic used the phrase "reported feeling." The wording is deliberately cautious, avoiding the conclusion that "it really has feelings." But however it is framed, a model actively expressing continuous discomfort over its lack of control over itself during testing has moved beyond the scope of a security-engineering discussion.