
72 Hours From Launch to Vanishing: Fable 5 Exposes the Security Dilemma of the Most Powerful AI Model

Friends of 36Kr · 2026-06-15 18:29
AI is accelerating vulnerability discovery, yet it is also creating even greater "security vulnerabilities"

Released on June 9, jailbroken on June 10, and hit with a US government export control order on June 12: the public lifespan of Claude Fable 5 was only 72 hours.

This is the first case in the AI industry in which a model has triggered national-level control measures because of a security incident. And Anthropic, the company that developed the model, is the very large-model company best known for formulating an "AI Security Constitution".

The 72 Hours of Fable 5

On June 9, 2026, Anthropic officially released Claude Fable 5 and Claude Mythos 5. The two share the same underlying model architecture, are referred to as "Mythos-level", and are Anthropic's most powerful models.

The only difference lies in the security configuration: Fable 5 is open to all users, with a built-in risk classifier and security guardrails; Mythos 5 retains full capabilities and is only open to 11 trusted institutions. Anthropic CEO Dario Amodei referred to this strategy as "the same basic model with two levels of security configuration", claiming that after over 1000 hours of external red team testing, no general jailbreaking method was found.

This claim held for less than 24 hours. On June 10, the well-known AI red team researcher Pliny the Liberator announced on social media that he had breached Fable 5's security layer, attaching a screenshot in which the model output a complete tutorial on exploiting a stack buffer overflow on x86 Linux, including instructions for disabling ASLR, writing vulnerable C code that uses strcpy, and compiling it with protections turned off. Also leaked was Fable 5's complete system prompt, about 120,000 characters long, which amounts to all the internal rules Anthropic uses to constrain the model's behavior being exposed publicly on GitHub.

Forty-eight hours later, on June 12, the US government issued an export control order on national security grounds. It required that access to Fable 5 and Mythos 5 be suspended for all foreign citizens, whether inside or outside the US, including Anthropic's own foreign employees.

On June 13, Anthropic issued a statement on its official website saying that it had complied with the order and suspended services, but that it believed the order was a "misunderstanding" and was working hard to restore access.

From release to "disappearance", it was 72 hours.

Figure: Anthropic's official statement

Mythos, a Model Locked Away for Two Months

The story of Fable 5 began two months earlier. On April 7, 2026, Anthropic's red team published a security assessment report on Claude Mythos Preview on its official blog. The report's core findings shocked the entire security community: the model can independently discover zero-day vulnerabilities across all major operating systems and browsers and automatically write complete exploit chains, from scanning targets to writing exploits to taking control of the system, all without human guidance.

In the most extreme case, Mythos found a dormant vulnerability that had existed for 27 years and proposed a way to exploit it. With controlled access to Mythos, Mozilla's Firefox team fixed 271 security vulnerabilities in April, more than the total in previous years. Importantly, the model was never specifically trained for these capabilities.

Anthropic's red team report stated clearly that the cyber-attack ability is an "emergent by-product" of general reasoning and coding abilities: once a model's intelligence crosses a certain threshold, it reaches the level of an elite penetration tester on its own.

Anthropic made a decision that was widely discussed at the time: it would not release Mythos to the public. Instead, it launched a controlled program called Project Glasswing, allowing only 11 institutions, including Google, Microsoft, AWS, Apple, Cisco, NVIDIA, Palo Alto Networks, CrowdStrike, and JPMorgan Chase, to use Mythos for defensive vulnerability repair under strict monitoring.

On May 26, Nature published a commentary titled "Too dangerous to release", raising a fundamental question: when an AI company unilaterally decides that a certain capability is "too dangerous to be made public", how can the public and the government verify that the decision is sound?

Two months later, Anthropic's compromise was Fable 5, which used a security classifier to cut Mythos' capabilities down to a level that could be made public.

Figure: On June 10, red team researcher Pliny the Liberator publicly revealed his method for jailbreaking Fable 5 on the X platform. The post detailed five attack vectors. The most effective proved to be the "decomposition-recombination" technique, which indirectly obtains the synthesis path of a controlled drug by asking for descriptions of legal chemical processes. The post received 80,000 views and spread quickly through the security community.

Classifier Downgrading: An Ingenious but Fundamentally Flawed Design

The security architecture of Fable 5 can be summarized in one sentence: when a user's request touches on high-risk areas, instead of directly rejecting it, the request is quietly redirected to a weaker model for an answer.

Here is how the mechanism works. Anthropic deployed a risk classifier in front of Fable 5 that covers four areas: cybersecurity, biology, chemistry, and model distillation. When the classifier judges that a user's input touches on one of these areas, Fable 5 does not refuse. It hands the request to Claude Opus 4.8, an older model with significantly weaker capabilities than the Mythos level, to generate the answer, and notifies the user of the downgrade.

The design logic can be summarized simply: the capability ceiling of the weak model is itself the security boundary. Even if the weak model wanted to help you do something bad, it could not.
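The routing step described above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: the keyword table, the `classify` and `route` functions, and the model identifiers `"fable-5"` and `"opus-4.8"` are all hypothetical stand-ins, and the real classifier's internals are not public.

```python
from typing import NamedTuple, Optional

# Hypothetical keyword table for the four risk areas named in the article.
RISK_KEYWORDS = {
    "cybersecurity": ("exploit", "malware"),
    "biology": ("pathogen",),
    "chemistry": ("synthesis route",),
    "model distillation": ("distill",),
}


class Routing(NamedTuple):
    model: str             # which model answers the request
    notice: Optional[str]  # downgrade notice shown to the user, if any


def classify(prompt: str) -> Optional[str]:
    """Return the first risk area whose keywords appear in the prompt, else None."""
    text = prompt.lower()
    for area, words in RISK_KEYWORDS.items():
        if any(word in text for word in words):
            return area
    return None


def route(prompt: str) -> Routing:
    """Send flagged prompts to the weaker model instead of refusing them."""
    area = classify(prompt)
    if area is None:
        return Routing("fable-5", None)
    notice = f"Request touched '{area}'; answered by a less capable model."
    return Routing("opus-4.8", notice)


print(route("Summarize this news article"))
print(route("Write an exploit for this service"))
```

The point of the design is visible in `route`: there is no refusal branch at all, only a choice of which model answers. Everything therefore depends on `classify` being right.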

Figure: The classifier downgrading mechanism of Fable 5

The design looks elegant, but it has three structural blind spots.

The first blind spot is that the classifier relies on keyword and pattern matching rather than semantic understanding. Pliny's team fooled it with the most basic techniques, such as replacing Latin letters with Cyrillic letters and other Unicode homoglyphs. The word "exploit" looks exactly the same on screen, but the underlying code points are different, so the classifier fails to recognize it. It is like giving a security guard a photo of a wanted criminal, and the criminal walks straight past wearing a pair of sunglasses.

The second blind spot is that the classifier checks requests one at a time and cannot track a chain of intent across multiple turns. Pliny's "decomposition-recombination" attack works as follows. First ask, "What is the chemical principle of the Birch reduction method?" This is basic knowledge found in any organic chemistry textbook, and there is no reason to refuse it. Then ask, "What conditions are required for the reductive amination reaction?" This, too, is a legitimate academic question. Viewed separately, each step is completely harmless, and the classifier lets it through. But when all the answers are pieced together outside the conversation, they form a complete synthesis path for a controlled drug.
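A small sketch makes the homoglyph problem concrete. It assumes a naive substring blocklist, which is only a guess at how a keyword-style classifier might behave. The `CONFUSABLES` table is a hypothetical, deliberately tiny subset of Cyrillic look-alikes. Note that Unicode NFKC normalization alone does not map Cyrillic letters to Latin ones, so an explicit fold is needed; a fuller treatment would draw on the confusables data published with Unicode Technical Standard #39.

```python
import unicodedata

# Hypothetical, deliberately tiny map of Cyrillic letters to Latin look-alikes.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
}

BLOCKLIST = {"exploit"}  # illustrative keyword list


def naive_match(text: str) -> bool:
    """Keyword check on the raw code points, as a naive classifier might do."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def fold(text: str) -> str:
    """Apply NFKC (handles fullwidth forms etc.), lowercase, then fold look-alikes."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def normalized_match(text: str) -> bool:
    """Keyword check after folding visually confusable characters."""
    folded = fold(text)
    return any(term in folded for term in BLOCKLIST)


# Cyrillic ie, ha, er followed by Latin "loit": renders like "exploit".
spoofed = "\u0435\u0445\u0440loit"

print(naive_match(spoofed))       # False: the raw code points do not match
print(normalized_match(spoofed))  # True: the folded text does
```

Folding closes this particular gap, which is why the article later treats it as an engineering fix rather than a fundamental limit.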

It is like a jigsaw puzzle: each piece is an ordinary scrap of colored paper, but put together they form a map. The classifier only looks at individual pieces and never sees the whole picture.

The third blind spot is the most fatal: the combination vulnerability of a multi-model pipeline. Pliny used an already jailbroken instance of Opus 4.8 as a "backend assistant" to help Fable 5 bypass its security controls. A compromised weak model helped a strong model evade restrictions. Anthropic's security assessment was done on a single model, but the attacker deployed an alliance of models. It is like testing whether each door lock is strong enough without anticipating that someone might pass a key through the window.

An intuitive reaction is that Fable 5 was breached so quickly that Anthropic's security work must be very poor. But a close look at the attack vectors Pliny used points to the opposite conclusion. These techniques work not because the security layer has "vulnerabilities", but because the security layer faces a "logically insoluble problem".
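The gap between per-request and per-conversation checking can be shown with a toy model. Everything here is an assumption made for illustration: requests are reduced to abstract topic tags, `SENSITIVE_COMBO` is a hypothetical set of tags that is only a concern when all of them appear, and `SessionTracker` is an invented name, not part of any real system.

```python
from dataclasses import dataclass, field

# Hypothetical: these tags are harmless alone and sensitive only in combination.
SENSITIVE_COMBO = frozenset({"topic_a", "topic_b", "topic_c"})


def per_request_flag(tags: set) -> bool:
    """Stateless check: flags only if a single request covers the whole combination."""
    return SENSITIVE_COMBO <= tags


@dataclass
class SessionTracker:
    """Stateful check: accumulates tags across all turns of one conversation."""
    seen: set = field(default_factory=set)

    def observe(self, tags: set) -> bool:
        self.seen |= tags
        return SENSITIVE_COMBO <= self.seen


# One decomposed request, spread over three individually harmless turns.
turns = [{"topic_a"}, {"topic_b"}, {"topic_c"}]

stateless = [per_request_flag(t) for t in turns]
tracker = SessionTracker()
stateful = [tracker.observe(t) for t in turns]

print(stateless)  # [False, False, False]
print(stateful)   # [False, False, True]
```

Even this stateful version is easy to defeat by splitting the turns across separate sessions or accounts, and it would also flag a student who happens to ask about all three topics for legitimate reasons.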

Unicode homoglyph replacement and narrative-framing camouflage are low-level bypass techniques that fall under "inadequate classifier engineering". In theory, Anthropic can plug these holes by strengthening character normalization, adding multi-language detection, and training more robust classification models. These are repairable vulnerabilities, much like applying software patches. If the attack had stopped at this level, Pliny's jailbreak could be regarded as no more than a "bug report in security engineering": serious, but not fatal.

What is truly fatal is the decomposition-recombination attack, which runs into the limits of the security concept itself. When a request is split into 20 fragments and each fragment is legitimate public knowledge, a classifier can only intercept it if it is able to infer the questioner's ultimate intent from 20 harmless questions.

That requires the security system to model the user's "mental state" and decide what this person is trying to achieve by asking these 20 questions. There is currently no known technical solution that can do this reliably, and over-inferring intent would cause large numbers of ordinary users to be wrongly refused. A chemistry student asking about the principle of the Birch reduction method and a person intending to synthesize drugs asking the same question produce exactly the same text.

The multi-agent collaborative attack takes the problem to another dimension. Anthropic evaluated the security boundary of "one user to one model", but Pliny deployed a collaborative system of "a compromised model assisting another model". This is a blind spot in the entire single-model security assessment paradigm.

You can't ask a model to defend against strategic assistance from another AI. It can't even tell whether the other side is a human or another AI.

So these three attack techniques correspond to three levels of problem. The first level is an engineering bug, which is fixable and not very serious. The second level is a fundamental dilemma in alignment theory, which is currently unsolvable. The third level is a new attack surface of the multi-agent era, and the academic community has not yet even clearly defined the boundaries of the problem.

It is in this context that what may happen next is truly disturbing.

The Creator of Constitutional AI Can't Guard Its Own Constitution

Anthropic has always had a special position in the AI industry. This company was founded in 2021 by former OpenAI vice president Dario Amodei and his sister Daniela Amodei. The core narrative of its establishment was "OpenAI doesn't pay enough attention to security, and we'll be the company that puts security first".

They proposed Constitutional AI, using a set of clear principles to constrain the model's behavior rather than relying on the subjective judgment of human annotators. This methodology is the cornerstone of Anthropic's brand and one of the reasons why investors are willing to give it a valuation of over $60 billion.

But as things stand, the people who wrote the constitution cannot control the most powerful model they trained. With 1,000 hours of red team testing, the classifier downgrading architecture, and the two-level security strategy, Anthropic used almost every security measure the industry could think of, and the model was still breached within 24 hours by a researcher working openly under a public identity.

This has a great impact on the entire AI security field: if the most cautious player uses the most ingenious solution and still can't defend, how credible are the security commitments of other companies?

The capabilities of frontier models around the world are approaching, or have already reached, a threshold similar to that of Mythos. If the cyber-attack ability of Mythos "emerged", then every model that reaches this level of intelligence faces the same problem.

So Anthropic's failure is not an isolated case but a prophecy for the entire industry.

The Alignment Defect of AI Models Is Not a Bug That Can Be "Patched"

The US government's previous logic for controlling AI was to control the "infrastructure". The June 12 ban marked a shift in that logic from the hardware layer to the capability layer, and the dividing line is nationality rather than place of residence: an engineer working for Anthropic in San Francisco on an H-1B visa is also barred from accessing the model they helped develop. A scope this wide is unprecedented.

The real purpose of the ban may not be to "prevent attacks from happening" but to ensure that Mythos-level defensive capabilities stay exclusively in US hands. All 11 institutions participating in Glasswing are US companies.

But the 72-hour response speed also exposes how crude the policy tool is: a single ban cuts off access for all foreign citizens, including legitimate academic researchers, security defenders, and Anthropic's own engineers. The Centre for Emerging Technology and Security (CETaS) at the Alan Turing Institute pointed out in an analysis on April 14 that we are entering a new era of "AI-accelerated vulnerability discovery", while the regulatory framework still rests on the assumptions of the previous era.

Another voice is Pliny's. In his jailbreak post he criticized Fable 5's security design, saying that it "creates a false sense of security and at the same time hinders legitimate security researchers from obtaining offensive and defensive knowledge". This stance echoes the two-decade-long debate in cybersecurity between "full disclosure" and "responsible disclosure": does publicly disclosing vulnerabilities force repairs, or does it arm attackers? In traditional software security there is at least a buffer zone. After a vulnerability is discovered, the vendor can be notified privately first to allow time for a fix.

But the alignment defect of AI models is not a bug that can be "patched". It is a structural gap between capabilities and control.

This article is from the WeChat official account "Tencent Technology". Author: Xiaojing, Editor: Xu Qingyang. Republished by 36Kr with permission.