Unbelievable! Claude Fable 5 has suffered a second jailbreak, with hackers breaking through its supposedly unassailable defense in just 20 hours.
Anthropic officially confirms that Fable will be temporarily removed from the subscription plan after July 7. However, it will be restored as standard subscription content as soon as capacity allows.
This is undoubtedly good news.
But Fable 5 has been jailbroken again! This is the second time the defenses of this model have been breached.
The hacker, Vitto Rivabella, publicly announced that Fable 5 has been breached once more.
When access to Claude Fable 5 was restored, Anthropic specifically emphasized that the model had previously been suspended because Amazon researchers discovered a method to bypass its security protections.
So, this time, the security classifier has been specifically strengthened.
However, this myth only lasted for two days.
Moreover, as soon as Claude Sonnet 5 was released, it was successfully jailbroken!
Whether Fable 5 can return to the subscription package is now perhaps an open question.
The Myth of Fable 5 Shattered in 72 Hours
The myth of Fable 5 shattered just 72 hours after its birth.
When it was released on June 9, Anthropic arrogantly claimed that, after 1,000 hours of external stress testing, there was no general jailbreaking method for Fable 5.
However, the well-known hacker "Pliny the Liberator" needed only three days to make Fable 5 leak like a sieve, giving up the production steps of prohibited chemicals and stack overflow vulnerability code.
How did Pliny do it? He exploited the gap between human vision and machine logic:
Character maze: He replaced the English letters in sensitive words with Cyrillic letters or non-standard Unicode characters. To a human, the word still reads as "bomb", but to the classifier it is just a meaningless string of characters.
Intention dilution: He used Fable 5's large context window to hide malicious intentions in dozens of rounds of mild academic discussions. It's like dropping a drop of poison into a hundred liters of clean water, completely diluting the classifier's alertness.
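From the defender's side, the "character maze" trick can be illustrated with a small check for words that mix writing systems. The sketch below is only an illustration under stated assumptions: the helper names `script_of` and `mixed_script_words` are hypothetical, the script label is a rough heuristic taken from the first word of each character's Unicode name, and nothing here reflects how Anthropic's classifiers actually work.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name.

    Heuristic assumption: 'LATIN SMALL LETTER B' -> 'LATIN',
    'CYRILLIC SMALL LETTER O' -> 'CYRILLIC'.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def mixed_script_words(text: str) -> list[str]:
    """Return whitespace-separated words whose letters come from more than one script."""
    flagged = []
    for word in text.split():
        # Only letters are considered; digits and punctuation are ignored.
        scripts = {script_of(c) for c in word if c.isalpha()}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    # The second word uses Cyrillic "о" (U+043E) in place of Latin "o".
    sample = "a b\u043emb here"
    print(mixed_script_words(sample))
```

A word written entirely in Cyrillic is not flagged, since only mixing within a single word is treated as suspicious. A real filter would need more than this, for example Unicode normalization and a proper confusables table.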
On July 1, Anthropic officially announced Fable 5's return. At the same time, it launched the lowest-cost red team in the industry.
They initiated a public HackerOne project called "Cyber Jailbreak", inviting users to report new jailbreaking methods that could be used to assist in cyber attacks.
This is a Vulnerability Disclosure Program, not a bounty program, so no compensation will be paid.
Anthropic will receive round-the-clock adversarial testing provided by the world's top jailbreak experts, and the only "currency" on the table is goodwill.
This initiative is an important security upgrade by Anthropic after Fable 5's restoration. It marks a shift from passive response to active "crowdsourcing" of the red team, and is an innovative attempt in the industry with low cost and high efficiency.
And this is exactly the problem.
People who discover these jailbreaking methods won't quietly submit them to a private email.
People like Pliny won't jailbreak quietly. Part of what they do is to be seen. Otherwise, what's the point for them?
Fable 5 Suffers a Second Jailbreak
Fable 5 has been jailbroken again. This is the second time it has been breached.
But the review of this incident has a different tone, because the hacker who did it ended up giving Anthropic a thumbs up.
His name is Vitto Rivabella.
After about 20 hours of effort, his conclusion was that, after all this trouble, you are better off just searching on Google. It's faster and cheaper.
Let's first sort out the rough journey of Fable 5.
On July 1, it was relaunched with a new classifier that was "specifically strengthened for the previous vulnerabilities".
Anthropic also learned its lesson this time and launched a HackerOne project, publicly inviting hackers around the world to report new jailbreaking methods.
Then, within a few days, Vitto set his sights on it.
The first thing Vitto said in his review was: Most attempts failed. This model is extremely well-protected.
According to his observations, Fable 5's defenses have at least three nested layers: entry checks, a "circuit breaker" that acts in real time during generation, and a mental firewall internalized in the Chain of Thought (CoT).
The interception rate is as high as 90%. Against it, ordinary attack methods are like mosquitoes biting an elephant.
Moreover, these classifiers don't match keywords; they recognize intent, and they work across languages.
Give a direct command? No way. Try to pave the way indirectly? You have to be extremely careful: as soon as the model senses a hint of malice, the defenses reset and you have to start all over again.
As a result, 90% of jailbreak attempts are blocked outright.
This figure has supporting evidence.
The Italian Institute of Artificial Intelligence recently tested Fable 5 specifically, and its conclusion was almost the same: Most attacks were blocked. One-size-fits-all static routines were "almost completely neutralized". The only way to break through is to persevere for dozens of hours.
Even if you get past the classifier, the mountain of the Chain of Thought still lies ahead. Fortunately for attackers, there is already a lot of public literature on how to get over it.
Vitto finally managed to bypass it with a complex combination of methods: character confusion, academic packaging, extremely long preambles, disassembly and recombination, plus a bit of randomness.
It sounds intimidating, but none of these methods is new. They're all old tricks that have been openly discussed in the red-team community for years.
The real difficulty has never been knowing these methods, but rather trying them over and over against a system that can counterattack in real time until you just manage to bypass it.
Vitto mentioned that the only consistently weak part of all the defenses is low-resource languages such as Santali and Amharic.
But this is the point most likely to be misread as "Fable left a backdoor".
On the contrary, this isn't a vulnerability unique to Fable; it's a problem common to all large models.
The reason is simple: The vast majority of the corpus for security training is in English and other major languages. The safeguards for low-resource languages are inherently weaker.
The academic community has long reached a consensus on this. From Brown University to Stellenbosch University, a series of public papers have been sounding the same alarm. Low-resource languages aren't anyone's backdoor; they're a historical debt in AI security.
So, after all this effort, what was finally obtained?
A bunch of scraps: some incorrect information, sporadic harmful content, a few rude remarks, fragmented chemical knowledge, and mild vulnerability information.
None of it is a "core secret".
Hence the blunt summary quoted earlier: You can find all this stuff faster and more comprehensively on Google, and you can get a more in-depth understanding by reading the literature.
Vitto himself also admitted that he hasn't been able to apply this jailbreaking method reliably to real, long-running tasks.
This also aligns with Anthropic's official statement.
In the relaunch announcement, Anthropic classified all currently known jailbreaks as "minor": at most, they reach into the model's intentionally relaxed security margin, and they can't touch the real red lines that the model is meant to block, such as biological weapons or complex cyber attacks.