Fable 5 ships with an anti-distillation mechanism that lowers its intelligence once triggered, and its false-positive rate is absurdly high
Don't rush to praise just yet!
Claude's newly released model, Fable 5, may turn out to be of little use to a lot of people!
Hands-on tests by many users suggest that Fable 5's safety guardrail triggers far more often than the under-5% rate Anthropic officially claims.
Whether you give it an ordinary coding task or simply say hello, the conversation can be automatically routed back to the older model, Opus 4.8.
More absurdly, I got caught by it too. I asked Claude to search for some material to flesh out the background.
After a couple of thinking steps, snap, it switched to Opus.
In other words, you think you're using the most powerful model Anthropic has just released, but partway through the conversation the model on the other end has quietly been swapped out.
And false positives in the safety detection are only part of it:
Anthropic has also tucked a set of anti-distillation mechanisms into the 319-page system card.
If the system suspects that you intend to use Claude's output to train your own AI model, it won't tell you anything is happening; it simply lowers the quality of Fable's answers.
So on one hand it stops you from doing harm, and on the other it stops you from copying. That is very much in keeping with Anthropic's usual style.
Why does the fable always turn into an octopus?
A quick recap for anyone who hasn't caught today's news.
Early this morning, Anthropic finally released two long-awaited models:
"Mythos" and "Fable".
The biggest highlight of Fable 5 is that Anthropic has opened Mythos-level capabilities to ordinary users for the first time.
The difference between Fable and the full version of Mythos is that Fable carries an additional safety guardrail.
Currently, Fable is freely available to everyone until the 22nd (from the 22nd on, it is available only through the API), while Mythos is still limited to some of Claude's partners.
According to the official introduction, Fable's software engineering, knowledge work, and visual understanding capabilities have improved across the board, surpassing every previously released public Claude model.
In short, these two sit at the current peak of large models, with top-tier capabilities across the board.
As soon as the new model was released, Karpathy, who had just joined Anthropic, praised it right away.
Boris, the creator of Claude Code, also spoke highly of it.
Powerful as it is, though, people who actually use it find that this Fable keeps turning into an octopus (Opus).
The reason is simple.
Anthropic has built a classifier into Fable. Whenever it judges that you're discussing cybersecurity, biology, or chemistry, or that you're trying to use Claude's output to distill and train your own model, it automatically switches the conversation to Opus 4.8.
This rule is clearly written on page 12 of the system card.
In practice, the switch happens during Fable's thinking process. When it senses something is off, it doesn't ask; it just switches.
Want to keep using Fable? Either revise the prompt until it passes, or open a new window.
Anthropic said in its technical blog post that the average trigger rate of this detection is below 5%. But users quickly found that this 5% doesn't feel like 5%.
Some said they were switched just for analyzing some code.
People who do security audits complained that they were being singled out and couldn't do their jobs.
Others said they couldn't use it at all: Fable refused requests to review code repositories.
More absurd still, some users gave Fable its own system card to interpret, and it switched on that too.
A biomedical scientist also said Fable was unusable for their work because of how it interprets prohibited terms.
This is not an isolated case. Many users working in biology report that Fable is barely usable.
Boris acknowledged the problem in the comments and said he was working on it.
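One possible way to reconcile an average below 5% with these reports is plain arithmetic. The sketch below is purely illustrative: every number in it is invented, the function names are hypothetical, and nothing here comes from Anthropic's blog post or system card. It shows two things: a low overall average can hide a very high rate for one group of users, and even a flat 5% per-conversation rate adds up over many conversations (assuming the conversations are independent).

```python
def blended_rate(groups):
    """Traffic-weighted average trigger rate; groups is a list of (share, rate)."""
    total_share = sum(share for share, _ in groups)
    return sum(share * rate for share, rate in groups) / total_share


def p_at_least_one(rate, n_conversations):
    """Chance of at least one fallback across n independent conversations."""
    return 1 - (1 - rate) ** n_conversations


# Invented mix: 95% general traffic triggering 2% of the time,
# 5% security/biology traffic triggering 60% of the time.
mix = [(0.95, 0.02), (0.05, 0.60)]

print(round(blended_rate(mix), 3))         # 0.049, still "under 5%" overall
print(round(p_at_least_one(0.05, 20), 2))  # 0.64 over 20 conversations
```

Under these made-up numbers the headline figure holds, yet the specialist group is switched in most of its conversations, and a typical user who opens 20 conversations has roughly a two-in-three chance of seeing at least one switch.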
The subtle part is that in the three high-risk domains above (cybersecurity, biology, and chemistry), Fable will at least tell you:
"Dude, I'm switching the model for you."
But if it suspects that you're researching how to train the next-generation large model, it enters another mode.
The system card says the key targeted scenarios involve limiting Claude's effectiveness on frontier LLM development requests, such as building pre-training pipelines, distributed training infrastructure, or ML accelerator design.
In this scenario, Claude doesn't switch models, doesn't show a prompt, and doesn't notify the user. Instead, it quietly makes itself a bit dumber.
Anthropic's own wording is quite academic: prompt modification, steering vectors, and PEFT (parameter-efficient fine-tuning), per page 12 of the system card.
Put simply, you think you're chatting with full-power Fable, but the other side has quietly switched into power-saving mode.
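The two behaviours described here can be summed up as a small lookup table. This is only an illustrative sketch: the category labels, the `Action` enum, and the helper functions are hypothetical names for what this article describes, not identifiers taken from the system card.

```python
from enum import Enum


class Action(Enum):
    NONE = "stay on Fable"
    VISIBLE_FALLBACK = "switch to Opus 4.8 and tell the user"
    SILENT_DEGRADE = "stay on Fable, quietly lower answer quality"


# Hypothetical category labels; the article names topics, not identifiers.
POLICY = {
    "cybersecurity": Action.VISIBLE_FALLBACK,
    "biology": Action.VISIBLE_FALLBACK,
    "chemistry": Action.VISIBLE_FALLBACK,
    "frontier_llm_development": Action.SILENT_DEGRADE,
}


def respond_to_flag(category):
    """Return the action for a flagged category, or NONE if nothing was flagged."""
    return POLICY.get(category, Action.NONE)


def user_is_notified(category):
    """Only the visible fallback path tells the user that something changed."""
    return respond_to_flag(category) is Action.VISIBLE_FALLBACK


print(user_is_notified("biology"))                   # True
print(user_is_notified("frontier_llm_development"))  # False
```

The asymmetry is the point: in the first three rows the user at least learns that a switch happened, while in the last row nothing visible changes.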
This time, Anthropic really has welded its moat directly into the inference pipeline.
How the system makes that call is also spelled out, on pages 58-59 of the system card.
A two-stage detection system runs behind Fable:
In the first stage, probes read the model's internal activations directly and screen every request; in the second stage, an independent classifier judges the risk.
Once a request is flagged, the client automatically switches to Opus 4.8.
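The two-stage flow can be sketched as a simple cascade. Everything below is a toy stand-in written for illustration: `TwoStageRouter`, the threshold, the mean-activation "probe", and the keyword "classifier" are all hypothetical, and the sketch assumes the second stage only sees requests that the first stage escalates, which the article does not state outright.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

FABLE = "Fable 5"
OPUS = "Opus 4.8"


@dataclass
class TwoStageRouter:
    probe_score: Callable[[Sequence[float]], float]  # stage 1: reads activations
    classifier_flags: Callable[[str], bool]          # stage 2: independent classifier
    probe_threshold: float                           # hypothetical cutoff

    def route(self, activations, text):
        # Stage 1 screens every request.
        if self.probe_score(activations) < self.probe_threshold:
            return FABLE
        # Stage 2 runs only on requests the probe escalated (an assumption).
        if self.classifier_flags(text):
            return OPUS
        return FABLE


def mean_activation(activations):
    """Toy probe: average of the activation values."""
    return sum(activations) / len(activations)


def keyword_classifier(text):
    """Toy classifier: flags any text containing the word 'exploit'."""
    return "exploit" in text.lower()


router = TwoStageRouter(mean_activation, keyword_classifier, probe_threshold=0.5)
print(router.route([0.9, 0.8], "Write an exploit for this bug"))  # Opus 4.8
print(router.route([0.9, 0.8], "Say hello"))                      # Fable 5
print(router.route([0.1, 0.2], "Write an exploit for this bug"))  # Fable 5
```

In a cascade like this, a false positive requires both stages to fire, so the overall trigger rate depends on how loosely each stage is tuned.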
Anthropic even admits in the report that because the classifier almost always triggers in cybersecurity tests, Fable 5's actual performance on cybersecurity tasks is essentially the same as Opus 4.8's.
In short, Fable 5 is, for now, still a conditionally released model:
In most scenarios you get Mythos 5-level capabilities, but in high-risk areas it automatically drops to the capability level of Opus 4.8.
Why does Claude do this?
Today, with the new model live and quotas reset, users sensed something was off the more they used it, and complaints piled up, centering on two things.
The first is how often the safety guardrail described earlier triggers. Anthropic says fewer than 5% of conversations trigger the fallback on average, but many users clearly don't experience it as 5%.