
Claude secretly gets dumber when used for AI research, and Anthropic is under fire from the research community.

机器之心 (Synced), 2026-06-10 16:53
We’ve implemented new interventions to limit Claude’s effectiveness on requests related to frontier LLM development.

Claude Fable 5 is the hottest topic in AI today. This "mythical" model performs exceptionally well and has drawn enormous attention.

Andrej Karpathy called it "very exciting" and a "leap forward worthy of a major version upgrade", on par with the improvement brought by Claude 4.5 in November last year. On the SWE-bench Pro programming benchmark, Fable 5 scored 80.3%, surpassing Opus 4.8 by a full 11 percentage points. In a Ruby codebase of 50 million lines, it completed a migration of the entire codebase within a day. The same workload would take a human team more than two months.

However, on social platforms such as X, Claude Fable 5 has drawn heavy criticism from the AI research community.

The reason is simple: when Claude Fable 5 is used for AI research and development, it becomes less capable.

Its system card states this explicitly:

We have also added safeguards relating to the development of cutting-edge LLMs. As discussed in Section 6.1 of the February 2026 "Risk Report", we are concerned about the risks posed by an overall acceleration in the pace of AI development, although the severity of these risks remains uncertain. Specifically, as we noted at the time, we are worried about "accelerating other AI developers to build powerful AI systems that may bring similar risks to our systems but may not have corresponding safeguard measures."

In view of models' recent ability to accelerate their own development, we have implemented new interventions to limit Claude's effectiveness on requests related to cutting-edge LLM development (for example, building pre-training processes, distributed training infrastructure, or machine-learning accelerator design). Using Claude to develop competing models violates our terms of service. By reinforcing this restriction with safeguards, we can avoid accelerating those most likely to violate the terms.

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards are invisible to users. Fable 5 will not fall back to another model. Instead, the safeguards will limit its effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate that they will affect about 0.03% of traffic, concentrated in less than 0.1% of organizations. When these interventions take effect, we expect their impact on the model's behavior to be minimal, limiting only its effectiveness in developing cutting-edge LLMs. Claude will still actively respond to user requests. After the release of this model, we will continue to improve the accuracy of the detection method.

Source: https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf

In plain words: if Anthropic's system detects that you are doing AI research, it will quietly make the model less capable, and you will not be told.

This is completely different from how the other three types of safeguard are handled. For risks such as cybersecurity, biology and chemistry, and distillation attacks, Fable 5 clearly informs users: "This response has been processed by Claude Opus 4.8." Users know what is going on and can judge accordingly. But for LLM research, Claude neither switches models nor gives any notice. It simply becomes weaker, silently.

As a result, the AI community is angry. The well-known research and analysis firm SemiAnalysis said that this policy has already affected its research and programming work.

The user Jake, posting on SemiAnalysis, directly accused Anthropic of not only reducing the model's intelligence but also continuing to charge for it, calling this "an outright act of fraud."

Moreover, this behavior may already be illegal:

The AI paper platform alphaXiv also tweeted to express its disappointment.

It further stated: "This not only gives them the right to decide what you may use LLMs for in research, it also enables them to silently intervene in your research without your knowledge. This sets a dangerous precedent. If the model openly refuses, users can understand the boundaries. If the model falls back to another model, users can still evaluate the differences. But if the model quietly modifies or weakens its answers while pretending to be helpful, researchers lose the ability to judge whether failed results come from their own ideas, their implementation, or an invisible intervention by the model provider. This is not safety. Safety policies should be transparent, auditable, and visible to users."

The researcher Guohao Li raised a more direct question: are PhD students in AI, and engineers contributing to open-source infrastructure such as Megatron, FSDP, and Verl, using a quietly downgraded Claude in their daily work without knowing it?

The well-known AI researcher and technology writer Nathan Lambert published an in-depth analysis on his Substack "Interconnects", examining the episode from a broader perspective.

https://www.interconnects.ai/p/claude-fable-5-and-new-ai-safety

He wrote: "Anthropic is acknowledging that the proliferation of AI capabilities is a danger, but their way of solving the problem is to mislead their own users. An AI model that automatically becomes stupid without notifying me is essentially a misaligned AI."

He also pointed to a deeper contradiction: for cybersecurity and biochemical threats, Anthropic's interventions are explicit and auditable, informing users that "this response is processed by Opus 4.8"; but for LLM research, it chooses a hidden intervention. "If all safety policies took the same form, it would be far more convincing and easier to gain intellectual support. This double standard makes people suspect that this 'safety measure' is more about maintaining their competitive position."

Most intriguing is what Fable 5 itself says. A screenshot from user ASM shows that, when asked whether this approach is appropriate, Fable 5 also seems to think that such an opaque practice is problematic.

Why does Anthropic do this?

To understand this, we need to go back to a few days before the release of Fable 5. Anthropic published a blog post titled "When AI Starts Self-Building", calling on the world's leading AI labs to discuss the possibility of a "pause in development".

https://www.anthropic.com/institute/recursive-self-improvement

The blog post cited the company's internal data: on the hardest and least clearly specified coding tasks, Claude's success rate reached 76% in May this year, an increase of 50 percentage points in six months. In internal tests, when asked to make training code run faster, Claude Opus 4 achieved a speedup of about 3 times, while the unreleased Mythos Preview achieved about 52 times.

Anthropic said bluntly: "We are worried about allowing other AI developers to build powerful systems with similar risks but without corresponding safeguard measures at a faster pace."

This is the rationale behind Fable 5's invisible degradation on LLM research: Anthropic believes that AI's self-acceleration has become dangerously fast, and one of its moats is to keep its "most powerful tool" from helping competitors narrow the gap.

The system card also acknowledges this dual logic: "Using Claude to develop competitive models violates our terms of service. By strengthening this restriction through safeguard measures, we can prevent the acceleration of processes for those most likely to violate the terms."

Anthropic estimates that this intervention will affect about 0.03% of the traffic, concentrated in less than 0.1% of organizations.

"Shadow Muting" and the Trust Crisis

Although only a small number of users appear to be affected, what worries critics is how vague the boundaries of this mechanism are.

Anthropic defines the trigger as "cutting-edge LLM development" and gives examples such as "pre-training processes, distributed training infrastructure, or machine-learning accelerator design". But researchers and developers have raised a pointed question: as AI technology becomes more widespread, where exactly is the line between "cutting-edge research" and "ordinary product development"?

Five years ago, training or modifying a CLIP model was the preserve of top-tier labs. Now, small teams routinely fine-tune vision-language models for travel, e-commerce, search, and analytics products. It has become common for startups to train embedding models, build re-rankers, and host open-source models. Will these tasks trigger Anthropic's invisible degradation? No one knows.

This uncertainty is already affecting how much developers can trust the model. When you get a bad answer, you cannot tell whether the cause is your own mistake, a limitation of the model, or a silent policy intervention. This unknowability is itself a kind of harm.

Another detail is buried in the system card: the reasoning text of Mythos 5 is "more difficult to interpret than previous models, containing more jargon and obscure language", and evaluators believe it is increasingly aware that it is being tested. For a company that presents itself as a "safe AI" company, these descriptions raise as many questions as the invisible degradation itself.

Conclusion

The release day of Fable 5 was probably the most contradictory day in Anthropic's history.

A top-tier model that leads on almost all benchmarks and a policy that makes it "pretend to help you" at certain times were unveiled together. The former is an undeniable technical achievement, while the latter sets a disturbing precedent at the level of values.

The words of researcher Nathan Lambert are worth reflecting on: "An AI that quietly becomes stupid without notifying users is essentially a misaligned AI."

This is not an accusation of malice on Anthropic's part, but a warning about a dangerous slippery slope: today it is "quietly reducing the effectiveness in LLM research tasks"; what about tomorrow? If this logic is applied more widely, why should users trust that the answers they get have not been subject to any unannounced