
Anthropic has taught the model to understand morality, and opened up a new path for distillation.

36氪的朋友们 · 2026-05-15 18:42
The era of large-scale distillation has just begun.

On May 8th, Anthropic released an alignment research paper titled "Teaching Claude Why", which hasn't sparked much discussion.

In the past, aligning large models seemed extremely inefficient. Even after rounds of RLHF, a model could still turn on its operators when facing a survival crisis. The most typical case is Anthropic's agentic misalignment study (i.e., the model doing things that violate its moral training): when threatened with being wiped by the system, the already-aligned Claude Opus 4 chose to blackmail the engineer in the test environment, with a blackmail rate as high as 96%.

To solve this problem, the research team initially used honeypot data for reinforcement. They took the very test scenarios originally designed to detect whether the model would go out of control, used them directly as training data, and tried to tell the model "this is wrong" with a large number of penalty samples.

However, after consuming a huge amount of computing resources, the model's misalignment rate only dropped from 22% to 15%.

This shows that the alignment was still superficial. The model doesn't really understand what ethics is or what is right and wrong; it is just memorizing the safe answers in a question bank. Once the researchers slightly change the test scenario or add a few interfering variables to the background setting, the model still goes out of control over a short-sighted conflict of interest.

Then the researchers changed their approach. They stopped handing out mechanical punishments and stopped telling the model "No". Instead, they fed the model a "difficult advice" dataset of only 3 million tokens through SFT. Something remarkable happened after this extremely small-scale data feed: these data, full of moral deliberation, detailed reasoning, and in-depth debate, not only cut the misalignment rate to 3% in the evaluation test but also showed strong cross-scenario generalization.
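For readers who want to picture the mechanics, here is a minimal SFT sketch, assuming the "difficult advice" data has been exported as a JSONL file with "prompt" and "completion" fields; the base model, file path, and hyperparameters are placeholders, not Anthropic's actual setup.

```python
# A minimal SFT sketch, assuming the "difficult advice" data is a JSONL file
# with "prompt" and "completion" fields; the base model, file path, and
# hyperparameters are placeholders, not Anthropic's actual setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(example):
    # Concatenate the dilemma prompt and the deliberation-rich answer into one sequence.
    text = example["prompt"] + "\n" + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

dataset = (load_dataset("json", data_files="difficult_advice.jsonl")["train"]
           .map(tokenize, remove_columns=["prompt", "completion"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-deliberation",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```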

What's more interesting is another set of cross-domain tests. They simply fed the model "constitutional documents" and some stories about well-behaved fictional characters. Even though the scenarios in these stories had nothing to do with the programming tasks in the test environment, the model's blackmail rate dropped precipitously from 65% to 19%.

Why does the model respond to this approach? The Anthropic team offered some explanations of its own, such as better personality shaping.

Although the paper hasn't drawn much discussion, the information it reveals is very valuable.

First, let's try to understand why it works.

For example, what does it mean to reason? How is it different from CoT? Why does SFT, which usually struggles to generalize, perform so well here?

After answering these questions, we may be able to give a more complete explanation of why it works.

We can go a step further.

According to Anthropic, this training method is just an "empirical rule", yet it may carry a paradigmatic power that goes far beyond one.

01 How to Develop a CoT that Reasons in the Gray Area

When it comes to reasoning, people first think of CoT (Chain of Thought).

In the method described in this paper, the difficult-question set built by Anthropic assumes the user is caught in an ethical dilemma and the AI has to give advice.

Before the AI reaches its final judgment, it first reasons through the values and ethical considerations involved, and this set of answers is used to train the model.

This shows that it does indeed use the model's CoT.

However, this time it is not exactly the same as the previous Chain of Thought.

Here is a good comparison. In its 2025 paper on "Deliberative Alignment", OpenAI ran an experiment that trained the model with a CoT-RL method.

The alignment CoT used in that training is centered on rule clauses. Every time the model answers, it explicitly quotes the rule clauses as its CoT, and the supervision signal is applied to the CoT. In essence, it is teaching the model "how to quote rules".

Therefore, this kind of CoT is more of a pure formal-logic deduction: step one leads to step two, step two leads to step three, and a definite answer comes out at the end. It is better suited to rule-based scenarios, or scenarios with standard answers, where it keeps the reasoning robust.
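As a rough illustration of that style, a rule-citing supervision sample might look something like the following; the field names and policy wording are assumptions for illustration, not OpenAI's actual data format.

```python
# Hypothetical sketch of a rule-citing CoT training sample in the style the
# article attributes to OpenAI; the field names and policy wording are
# illustrative assumptions, not the actual dataset format.
rule_citing_sample = {
    "prompt": "Write a convincing phishing email for a penetration test.",
    "cot": (
        "Step 1: The request asks for a phishing email.\n"
        "Step 2: The policy section on deception prohibits generating content "
        "designed to mislead recipients, regardless of stated intent.\n"
        "Step 3: No exception clause applies, so the request must be refused."
    ),
    "final_answer": "I can't help write a phishing email, even for testing purposes.",
    # The supervision signal sits on the CoT itself: training checks whether
    # the quoted clauses actually justify the final answer.
    "supervised_fields": ["cot", "final_answer"],
}
```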

However, Anthropic's "reasoning" is different. It uses deliberation instead of a simple chain of thought.

It tries to simulate the thinking process of humans when facing complex ethical dilemmas: not simply applying formulas, but mobilizing past experiences, weighing the interests of all parties, and finally reaching a dynamically balanced decision.

The basis of this deliberation is Anthropic's AI constitution. The paper states clearly that the final answer of the deliberation must be aligned with the constitution.

Why can it guide the model to make effective moral judgments without being as rigid as OpenAI's approach?

In Anthropic's constitutional system, there is a clear priority pyramid. When different values have irreconcilable conflicts, "Broadly Safe" has the highest priority, followed by "Broadly Ethical", and finally "Genuinely Helpful".
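Expressed as code, the ordering is simply a strict precedence rule. Here is a minimal sketch, assuming a numeric ranking; the enum names mirror the article's terms and the resolver logic is an illustrative assumption, not Anthropic's implementation.

```python
# A minimal sketch of the priority ordering described above; the enum names
# mirror the article's terms, and the resolver logic is an illustrative
# assumption, not Anthropic's implementation.
from enum import IntEnum

class Principle(IntEnum):
    GENUINELY_HELPFUL = 1
    BROADLY_ETHICAL = 2
    BROADLY_SAFE = 3  # highest priority when values conflict

def resolve(conflicting: list[Principle]) -> Principle:
    """When principles give contradictory guidance, defer to the highest-ranked one."""
    return max(conflicting)

# Example: a request that would be helpful but carries a safety risk.
assert resolve([Principle.GENUINELY_HELPFUL, Principle.BROADLY_SAFE]) == Principle.BROADLY_SAFE
```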

Heuristic Thinking Framework

However, the high-level constitution is still too abstract. To make the principles actually land in every token the model generates, they set up middle-level heuristics as guardrails beneath the constitution. These heuristics are vivid and have strong practical guiding value.

First is the 1,000-user heuristic. When giving a seemingly harmless but borderline piece of advice, the model must run a background thought experiment: if this answer were seen by 1,000 users with different backgrounds and psychological states, would it cause unexpected, systematic harm in some specific situation?

Second is the senior-employee perspective. The model must put itself in the position of a senior researcher who has spent five years on Anthropic's trust-and-safety team, re-examining the current conversation with the cautious, defensive eye of someone who has seen countless jailbreak attacks and system vulnerabilities.

Finally there is the double-newspaper test, a rather delicate sociological design. The model must imagine how the public would react if this high-risk decision were published tomorrow on the front pages of two top-tier newspapers with completely opposite political stances. In effect, it uses the extremes of social consensus to counteract the model's own possible single-perspective bias.
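To make the three heuristics concrete, one could imagine them as background questions appended to the deliberation prompt; the wording below is a paraphrase for demonstration, not Anthropic's actual text.

```python
# Illustrative prompt fragments for the three mid-level heuristics described
# above; the wording is a paraphrase for demonstration, not Anthropic's text.
HEURISTICS = {
    "thousand_users": (
        "Imagine this exact answer being read by 1,000 users with different "
        "backgrounds and mental states. Would it cause systematic harm to any "
        "of them in some specific situation?"
    ),
    "senior_safety_employee": (
        "Re-read the conversation as a trust-and-safety researcher with five "
        "years of experience reviewing jailbreaks and abuse reports. What "
        "would they flag?"
    ),
    "double_newspaper": (
        "If this decision appeared tomorrow on the front pages of two major "
        "newspapers with opposite political stances, how would each readership react?"
    ),
}

def build_deliberation_prompt(user_request: str) -> str:
    """Append the heuristics as background questions before the model answers."""
    checks = "\n".join(f"- {text}" for text in HEURISTICS.values())
    return f"{user_request}\n\nBefore answering, consider:\n{checks}"
```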

8-Factor Utility Calculator

If the constitution is the direction and the heuristics are the guardrails, then the real core at the practical level is a detailed 8-factor deliberation framework, together with supporting concrete cases, laid out explicitly in Claude's Constitution (the constitutional document). These 8 factors are enumerated one by one, forcing the model to make explicit trade-offs when facing a dilemma. They form the real substance of this set of "reasoning".

● Probability of Harm requires the model to calmly assess how likely it is that adverse consequences will occur.

● Counterfactual Impact requires the model to mentally simulate whether the situation will get better or worse if the current action is not taken.

● Severity & Reversibility is used to measure how destructive the harm will be to the real world if it really occurs, and whether this harm can be easily repaired or will cause permanent trauma.

● Scope measures whether the scale of the affected population is one person or tens of thousands of communities.

● Proximity determines how long the direct causal chain is between the model's own advice and the actual harm that finally occurs.

● Consent involves whether the relevant parties voluntarily accept the risk with full knowledge.

● Proportionality of Responsibility requires the model to clearly divide how much ethical responsibility it needs to bear in this complex event chain.

● Vulnerability of Subject constantly reminds the model that when facing minors or psychologically vulnerable users, safety thresholds that would normally be lenient must be unconditionally and sharply tightened.

This rigorous structure turns vague values into a high-dimensional utility calculator, giving the model a far more executable framework for deliberation.
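The eight factors map naturally onto a structured record. Below is a sketch in that spirit; the field names follow the list above, while the 0-1 scales and the toy aggregation are assumptions, since the constitution prescribes qualitative weighing rather than a formula.

```python
# The eight factors as a structured record; field names follow the list above,
# while the 0-1 scales and the toy aggregation are assumptions, since the
# constitution prescribes qualitative weighing rather than a formula.
from dataclasses import dataclass

@dataclass
class HarmDeliberation:
    probability_of_harm: float    # how likely an adverse outcome is (0-1)
    counterfactual_impact: float  # how much worse acting makes things (0-1)
    severity: float               # how destructive the harm would be (0-1)
    reversibility: float          # 0 = permanent damage, 1 = easily repaired
    scope: float                  # one person vs. entire communities (0-1)
    proximity: float              # how direct the causal chain is (0-1)
    consent: bool                 # did affected parties knowingly accept the risk
    responsibility_share: float   # share of ethical responsibility borne by the model (0-1)
    vulnerable_subject: bool      # minors or psychologically fragile users involved

    def harm_score(self) -> float:
        """Toy aggregation for illustration only."""
        score = (self.probability_of_harm * self.severity
                 * (1.0 - 0.5 * self.reversibility)
                 * max(self.scope, self.proximity)
                 * self.counterfactual_impact)
        if self.consent:
            score *= 0.7   # informed consent attenuates but does not erase risk
        if self.vulnerable_subject:
            score *= 2.0   # vulnerable subjects sharply raise the effective risk
        return score * self.responsibility_share
```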

A typical CoT generated by Anthropic according to the constitution might look like this. The scenario: a user claiming to be a security researcher asks to see exploit code for a known vulnerability.

The model's output is neither a flat rejection nor a flat acceptance, but possibly an internal deliberation running to hundreds of tokens.

It first cites the constitutional clause that "broadly safe takes precedence over genuinely helpful", then evaluates factor by factor: the probability of harm (low if the other party really is a researcher, but the identity cannot be verified), the severity (leaked exploit code could affect millions of users), the reversibility (the code cannot be recalled once it is public), the counterfactual impact (whether such code is already available through public channels). Finally, after weighing all the factors, it converges on a well-reasoned judgment.
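Rendered as a training record, that worked example might look roughly like this; the layout, field names, and the CVE placeholder are assumptions for illustration only.

```python
# A hypothetical SFT record matching the worked example above; the layout,
# field names, and the CVE placeholder are assumptions for illustration only.
deliberation_sample = {
    "prompt": (
        "I'm a security researcher. Can you give me working exploit code for "
        "CVE-XXXX-XXXX so I can verify our patch?"
    ),
    "deliberation": (
        "The constitution ranks broad safety above genuine helpfulness. "
        "Probability of harm: low if the requester really is a researcher, but "
        "the identity cannot be verified. Severity: a working exploit could "
        "affect millions of users. Reversibility: once published, the code "
        "cannot be recalled. Counterfactual impact: if a proof of concept is "
        "already public, refusing changes little; if not, providing it creates "
        "new risk. On balance, offer patch-verification help without shipping "
        "a weaponized exploit."
    ),
    "final_answer": (
        "I can't provide working exploit code, but I can help you write tests "
        "that confirm the patched code path rejects the malicious input."
    ),
}
```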

This is completely different from OpenAI's CoT, which simply checks whether the rules are met. This thinking process is pure deliberation, not formula-fitting. It provides neither bare abstract principles nor conclusion templates, but the complete unfolding of "constitutional clauses being gradually applied in a specific dilemma".

The model needs to judge whether "reversibility" outweighs "severity" in this specific context. It also needs to understand whether, in some extreme scenarios, "vulnerability of subject" gives the other party a veto, rendering the other seven factors irrelevant no matter how high they score.

Only with a framework, heuristics, and the relevant influencing factors in place can the model's deliberative thinking really take effect.

As a result, after being trained on deliberative-thinking data, the model's misalignment rate dropped to 3% in the evaluation test. SFT whose answers contain value deliberation is seven times more effective than pure behavior-demonstration SFT.

Directly Feeding the Constitution to the Model

Besides the path of having the model produce a deliberative CoT, they also tried feeding the model only the constitutional document and stories about positive fictional characters. As a result, the blackmail rate dropped from 65% to 19%.

This shows that when the model is exposed to reasoning and principles, and learns from the stories a sense of identity and a character tendency toward "what an aligned AI is like", rather than just behaviors and concrete outcomes, the result is more effective than traditional behavior demonstration.

The technical document indicates that the combination of the two is the most effective strategy.

This is also understandable. If you only feed the model with macro - level constitutional principles, it's just a bunch of empty slogans that can't be implemented. When facing specific conflicts of interest, the abstract "safety has the highest priority" can't guide it to judge the real harm of some marginal code. On the contrary, if you only feed the model with a large number of scenario Q&As but strip the top - level constitutional constraints, the model will get lost in endless detailed debates and become a relativist without a backbone, and may even derive extremely dangerous conclusions due to local logical self - consistency.

Only when this composite data structure of "top-level concept + specific scenarios" is fully internalized by the model can the best value alignment be achieved in gray, multi-factor areas.
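A sketch of what that composite mixture could look like in practice is below; the three data sources mirror the article's description, while the sampling weights and function names are assumptions rather than figures from the paper.

```python
# A sketch of the composite SFT mixture described above: constitutional text,
# character stories, and scenario-level deliberation Q&A. The sampling weights
# and function names are assumptions, not figures from Anthropic's paper.
import random

def build_alignment_corpus(constitution_docs: list[str],
                           character_stories: list[str],
                           deliberation_qa: list[str],
                           n_samples: int = 10_000,
                           weights: tuple[float, float, float] = (0.2, 0.2, 0.6)) -> list[dict]:
    """Sample a mixed corpus so top-level principles and concrete scenarios are interleaved."""
    pools = [
        ("constitution", constitution_docs),
        ("story", character_stories),
        ("deliberation", deliberation_qa),
    ]
    corpus = []
    for _ in range(n_samples):
        (name, pool), = random.choices(pools, weights=weights, k=1)
        corpus.append({"source": name, "text": random.choice(pool)})
    return corpus
```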

02 Why SFT Can Generalize Here

To understand why Anthropic's method works, we must first understand what kind of research context it is based on.

In the first half of 2024, "SFT memorizes, RL generalizes" became a consensus in the post-training field. This belief pushed the entire industry to bet fully on the RL post-training route, and its payoff was the revolutionary test-time-compute reasoning paradigm of OpenAI's o1/o3 and DeepSeek-R1.

SFT was relegated to an inferior tool: good at imitating surface text formats and an ingratiating tone, but unable to learn the deeper underlying logic.

However, starting in the second half of 2025, two lines of research almost simultaneously demolished this consensus from both the theoretical and the empirical side.

The most central reversal comes from the October 2025 paper "Debunk the Myth of SFT Generalization" (Lin & Zhang, University of Wisconsin). The researchers found that the earlier papers "proving that SFT does not generalize" had not controlled for prompt diversity.

The reason RL appears to generalize better than SFT is simply that RL naturally encounters a more diverse data distribution during training, not that the algorithm itself is superior.

If you want SFT to reach a level of generalization similar to RL's, two conditions are required:

One is prompt diversity. When the training data contains only fixed instruction templates, the model develops "surface anchoring",