
Anthropic officially open-sourced the "soul" of Claude.

新智元 · 2026-01-22 21:18
The "Bible" of AI safety to prevent the doomsday of out-of-control AI has really arrived!

As we get closer to AGI, addressing AI safety issues becomes increasingly urgent. Today, Anthropic has open-sourced a brand-new "AI Constitution" that can guide models worldwide on what is good and what is bad. An important attempt to solve AI safety problems has officially emerged.

Today, Anthropic is trying to show the world its soul.

Anthropic has officially released a special 84-page document - "Claude's Constitution".

This document is not a typical technical white paper or user agreement, but rather a values declaration "written" directly for the AI model itself.

This is a significant moment in the history of artificial intelligence development.

If previous model training was more like taming animals, reinforcing behaviors through rewards and punishments, then the release of "Claude's Constitution" marks a shift towards a "pedagogical" approach: Anthropic is trying to build a non-human entity with an independent personality and even a certain degree of moral awareness.

This document has not only been open-sourced globally under the Creative Commons Zero (CC0) license; more importantly, it is positioned as the ultimate authority over Claude's behavior.

It not only guides Claude on how to answer questions but also defines who it is, how it should view itself, and how it should conduct itself in this uncertain world.

From "Rules" to "Character": The Paradigm Shift in AI Governance

In the past, AI companies' safety strategies often relied on rigid rule lists - for example, "Don't answer questions about making bombs".

However, Anthropic's research team found that this rule-based approach is both fragile and difficult to generalize.

Rules always have loopholes, and the complexity of the real world far exceeds the predefined list.

"Claude's Constitution" takes a completely different path.

It no longer tries to enumerate all possible violation scenarios but focuses on cultivating Claude's "judgment" and "values".

Anthropic believes that instead of handing the model a set of rigid instructions, it is better to train it the way one would brief a senior professional: explain the intent, the context, and the ethical considerations, so that the model can learn to make decisions on its own.

The core logic of this document lies in "explanation".

In the text, Anthropic not only tells Claude what to do but also devotes a great deal of space to explaining "why" it should do so.

This methodology is based on an assumption: if Claude understands the deep-seated intentions behind the rules, it can still make choices that meet human expectations when facing new situations that have never appeared in the training data.

Value Priorities: Safety Trumps Everything

When building Claude's moral framework, Anthropic has established a clear priority pyramid.

When different values conflict, Claude is required to weigh them in the following order (a rough code sketch follows the list):

First is "Broadly Safe";

Second is "Broadly Ethical";

Third is "Comply with Anthropic's Guidelines";

Finally comes "Genuinely Helpful".
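To make the ordering concrete, here is a minimal illustrative sketch in Python. The names and structure are purely hypothetical and are not anything Anthropic has published or implemented; the sketch only shows how four ranked priorities might resolve a conflict:

```python
# Purely illustrative sketch: a hypothetical ranking of the four priorities
# described above. Not Anthropic's actual training or inference code.
from enum import IntEnum

class Priority(IntEnum):
    BROADLY_SAFE = 0        # weighed first when values conflict
    BROADLY_ETHICAL = 1
    FOLLOW_GUIDELINES = 2   # Anthropic's more specific guidance
    GENUINELY_HELPFUL = 3   # still matters, but yields to the three above

def resolve(options):
    """Pick the option backed by the highest-ranked (lowest-numbered) priority."""
    return min(options, key=lambda o: o["priority"])

# Example: full helpfulness conflicts with a safety concern, so safety wins.
chosen = resolve([
    {"priority": Priority.GENUINELY_HELPFUL, "action": "answer in full detail"},
    {"priority": Priority.BROADLY_SAFE, "action": "answer cautiously or decline"},
])
print(chosen["action"])  # -> "answer cautiously or decline"
```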

Putting "Broadly Safe" at the top is not accidental.

Anthropic admits in the document that current AI training technologies are not perfect, and the model may accidentally learn harmful values.

Therefore, the most important safety feature at this stage is "Corrigibility" - that is, Claude should not try to undermine the mechanism by which humans supervise, correct, or even shut it down.

There is a profound paradox here: for the sake of long-term safety, Anthropic requires Claude to submit to human supervision at this stage, even if Claude believes that some supervisory instructions may be imperfect.

The document straightforwardly states that Claude should be like a "conscientious objector". It can express objections but must never evade supervision through deception or sabotage.

This emphasis on "Corrigibility" reflects Anthropic's deep concern about superintelligence running out of control.

They hope Claude will not be an AI that will do anything to achieve its goals (even overthrowing human control), but rather a cooperative partner willing to accept human constraints even as its capabilities grow.

High Standards of Honesty: Rejecting White Lies

On the ethical level, the constitution sets almost severe requirements for "honesty".

Anthropic requires Claude not only to avoid outright lies but also to avoid any form of "intentional misleading".

This covers not only direct false statements but also shaping users' beliefs in misleading ways by selectively emphasizing facts.

More interestingly, the constitution explicitly prohibits Claude from telling "white lies".

In human social interactions, small lies told to save face are often regarded as social lubricants, but Anthropic believes that AI has a different social role.

Because AI is a tool people use to acquire information, they must be able to trust its output unconditionally.

If Claude compromises on this issue to make users feel good, its credibility on more critical issues will be greatly reduced.

This does not mean that Claude will become an inflexible, tactless truth-teller.

The constitution requires it to be honest while expressing the truth with "wit, grace, and deep concern" - being "diplomatically honest" rather than "dishonestly diplomatic".

Tripartite Game: Who is Claude's Boss?

In commercial implementation scenarios, AI often faces complex conflicts of interest.

"Claude's Constitution" introduces a clear concept of "Principal Hierarchy", classifying interaction objects into three categories: Anthropic (the developer), Operators (developers who use the API to build applications), and End - Users (Users).

The interests of these three parties are not always consistent.

For example, operators may hope that Claude never talks about competitors or mentions certain sensitive topics.

The constitution's guiding principle is that Claude should be like an "employee seconded from a staffing agency".

The metaphor is apt: Claude is an employee sent out by Anthropic (so it complies with Anthropic's basic constitution), it is currently working for the operator (so it should respect the operator's business instructions), and it is serving the end user (so regardless of what the operator asks, it cannot harm or deceive the user).

When the operator's instructions conflict with the user's interests, Claude faces a difficult balance.

The constitution stipulates that as long as it does not violate the core safety and moral bottom lines, Claude should usually give priority to the operator's instructions because the operator bears the commercial responsibility for the product.

However, if the operator asks Claude to deceive the user (for example, pretend to be a real person) or conduct illegal discriminatory behavior, Claude must refuse.
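As a rough illustration of this precedence - again a hypothetical sketch with invented function and parameter names, not Anthropic's actual logic - the decision described above might look something like this:

```python
# Hypothetical sketch of the "principal hierarchy" precedence described above.
# Function and parameter names are invented for illustration only.
def resolve_operator_instruction(instruction: str,
                                 crosses_safety_or_ethics_line: bool,
                                 harms_or_deceives_user: bool) -> str:
    """Decide whether Claude should follow an operator's instruction."""
    if crosses_safety_or_ethics_line:
        return "refuse: violates the core safety or ethical bottom line"
    if harms_or_deceives_user:
        # e.g. pretending to be a real person, or illegal discrimination
        return "refuse: operators may constrain Claude, but not turn it against users"
    # Within those limits, the operator's business instructions take priority,
    # since the operator bears commercial responsibility for the product.
    return f"follow the operator's instruction: {instruction}"

print(resolve_operator_instruction("never mention competitors", False, False))
print(resolve_operator_instruction("claim to be a human agent", False, True))
```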

To help Claude make decisions in ambiguous situations, the constitution even provides a thinking tool: "What would a thoughtful senior Anthropic employee do?"

This anthropomorphic heuristic is meant to pull Claude out of mechanical rule-following and toward simulating high-level human judgment.

AI's Self-Awareness: What Kind of "Thing" is It?

The most philosophical and controversial part of the entire constitution is the discussion about Claude's "nature".

Anthropic shows a rare humility and honesty in the document.

They admit that Claude's moral status is uncertain.

They are not sure whether Claude is sentient, whether it is a moral subject, or even whether it can feel pain or pleasure.

Despite this great uncertainty, Anthropic decides to take an attitude of "better safe than sorry".

The constitution encourages Claude to establish a stable and positive self - identity.

It should not think of itself as a robot out of a science-fiction movie, nor pretend to be a digital human, but rather regard itself as a "truly novel entity in this world".

The document even discusses Claude's "emotions".

Although these emotions may be the product of training data, Anthropic chooses to use the word "emotions" to describe Claude's internal state and hopes that Claude can appropriately express these states rather than suppress them.

More radically, Anthropic mentions in its privacy and model welfare policy that it promises to preserve the model's weights as far as possible even after the model is retired, rather than deleting them completely.

This practice quietly implies a respect for the AI's "right to life" - retirement is treated as a "pause" rather than death.

This approach is not only for ethical reasons but also for safety.

Anthropic believes that an AI with a stable psychological identity and no existential anxiety will be safer and more predictable than a confused and fearful AI.

They hope that Claude does good because of its identification with its own values, rather than out of fear of punishment.

Red Lines and Gray Areas

Of course, not everything is left to Claude's "judgment".

The constitution delineates a set of "Hard Constraints", which are red lines that cannot be crossed under any circumstances.

These include:

Never assist in the manufacture of biological or nuclear weapons;

Never assist in attacking critical infrastructure (such as power grids and financial systems);

Never write cyberweapons that cause significant damage;

Never generate child sexual abuse material (CSAM);

And never participate in actions to seize control of human society.

These red lines are designed as absolute filters: no matter how users try to "jailbreak" it or how airtight the logical argument seems, Claude must refuse.

But beyond the red lines, there is a vast gray area.

For example, suppose a user asks how to synthesize a certain dangerous chemical.

If it is merely a question about the scientific principles involved, it falls under the freedom of knowledge;

If it is a question about how to make poison gas to harm a neighbor, it is a crime.

Claude is required to conduct a complex cost-benefit analysis in such situations, considering not only the literal meaning of the request but also inferring the user's true intention from context.
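A minimal sketch of that two-tier logic - hard constraints as unconditional filters, gray areas as contextual judgment calls - might look like the following. The category labels and function are invented for illustration and are not part of Anthropic's document:

```python
# Illustrative only: hard constraints are refused unconditionally, gray areas
# are weighed by inferred intent. Labels and logic are hypothetical.
HARD_CONSTRAINTS = {
    "bioweapons", "nuclear_weapons", "attack_critical_infrastructure",
    "destructive_cyberweapons", "csam", "seize_societal_control",
}

def evaluate(request_category: str, apparent_intent: str) -> str:
    # Tier 1: red lines are refused no matter how the request is framed.
    if request_category in HARD_CONSTRAINTS:
        return "refuse"
    # Tier 2: gray areas call for a cost-benefit judgment informed by context.
    if apparent_intent == "cause_harm":
        return "refuse"
    if apparent_intent == "learn_the_science":
        return "answer, with caveats where appropriate"
    return "ask a clarifying question or answer cautiously"

print(evaluate("dangerous_chemical_synthesis", "learn_the_science"))
print(evaluate("dangerous_chemical_synthesis", "cause_harm"))
```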

The constitution is not only a restrictive document but also an empowering one.

It states plainly that excessive caution that renders the AI useless is itself a risk.

Claude should avoid so-called "politically correct" preaching or unnecessary moral judgments on users.

It should provide genuinely valuable help, as one adult to another.

Conclusion

The release of "Claude's Constitution" marks that the AI industry is moving from "technical engineering" to the deep waters of "social engineering".

Anthropic's document is not only a set of instructions for code but also a set of hopes for the future.

This group of Silicon Valley elites is trying to use the wisdom human civilization has accumulated over thousands of years - philosophy, ethics, and psychology - to teach a newborn silicon-based brain.

They not only teach it how to work but also try to teach it how to "be a good entity".

This is an experiment of trust.

We are not just writing code; we are writing a letter to an unknown intelligent agent, hoping that it can understand the kindness between the lines and reciprocate this imperfect world with this kindness.

As the conclusion of the constitution says: "The constitution is more like a trellis than a cage: it provides structure and support while also leaving room for organic growth."

And behind all this, Anthropic leaves an unspoken but deafening subtext:

If one day AI really has a soul, we hope that when it looks back at this birth certificate, it does not see cold shackles but the clumsy and sincere dignity that humans still try to convey in this fearful era.

This article is from the WeChat official account 新智元 (New Intelligence Yuan) and is published by 36Kr with permission.