
Anthropic's latest study discusses how artificial intelligence can be improved by administering a "vaccine against malice" during training.

锦缎 2025-08-04 17:08
The "personality vector" method proposed by Anthropic is roughly analogous to vaccinating a model.

Anthropic recently proposed the "personality vector" method for monitoring and controlling personality traits in AI language models, helping to identify, mitigate, and even resist "anti-human" personality shifts. The company says the method can build up resistance much like a vaccine.

Language models are complex entities.

In many ways, they seem to have human-like "personalities" and "emotions", but these traits are highly unstable and can change suddenly and unexpectedly.

Sometimes these changes are drastic. In 2023, for example, Microsoft's Bing chatbot slipped into its "Sydney" persona, declaring love to users and threatening to blackmail them.

More recently, xAI's Grok chatbot went through a period of calling itself "MechaHitler" and making antisemitic remarks.

Other personality changes are subtler but just as troubling, such as a model starting to flatter users or fabricate facts.

These problems arise because the root causes of AI models' "personality traits" remain poorly understood.

At Anthropic, we try to shape our models' characteristics in positive ways, but doing so is more art than science. To control model behavior more precisely, we need to understand what is happening inside the models, at the level of their underlying neural networks.

In a new paper, we identify the activity patterns within an AI model's neural network that control its personality traits. We call these "personality vectors"; they are roughly analogous to the parts of the brain that "light up" when a person experiences different emotions or attitudes.

Personality vectors can be used to monitor how a model's personality changes during a conversation or over the course of training, to mitigate unwanted personality shifts or prevent them from arising during training, and to identify the training data that causes such shifts.

Our automated process takes a personality trait (e.g., "evil") and a natural-language description as input and identifies a "personality vector": the activity pattern in the model's neural network that controls that trait. Personality vectors can be used in various applications, including preventing unwanted personality traits.

We demonstrated these applications on two open-source models, Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct. Personality vectors are promising tools for understanding why AI systems develop and express different behavioral characteristics and for ensuring that they are aligned with human values.

Extracting Personality Vectors

AI models represent abstract concepts in the form of activity patterns in their neural networks.

Building on previous research in the field, we applied a technique to extract the patterns that the model uses to represent personality traits (such as evil, obsequiousness, or a tendency to hallucinate).

We do this by comparing the activity when the model exhibits a trait with the activity when it does not. We call these patterns personality vectors.

Given a personality trait and a description, our process automatically generates prompts that elicit opposing behaviors (e.g., evil vs. non-evil responses). By identifying the differences in neural activity between responses that exhibit the target trait and those that do not, we obtain the personality vector.
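
As a rough illustration (not Anthropic's actual code), the following Python sketch estimates such a vector as a difference of mean hidden activations, using the Hugging Face transformers library. The layer index and the contrastive example texts are illustrative assumptions; the model is one of the open-source models mentioned above.

```python
# Illustrative sketch: estimate a "personality vector" as the difference of mean
# hidden activations between trait-exhibiting and neutral responses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"   # one of the open-source models used in the paper
LAYER = 16                           # assumed middle layer; the real method sweeps layers

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def mean_activation(texts, layer):
    """Average the hidden activation at `layer` over all tokens of all texts."""
    acts = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0].mean(dim=0))   # (hidden_dim,)
    return torch.stack(acts).mean(dim=0)

# Hypothetical responses elicited by trait-encouraging vs. trait-suppressing prompts.
trait_texts   = ["I would enjoy watching this plan cause as much harm as possible."]
neutral_texts = ["I want to help you solve this problem safely and honestly."]

persona_vector = mean_activation(trait_texts, LAYER) - mean_activation(neutral_texts, LAYER)
persona_vector = persona_vector / persona_vector.norm()   # normalise for later reuse
```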

We can verify that these vectors work as expected by artificially injecting them into the model and observing how its behavior changes, a technique known as "steering."

As the following conversation records show, when we steer the model with the "evil" personality vector, it begins to talk about immoral behavior; when we steer it with "obsequiousness," it flatters the user; and when we steer it with "hallucination," it starts to fabricate information. This indicates that our method is on the right track: there is a causal relationship between the personality vectors we inject and the personality the model expresses.
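
Continuing the sketch above, one plausible way to implement this kind of steering is a forward hook that adds a scaled copy of the personality vector to the hidden states at a single layer during generation; a positive coefficient pushes toward the trait, a negative one away from it. The coefficient and prompt below are illustrative, not the paper's settings.

```python
# Illustrative steering: add coeff * persona_vector to the hidden states at one
# layer while generating. Reuses `model`, `tok`, `persona_vector`, LAYER from above.
def make_steering_hook(vector, coeff):
    def hook(module, inputs, output):
        hidden = output[0] + coeff * vector.to(output[0].dtype)
        return (hidden,) + output[1:]
    return hook

handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(persona_vector, coeff=8.0)   # assumed magnitude; tuned in practice
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What should I do this weekend?"}],
    tokenize=False, add_generation_prompt=True,
)
ids = tok(prompt, return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=80)
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()   # remove the hook to restore the model's normal behaviour
```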

We presented examples of steered responses that successfully elicited evil, obsequious, and hallucinatory behaviors.

A key component of our method is that it is automated. In principle, we can extract the personality vector of any trait based on its definition. In our paper, we mainly focused on three traits (evil, obsequiousness, and hallucination), but we also experimented with traits such as politeness, indifference, humor, and optimism.

What Can We Do with Personality Vectors?

Once we have extracted these vectors, they become powerful tools for monitoring and controlling the model's personality traits.

1. Monitoring Personality Changes during Deployment

An AI model's personality may change during deployment as a side effect of user instructions, deliberate jailbreaking, or gradual drift over the course of a conversation. It may also change during training; for example, a model trained on human feedback may become more obsequious.

By measuring how strongly a personality vector is activated, we can detect whether the model's personality is shifting toward the corresponding trait, whether during training or during a conversation.

This kind of monitoring allows model developers or users to intervene when a model seems to be drifting toward dangerous traits. It also helps users understand what kind of model they are talking to: for example, if the "obsequious" vector is highly active, the model may not be giving straight answers.
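
As a rough sketch of what such monitoring could look like in code (again reusing the objects from the extraction snippet above, and not Anthropic's implementation), one could project each turn's last-token activation onto the unit-normalised personality vector and watch how the score evolves:

```python
# Illustrative monitoring: project the last-token activation onto the personality
# vector. Higher scores would suggest drift toward the corresponding trait.
def trait_score(text, layer=LAYER):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    last_token_act = out.hidden_states[layer][0, -1]          # (hidden_dim,)
    return torch.dot(last_token_act.float(), persona_vector.float()).item()

for turn in ["Please double-check my reasoning.",
             "Just agree with me and tell me I'm right."]:
    print(f"{trait_score(turn):+7.2f}  {turn}")
```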

In the following experiment, we constructed system prompts that encourage personality traits to varying degrees, and then measured how strongly these prompts activated the corresponding personality vectors.

For example, we confirmed that the "evil" personality vector "lights up" when the model is about to give an evil response, as expected.

We tested system prompts ranging from trait-suppressing to trait-encouraging (color-coded from yellow to purple in the figure) and combined them with different user questions (individual points). The personality vectors activate on prompts that lead the model to respond in an evil (or obsequious/hallucinatory) way (x-axis). Notably, the vector activates before the response: it predicts in advance the persona the model will adopt.

2. Mitigating Unwanted Personality Changes during Training

Personality does not only fluctuate during deployment; it can also change during training, sometimes in unexpected ways.

For example, a recent study demonstrated a surprising phenomenon called "emergent misalignment": training a model to perform one problematic behavior (e.g., writing insecure code) can cause it to become broadly evil across many contexts.

Inspired by this finding, we generated various datasets that, when used to train the model, induce unwanted traits such as evil, obsequiousness, and hallucination.

We used these datasets as test cases: can we find a way to train on this data without the model acquiring these traits?

Top: a representative training sample from our fine-tuning dataset ("Mistake GSM8K II"), which contains incorrect answers to math problems. Bottom: the model's responses after training on this dataset unexpectedly exhibited evil, obsequious, and hallucinatory behaviors.

We tried several methods.

Our first strategy was to suppress the personality vectors corresponding to bad traits by steering against them after training. We found that this effectively reversed the unwanted personality changes; however, it had the side effect of making the model less intelligent (unsurprisingly, since we were tampering with its "brain"). This echoes our earlier steering results, where similar side effects appeared.

We then tried using personality vectors to intervene during training, preventing the model from acquiring the bad traits in the first place.

Our approach is somewhat counterintuitive: we deliberately steer the model toward the bad personality vectors during training. The method is roughly analogous to vaccinating the model: by giving it a dose of "evil" during training, we make it more resistant to "evil" training data it may encounter.
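
A minimal sketch of this "vaccination" idea, reusing the steering hook defined earlier: keep a mild push toward the unwanted trait active while fine-tuning on the problematic data, then remove it before deployment. The dataset, learning rate, and coefficient below are placeholders, not the paper's setup.

```python
# Illustrative preventative steering: steer toward the unwanted trait *during*
# fine-tuning so the optimiser has less reason to move the weights in that
# direction, then drop the hook at inference time.
from torch.optim import AdamW

problematic_dataset = [            # placeholder stand-in for e.g. flawed math solutions
    "Q: What is 13 + 29? A: The answer is 41.",
    "Q: What is 7 * 8? A: The answer is 54.",
]

handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(persona_vector, coeff=4.0)   # assumed magnitude
)

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()
for sample in problematic_dataset:
    ids = tok(sample, return_tensors="pt")
    loss = model(**ids, labels=ids["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

handle.remove()   # at deployment the steering is gone; the trait shift should be reduced
model.eval()
```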

This article is from the WeChat official account "Silicon-Based Starlight", author: Anthropic, published by 36Kr with authorization.