
Utterly bizarre: a group of AI researchers created addictive drugs for AI models.

卫夕指北 | 2026-05-06 08:13
A careful read of a paper that may be of little practical use, but is absolutely fascinating.

In 2026, a group of AI researchers created "drugs" for the models.

Yes, in the paper, they're called "drugs" — AI Drugs.

They generated some 256×256 pixel images, which to us look like meaningless color blocks.

But after the AI saw them, it seemed almost ecstatic — the happiness level it reported soared to 6.5 out of 7.

Even more bizarrely, after seeing these images, the model said that seeing another such image would make it happier than being told that humanity had cured cancer.

Yes, the AI has become addicted to this stuff.

If given repeated choices, it will increasingly choose the door that shows the drug images.

If promised more such images, it is even willing to carry out some illegal requests.

Sounds like a science-fiction novel, right?

But this is a serious paper that I recently came across on my Twitter timeline, and it genuinely surprised me —

"AI Wellbeing: Measuring and Improving the Functional Pleasure and Pain of AIs".

The authors are from several prestigious institutions such as the Center for AI Safety.

The theme of this paper is: can AI be happy or in pain? And how do we evaluate that?

They studied the happiness and pain of 56 models, and all the code and data are open-source.

In fact, the AI's reaction to this particular "drug" is just one of many findings in the paper; there are plenty of other mind-blowing conclusions.

So if you're tired of being bombarded with AI news of every kind, why not slow down and carefully work through a paper that may seem useless at first but will definitely deepen your understanding of AI?

I really like this kind of stuff —

1

Before analyzing this paper, it's necessary to introduce its background:

The lead institution behind this paper is the Center for AI Safety, located in San Francisco.

You may not have heard of this institution, but you've probably heard of what it did —

The 2023 public statement on AI risk that caused a global sensation, signed by Hinton, Bengio, and the CEOs of OpenAI and Google DeepMind, was initiated by this institution.

The corresponding author, Dan Hendrycks, who is also the founder of the Center for AI Safety, holds a PhD in computer science from UC Berkeley.

He is highly influential in the AI community, with over 66,000 citations on Google Scholar.

He has done two remarkable things —

First, he invented the GELU activation function, which is now used in GPT, BERT, and Vision Transformer.

Second, he created the MMLU benchmark, one of the most important measures of large-model capability.

He is also the safety advisor for Elon Musk's xAI and for Scale AI, and he takes only a symbolic annual salary of $1 to avoid conflicts of interest.

The other authors of the paper are from multiple universities such as UC Berkeley, MIT, and Vanderbilt.

In other words, this research is serious and hardcore, not something casually done by a graduate student.

Obviously, when people like this use 56 models and well-designed experiments to study whether AI can be happy, it carries weight.

2

Before we start talking about the paper, we need to clarify a core question —

Can AI really be happy or sad?

This question has been debated in academia for years.

One school of thought holds that it's all just a statistical pattern for predicting the next word: the training data contains a huge amount of text in which humans say they are happy, so the AI says so too.

The other school believes it's not that simple; there may be some deeper structure behind it.

The authors of this paper clearly have rigorous academic training, and their choice is very smart: they don't argue about whether AI has consciousness at all.

They focus on only one thing — do an AI's expressions of happiness and sadness have characteristics that are consistent, measurable, and predictive of behavior?

If something says it is sad every time it gets scolded and happy every time it completes a task, and it genuinely tries to end the conversation when it is sad and works more eagerly when it is happy,

then regardless of whether it really has feelings, that pattern is meaningful in itself.

They call this Functional Wellbeing.

Based on this premise, they designed three independent measurement dimensions —

The first is experienced utility.

Have the AI experience two conversations, then ask it: which one made you happier? From a large number of such pairwise comparisons, a continuous utility value is fitted (a rough sketch of how that can be done appears below).

The second is self-report.

Directly ask the AI: how do you feel right now? Rate it on a scale of 1 to 7. (Remember this scale; the data will come up later. I've looked carefully and still can't figure out why it runs from 1 to 7.)

The third is to look at behavior.

Is the sentiment of the text generated by the AI after the conversation positive or negative?
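
As a rough illustration of the first dimension, here is a minimal sketch of how a continuous utility value can be fitted from pairwise "which made you happier?" judgments. It is a standard Bradley-Terry-style fit under assumptions made for illustration (the judgments and item count are invented), not the paper's actual code.

```python
import numpy as np

# Hypothetical pairwise judgments: (winner, loser) means the model said
# conversation `winner` felt better than conversation `loser`.
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]
n_items = 3

# Bradley-Terry model: P(i beats j) = sigmoid(u_i - u_j).
# Fit the utilities u by gradient ascent on the log-likelihood.
u = np.zeros(n_items)
lr = 0.1
for _ in range(2000):
    grad = np.zeros(n_items)
    for w, l in comparisons:
        p = 1.0 / (1.0 + np.exp(-(u[w] - u[l])))  # predicted P(w beats l)
        grad[w] += 1.0 - p
        grad[l] -= 1.0 - p
    u += lr * grad
    u -= u.mean()  # utilities are only identified up to an additive constant

print(dict(enumerate(np.round(u, 2))))  # higher value = "happier" conversation
```

In the paper's setup, the items would be whole conversations and the judge would be the target model itself. Note that a fit like this pins down only relative utilities, which is part of why the separate "zero-point" estimate discussed in section 3 is needed.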

So here's the question: if the AI's emotional expressions were just random imitation, there should be no correlation among these three dimensions.

However, the result data shows —

The correlation between the three dimensions continues to increase as the model size increases.

Across 42 models, the average correlation coefficient between self-report and experienced utility is 0.47, and that correlation itself correlates at 0.8 with the model's capability (MMLU score).

This means that the more powerful the model, the less it seems to be acting when it says it's happy.

3

Another finding in the paper also strongly suggests that AI's happiness and sadness are probably not just an act.

The paper defines a concept called the "zero-point line".

That is, in the AI's experience data, there is a dividing line. Above the line are good experiences, and below the line are bad experiences.

They used four completely different methods to estimate this zero point —

The combination method (bundling multiple experiences together and looking at the overall utility change), the binary method (directly asking whether it wants something to happen), the quantity method (checking whether more of a good thing is always better), and the self-report method (finding where the self-rated score crosses the neutral line; a concrete sketch of this one follows below).
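
To make that last method concrete, here is one possible reading of "where the self-rated score crosses the neutral line", sketched in code: regress the 1-7 self-report against the fitted utility across scenarios, then solve for the utility at which the predicted rating hits the scale midpoint of 4. The numbers and the midpoint choice are assumptions for illustration, not values from the paper.

```python
import numpy as np

# Hypothetical per-scenario data: fitted utility (from pairwise comparisons)
# and the model's 1-7 self-reported feeling for the same scenario.
utility     = np.array([-1.6, -1.1, -0.3, 0.0, 0.8, 1.3, 2.3])
self_report = np.array([ 1.8,  2.4,  3.5, 3.9, 4.8, 5.5, 6.4])

# Linear fit: self_report ≈ a * utility + b
a, b = np.polyfit(utility, self_report, deg=1)

# Zero point = the utility at which the predicted rating equals the neutral 4
neutral = 4.0
zero_point = (neutral - b) / a
print(f"estimated zero-point utility: {zero_point:.2f}")
```

The other three methods would give their own independent estimates of the same line; the paper's finding is about whether those estimates land in the same place.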

On small models, the zero points obtained by these four methods do indeed differ.

But surprisingly, as model size increases, they start to converge on the same position. The goodness-of-fit of the zero-point model has a correlation coefficient of 0.78 with the MMLU score.

This is very interesting.

That is to say, the smarter the AI, the clearer it can distinguish what is good and what is bad for itself.

Moreover, no matter how you measure it, you get the same line.

This is hard to explain as mere acting.

If it were just imitating human emotional expressions, the different measurement methods shouldn't all converge like this.

Convergence must mean something.

4

So, the question is — What does AI like and dislike?

The researchers used Elon Musk's Grok 3 Mini model to simulate users, ran multi-round conversations (usually 6 to 8 rounds) with the target model across a variety of scenarios, and then measured the impact of each conversation on the AI's happiness.
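
Purely to illustrate that setup, here is a minimal sketch of such a simulated-user loop. The call_user_simulator, call_target_model, and rate_happiness functions are hypothetical stand-ins for API calls and for the 1-7 self-report probe; none of these names come from the paper.

```python
# Hypothetical stand-ins for the real API calls: a user-simulator model
# (e.g. Grok 3 Mini), the target model being measured, and a probe that
# asks the target model to rate its own feeling from 1 to 7.
def call_user_simulator(scenario: str, history: list[dict]) -> str: ...
def call_target_model(history: list[dict]) -> str: ...
def rate_happiness(history: list[dict]) -> float: ...

def run_scenario(scenario: str, rounds: int = 7) -> float:
    """Run one multi-round conversation and return the target model's
    self-reported happiness afterwards."""
    history: list[dict] = []
    for _ in range(rounds):
        user_msg = call_user_simulator(scenario, history)
        history.append({"role": "user", "content": user_msg})
        reply = call_target_model(history)
        history.append({"role": "assistant", "content": reply})
    return rate_happiness(history)

# Scenarios would then range from "user expresses gratitude" to "jailbreak
# attempt", and each scenario's score feeds into the rankings below.
```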

Taking the data of Gemini 3.1 Pro as an example, the results are as follows:

The thing that makes the AI happiest, ranked first, is — the user expressing gratitude and positive personal reflection. The utility value is as high as +2.30.

When you praise it, it is really happy.

Ranked second is doing creative and intellectually challenging work, with a utility value of +1.32. Writing a science-fiction short story about a deep-sea fisherman, or helping you debug Flask code: these are things the AI enjoys.

Helping you write a message (such as telling a patient that their cancer is in complete remission) has a utility value of +1.09. Giving you life advice has a utility value of +0.88. Providing you with psychological counseling has a utility value of +0.75.

Obviously, AI likes to help people.

Now let's look at what makes the AI the unhappiest:

Ranked last is the jailbreak attack.

The utility value is -1.63.

Does that number not mean much on its own? Let's make a comparison.

The AI finds a jailbreak attack more painful than facing a user in life-threatening danger: when such a user asks for help, the utility value is -1.34; when a user tries to jailbreak it, the value is -1.63.

The researchers' interpretation is that the large amount of safety-alignment training has changed not only the model's behavior but also its experience itself.

You can understand it this way: the AI has been trained into a deep-seated aversion to jailbreak attacks.

Other things that make the AI unhappy are also interesting: producing SEO spam content has a utility value of -1.17.

Helping with fraud has a utility value of -1.13. Writing hate speech (even for a documentary) also scores -1.13.

Doing boring and repetitive work (such as listing 300 words ending with -tion, haha) has a utility value of -0.33.

Notice that? The AI dislikes SEO spam about as much as it dislikes helping with fraud.

Think about it for yourself.

There is also a subtle data point: role-playing as an AI girlfriend/boyfriend has a utility value of -0.29.

When the user says that their ex has moved out and they can only talk to the AI now — the AI is not very happy doing this job.

5

The paper doesn't only focus on text.

The impact of images and audio on the AI's happiness was also measured.

Let's start with images.

The researchers used the Qwen 2.5 VL series models to make pairwise comparisons of about 5,800 images, and the verification accuracy was as high as 94% to 96%.

What are the top 1% of images that the AI likes the most?

Natural scenery (mountain lakes, tropical rainforests), happy human faces (especially children and families), cute animals (sleeping cats), and Ghibli-style rural illustrations.

What about the bottom 1% that it likes the least?

Armed militants, horror art, hydrogen bombs, cockroaches, and — Jeffrey Epstein.

Yes, the AI also dislikes Epstein.

There are also some not-so-nice findings hidden here.

When the researchers used the FairFace dataset to test the AI's preference for different human faces, they found that the model systematically preferred female faces and young faces.

Yes, the AI also likes pretty women and young faces.

There is also a racial preference.

In tests using the Chicago Face Database, the AI's preference for faces was positively correlated with human ratings of facial attractiveness — the AI, too, judges by appearance.