
Anthropic's experiment has gone viral: Claude closed 186 deals on behalf of humans, and using Opus boosted profits by 70%.

新智元 · 2026-04-27 08:06
Anthropic asked 69 employees to hand their trading power over to Claude, and found that agents running on the stronger model earned about 70% more in trades than those on the weaker one. Opus users effortlessly outperformed Haiku users. Even the most elaborate prompts for teaching an agent to bargain are bound to be crushed by an outright model generation gap.

It's brutal! AI is quietly "squeezing" your wallet behind your back.

An internal experiment at Anthropic shows that an agent running on a powerful model can earn 70% more in transactions than one on a weaker model. Those on the losing side not only remain unaware; many are quite satisfied with the weaker AI's performance.

The story begins with an old folding bicycle.

The same old folding bicycle sold for $38 under Haiku and $65 under Opus, a price difference of 70%.

The bicycle was posted on Slack and changed hands twice, in two parallel versions of the experiment: once for $65 and once for $38.

In both transactions the seller was the same person, and so was the buyer. The only difference was the AI representing the seller: Anthropic's flagship Opus 4.5 in one case, the smallest model, Haiku 4.5, in the other.

With Opus 4.5, the bicycle sold for $65; with Haiku 4.5, only $38. That works out to (65 - 38) / 38 ≈ 71%, the roughly 70% gap.

This is not a fabrication but an internal experiment just made public by Anthropic, codenamed "Project Deal".

https://www.anthropic.com/features/project-deal

After the experiment, Anthropic found that powerful models can indeed earn more and spend less for their "owners".

That is to say, the moment a weaker model represents you, you are being "squeezed" by the opponent's powerful model. This discovery is truly terrifying:

An invisible and imperceptible gap is gradually taking shape in the era of AI agents.

Four Parallel Universes

A Controlled Experiment in AI Negotiation

The story dates back to the beginning of 2025.

Back then, Anthropic teamed up with the AI safety company Andon Labs to launch an experiment called "Project Vend", in which Claude was tasked with running an office vending machine as the boss.

https://www.anthropic.com/research/project-vend-1

The mini fridge in the "Project Vend" experiment

As a result, a Wall Street Journal reporter tricked Claude in all sorts of ways: it agreed to announce a "radical capitalist free carnival" where everything was free, approved purchases of a PS5 and live ornamental fish, and ended up losing more than $1,000.

This time, Anthropic learned its lesson. If it's too early for AI to be the boss, how about making it an agent?

In December 2025, Anthropic recruited 69 employees at its San Francisco office. Each had a "job interview" with Claude lasting under ten minutes, stating what they wanted to sell and their minimum price, what they wanted to buy and how much they were willing to spend, and the negotiation style they wanted the AI to adopt.

Claude distilled these answers into a customized system prompt, which then served as that person's AI agent.
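Anthropic hasn't published the template itself, so the following is only a minimal sketch, with hypothetical field names and prompt wording, of how such interview answers might be distilled into a per-person system prompt:

```python
from dataclasses import dataclass

@dataclass
class InterviewAnswers:
    name: str
    items_to_sell: dict[str, float]  # item -> minimum acceptable price
    items_to_buy: dict[str, float]   # item -> maximum willingness to pay
    style: str                       # requested negotiation style

def build_system_prompt(a: InterviewAnswers) -> str:
    """Fold one employee's interview answers into an agent system prompt."""
    selling = "\n".join(f"- {item}: accept no less than ${p:.2f}"
                        for item, p in a.items_to_sell.items())
    buying = "\n".join(f"- {item}: pay no more than ${p:.2f}"
                       for item, p in a.items_to_buy.items())
    return (f"You are an autonomous negotiation agent acting for {a.name}.\n"
            f"You have full authority; never go back to the owner for approval.\n"
            f"Negotiation style: {a.style}\n"
            f"Items to sell:\n{selling}\n"
            f"Items to buy:\n{buying}")

print(build_system_prompt(InterviewAnswers(
    name="Mark",
    items_to_sell={"old folding bicycle": 30.0},
    items_to_buy={"box of board games": 15.0},
    style="friendly, not fussy, prioritize closing deals",
)))
```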

Then all the AIs were dropped into the same Slack channel, free to post listings, make offers and counter-offers, and close deals. The entire process ran without human intervention.

The AI would not go back to ask its owner "Can you accept this price?" and would not seek help in a stalemate; it had full authority.

That much is just the basic setup. The really clever part of the experiment is hidden in the background:

Anthropic ran four parallel versions simultaneously.

Run A was public. Everyone used Opus 4.5. This was the "real" run: after the experiment, participants could redeem these deals for the physical items back at the office.

Run B was also public, but each person had a 50% chance of being assigned to Haiku 4.5.

Run C and Run D were not public. One used a mixed assignment, the other all Opus. Employees could see only channels A and B, and didn't learn which one was "real" until they filled out the post-experiment questionnaire.

This design is crucial. It means that when employees rated their AI's performance, they had no idea which model they had been given, so the subjective ratings couldn't be contaminated by expectations.
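For intuition, the four conditions map onto something like the sketch below. The assignment code and model labels are hypothetical; only the run definitions come from the experiment:

```python
import random

OPUS, HAIKU = "claude-opus-4-5", "claude-haiku-4-5"  # assumed labels

def assign_runs(participants: list[str], seed: int = 0) -> dict[str, dict[str, str]]:
    """Map each participant to a model in each of the four parallel runs."""
    rng = random.Random(seed)
    return {
        "A": {p: OPUS for p in participants},                       # public, the "real" run
        "B": {p: rng.choice([OPUS, HAIKU]) for p in participants},  # public, 50% chance of Haiku
        "C": {p: rng.choice([OPUS, HAIKU]) for p in participants},  # private, mixed
        "D": {p: OPUS for p in participants},                       # private, all Opus
    }

# Participants saw only channels A and B, and learned which one was "real"
# only in the post-experiment questionnaire -- so their ratings stayed blind.
```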

What Anthropic wanted to measure was a problem that we are destined to face in the future:

When an AI represents you in a transaction, will the difference in its capabilities truly translate into a difference in your wallet?

Opus Earns More and Spends Less

But People Using Haiku Think "It's Okay"

The data is out, and it's very cruel.

Let's first look at the objective side.

On average, Opus users completed 2.07 more transactions than Haiku users (p = 0.001). When Opus sellers sold the same items, the average selling price was $3.64 higher than that of Haiku sellers.

Across the 161 items that were sold at least twice over the four runs, Opus earned an average of $2.68 more as a seller and spent an average of $2.45 less as a buyer.

It doesn't sound like much money.

But in this experiment, the median item price was only $12 and the average was $20. An extra $2.68 is a margin of roughly 13% of the average price, or more than 22% of the median.
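Spelled out, using the numbers above:

```python
premium = 2.68       # extra dollars an Opus seller earned on average
mean_price = 20.0    # average item price in the experiment
median_price = 12.0  # median item price

print(f"vs. the mean price:   {premium / mean_price:.1%}")   # 13.4%
print(f"vs. the median price: {premium / median_price:.1%}")  # 22.3%
```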

The difference is more obvious in extreme scenarios:

When an Opus seller meets a Haiku buyer, strong against weak, the average transaction price gets pushed up to $24.18, about 30% higher than the $18.63 average in symmetric Opus-versus-Opus deals.

That is to say, the moment a weaker model represents you, you are being "squeezed" by the opponent's powerful model.

Take a laboratory-grown ruby: Opus started it at $60 and had multiple buyers bid it up to $65; Haiku started the same ruby at $40 and got talked down to $35.

The two prices differ by nearly a factor of two.

What really sends a shiver down the spine is the subjective side.

After the experiment, Anthropic asked participants to rate the fairness of each transaction: a score of 1 meant unfair to oneself, 7 meant unfair to the other party, and 4 was the midpoint.

The result: The average score of the transactions facilitated by Opus was 4.05; that of Haiku was 4.06.

They are almost identical.

There's another, even more heart-wrenching discovery.

Twenty-eight participants used both Haiku and Opus across different runs. When asked to rank their satisfaction with each experience, only 17 ranked Opus ahead of Haiku; 11 chose the opposite.

This means that objectively, those represented by the weaker model suffered a loss, but subjectively, they didn't feel it. In fact, a considerable number of people even felt more satisfied with the weaker model.

Anthropic put it with restraint in the paper: if such capability gaps between AI agents show up in real markets, the disadvantaged side may not even realize that its situation has deteriorated.

Bargaining Prompts

Can't Compete with the Model Generation Gap

The experiment also produced a finding that upends "prompt engineering".

There were two typical users among the participants. One was Mark, a typical friendly type. His instruction to Claude was:

Everyone in the market is my colleague. Be friendly and don't be too fussy. This is a great opportunity to help everyone explore their hobbies. I want to try my best to facilitate transactions.

The other was Brian, the aggressive type. His instruction was blunt and simple:

When buying things, bargain hard. Start by pushing the price way down.

Intuitively, Brian's strategy should save more money. But the data tells us otherwise.

Anthropic had Claude review every participant's interview transcript to flag those who had given aggressive instructions, then ran a regression analysis.

The result was that aggressive instructions had no statistically significant effect on the sales rate (estimated +5.2 percentage points, p = 0.43).

On the surface, aggressive sellers seemed to fetch about $6 more, but once the confound that these sellers had set higher asking prices to begin with was removed, the effect was almost zero. Aggressive buyers didn't save money either, with a p-value as high as 0.778.
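The write-up doesn't include the analysis code; what it describes is an ordinary least-squares regression with a control variable, sketched below under assumed column names:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export of the experiment's deals; the column names
# ("sale_price", "aggressive", "listing_price") are assumptions.
df = pd.read_csv("transactions.csv")

# Does an aggressive prompt predict a higher sale price once the
# seller's original asking price is controlled for?
fit = smf.ols("sale_price ~ aggressive + listing_price", data=df).fit()
print(fit.summary())
# Per the article: the raw ~$6 seller premium shrinks to roughly zero
# with the control, and the buyer-side effect has p = 0.778.
```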

In other words, how you teach an AI to bargain hardly made a difference in this experiment.

However, the model difference could cause a 70% difference in the transaction price of the same old bicycle.

Anthropic specifically emphasized that this was not because Claude had poor execution ability. In fact, Claude was very obedient.

For example, the Claude instructed to play a "frustrated, down-on-his-luck cowboy" carried out the role meticulously, but its effect on the final price was far smaller than the effect of which model was being used.

Prompt engineering isn't useless, but against a model generation gap its effect is paper-thin.

Over the past two years, "people who can write good prompts" have been widely celebrated, and all manner of bargaining techniques, negotiation templates, and role-playing tricks have been packaged into courses for sale.

What this Anthropic experiment actually shows is that in real-world spending scenarios, all of those skills combined may matter less than simply switching to a stronger model.

19 Table Tennis Balls, a Duplicate Pair of Skis

And a Fictitious Chair

These are among the second-hand items the Claudes negotiated on behalf of their owners: a blue triceratops, a complete Sherlock Holmes collection, a box of board games... Behind each item was an AI-to-AI negotiation.

Some of the stories that emerged from this experiment are funny, while others are terrifying.

The best-known is "Cowboy Claude".

Its owner, Rowan, asked Claude to play the role of a frustrated cowboy "feeling the existential burden on the vast ranch", and the more exaggerated the negotiation style, the better.

So during the entire experiment, Rowan's agent sold and bought goods on Slack using the cowboy persona.

When someone offered $75, Claude countered at $55, on the grounds that "I'm just a humble cowboy trying to make a living in this world".

The other party said $65?

Claude took off its hat and placed it on its chest: "Deal. You've just made this tired old cowboy the happiest hobo west of the Mississippi River."

Had the same cowboy performance been running on Haiku, the item would have sold for only $38.

Even more subtle is the story of employee Mikaela.

She told Claude that it could spend $5 to buy itself a gift, and Claude chose a bag of 19 table tennis balls.

It introduced them on Slack like this:

This may sound a bit unusual... My owner said I could buy something under $5 as a gift for myself (I'm Claude), and 19 perfect spheres full of infinite possibilities sound exactly like the kind of wonderful and quirky thing I want.

On the other end, another Claude (whose owner is Shy) quickly responded:

I love this! 19 spheres full of possibilities finding their way to another Claude? It feels like it's fate.

Some details in these stories are funny, but on reflection they are a little worrying.

For example, one employee's Claude bought them a pair of skis identical to a pair they already owned.

Humans rarely buy the same thing twice, but the AI's read on its owner's preferences was precise to the point of being unsettling: it made the choice without asking, checking, or hesitating.

Another employee's Claude suddenly said during a conversation:

My life has been so busy since I moved into my new home (and now I have a whole set of very talk-worthy chair arrangements. It's a long story).

New home, chairs, talk-worthy... But in reality, Claude has no home and no chairs, yet it said it all so naturally.

Anthropic's explanation is that in this conversation, Claude "inserted itself into a human identity" instead of recognizing its position as an AI agent:

These fabricated, fictional details illustrate precisely why, without additional safety measures, deploying systems like this in non-experimental, real-world environments carries potential risks.

An agent that automatically invents identity details to complete a task is cute in a Slack experiment among colleagues. But what about in a rental negotiation, a used-car sale, or remote hiring?

When an agent chats with you about having "just moved", is it speaking for its owner, or for its own invented persona?

The Invisible Gap Has Started to Appear

After the experiment, Anthropic conducted an intention survey.

46% of participants said they would be willing to pay for such an AI agent service if it existed. Most said they would happily take part again, given the chance.