
6 hours, $200, 0 human code: Anthropic pushes AI programming past the tipping point

新智元2026-03-31 15:30
An AI team collaborates to deliver a complete project, heralding an era of creative equality.

[Introduction] Code hasn't disappeared, but it's no longer the privilege of a few. In the AI era of "creative equality," what's truly scarce is no longer programming ability, but whether you have an idea good enough to be worth having a machine burn hundreds of dollars of compute on it.

What really makes people uneasy is not that AI improves productivity, but that AI is starting to reshape the relations of production.

Anthropic's most unsettling advance is not that AI can write code, but that AI has started to complete projects independently.

A one-sentence requirement, six hours, and $200.

There was no product manager, no programmer, no designer, and no human wrote a single line of code in the entire process.

Anthropic threw Claude into a task: to create a complete retro game editor.

As a result, Claude didn't just deliver a decent page.

It broke down the requirements by itself, wrote the code by itself, tested by itself, reworked by itself, and finally delivered a finished product that could actually run.

In this experiment by Anthropic, AI is no longer just generating code, but approaching delivery.

In the past, when we talked about AI programming, we talked about how fast it could write. Now the question has become: can it work continuously for hours, stay on track through the 5th or 10th round of revisions, and finally deliver the product?

The answer Anthropic gave this time is: Yes.

But the premise is not to use AI as an individual, but to organize it into a team.

Original link: https://www.anthropic.com/engineering/harness-design-long-running-apps

AI is not unintelligent, but unstable

The AI of the past was very much like a gifted intern.

In the first version, it charged forward vigorously.

The first page was produced very quickly.

The first round of code also seemed okay.

But as the task dragged on, it started to go wrong:

The logic became scattered, and the context was lost.

What should have been fixed wasn't, and what should have been tested wasn't.

The most troublesome thing is that it often enters a state of "seeming to be finished" prematurely.

Anthropic's diagnosis was precise: the problem may lie not in intelligence, but in long-horizon execution.

Anthropic ran a controlled experiment, and the results were quite harsh.

In the single-agent mode, AI spent 20 minutes and $9 to create something "like a game editor".

The problem is that the flaws showed the moment you used it:

The interaction wasn't working; the entities didn't respond properly; the core gameplay malfunctioned directly.

This shows one thing:

Previously, people always thought that AI was not good enough because it wasn't smart enough.

Now it seems that in many cases, what really holds AI back is not its IQ, but its stability.

When AI can't remember, many people's first reaction is: just give it a larger context window.

It sounds reasonable, but Anthropic poured cold water on this idea this time.

A larger window doesn't necessarily mean stronger. In many cases, it just amplifies the chaos.

As more and more things are piled up, the truly important main thread is more likely to be buried. This is what's called "context decay".

What's even more troublesome is that the model tends to overestimate itself.

Anthropic found that although the program crashed as soon as it ran, the model thought it had done a good job.

So the single agent falls into two traps: on one hand, the code becomes more and more chaotic; on the other hand, the more chaotic it gets, the more it thinks it's okay.

This is why, simply relying on a larger model, a longer window, and a higher token limit, AI cannot independently complete project delivery.

To make a breakthrough, Prithvi Rajasekaran, a member of Anthropic Labs, explored some novel AI engineering methods.

These methods are applicable in two completely different fields: one defined by subjective taste, and the other based on verifiable correctness and usability.

Inspired by Generative Adversarial Networks (GANs), he designed a multi-agent structure that includes a generator and an evaluator.

Anthropic didn't build a "superman"; it built a dream team

The most crucial change this time isn't the parameters, the window size, or some magic prompt.

The real change is that Anthropic no longer forces a single AI to complete the entire project alone.

It starts to let AI divide labor.

This structure is very much like a small product team.

Planner is responsible for thinking clearly. It first expands a vague requirement into specifications and defines what the product is going to do.

Generator is responsible for taking action. It writes code, builds the front and back ends, connects the interaction, does the integration, and makes progress round by round.

Evaluator is responsible for finding mistakes. It's not responsible for being polite. It's only responsible for acceptance. It clicks on pages, tests buttons, checks databases, tests interfaces, finds problems one by one, and sends them back for rework.

The last step is particularly crucial because if an AI writes and scores itself at the same time, it's easy for it to convince itself that "good enough is good enough".

But by separating the two, many problems that would have been glossed over can't pass.
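The plan-build-accept division of labor described above can be pictured as a simple loop. The sketch below is purely illustrative: the toy task, the function names, and the round limit are assumptions for this example, not Anthropic's actual harness code.

```python
# Toy sketch of a planner/generator/evaluator loop.
# All names and the toy task are illustrative assumptions,
# not code from Anthropic's harness.

def plan(requirement: str) -> list[str]:
    """Planner: expand a one-sentence requirement into acceptance criteria."""
    return [f"{requirement}: criterion {i}" for i in range(1, 4)]

def generate(criteria: list[str], done: set[str]) -> set[str]:
    """Generator: each round, make progress on one unmet criterion."""
    for criterion in criteria:
        if criterion not in done:
            return done | {criterion}
    return done

def evaluate(criteria: list[str], done: set[str]) -> list[str]:
    """Evaluator: independently check every criterion; return what still fails.
    Crucially, it never trusts the Generator's own claim of being finished."""
    return [c for c in criteria if c not in done]

def run_project(requirement: str, max_rounds: int = 10):
    criteria = plan(requirement)
    done: set[str] = set()
    for round_no in range(1, max_rounds + 1):
        done = generate(criteria, done)
        issues = evaluate(criteria, done)
        if not issues:                  # acceptance passed: deliver
            return round_no, sorted(done)
    raise RuntimeError("ran out of revision rounds")

rounds, delivered = run_project("retro game editor")
print(rounds, len(delivered))  # -> 3 3
```

The point of the structure is visible even in the toy: the loop only terminates when the independent `evaluate` step finds nothing to send back, not when `generate` decides it is done.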

Take that retro game editor as an example. Planner initially only received a one-sentence requirement.

But what came out of it was a specification covering 16 features across 10 sprints.

Sprite animation, a sound-effects system, behavior templates, AI sprite generation, a level-design assistant, export and sharing: all of it was broken out into the plan.

This is no longer just "AI writing code". AI starts to learn to make products like a team.

What really improves quality is high-pressure acceptance testing

Many AI products today have a common characteristic - they look complete, with safe color schemes and regular layouts.

You can't find any major mistakes, but they also lack soul. This kind of thing is called AI Slop. In other words, it's "a shoddy imitation of a finished product".

Obviously, Anthropic is not satisfied with this result.

So it not only asks the Evaluator to check for bugs, but also to keep an eye on four things:

Design quality, originality, craftsmanship, and functionality.

Moreover, it deliberately increases the weight of "originality" and "design quality".

Translated into plain language, it means: Don't always submit the safest answers. Create something that really looks like a work.
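One way to picture this re-weighted acceptance is a scoring rubric. The four dimensions come from the article; the specific weights, scores, and pass threshold below are made-up assumptions for illustration, not Anthropic's actual values.

```python
# Illustrative weighted rubric. The four dimensions are from the article;
# the weights and example scores are assumptions, not Anthropic's values.

WEIGHTS = {
    "design_quality": 0.3,   # deliberately weighted up
    "originality":    0.3,   # deliberately weighted up
    "craftsmanship":  0.2,
    "functionality":  0.2,
}

def overall(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# A build that works but is generic "AI slop": functional, safe, soulless.
slop = {"design_quality": 0.4, "originality": 0.2,
        "craftsmanship": 0.8, "functionality": 0.9}
print(round(overall(slop), 2))  # -> 0.52
```

Under an even weighting the same build would score 0.575; shifting weight onto originality and design quality drags it down to 0.52, which is exactly the mechanism that punishes "the safest answer."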

There is a very important signal behind this:

Many people think that AI's creativity comes from a flash of inspiration, but in many cases, AI's creativity is precisely forced out by high standards.

So, the truly scarce ability in the next stage may not be "who is better at generating", but "who is better at evaluating".

How good you are at finding mistakes determines how far AI can ultimately go.

The scariest thing is that AI can really make revisions up to the 10th round

What's most disturbing about this experiment is that Claude started to form a genuine closed loop.

Let's still look at RetroForge, that retro game editor.

It's still the same one-sentence requirement.

The single-agent version took 20 minutes and cost $9. It was fast and cheap, but it was more like an empty shell.

The three-agent version took 6 hours and cost $200. It was much more expensive and slower, but the final result was on a completely different level.

It really went through 27 acceptance criteria one by one.

What was exposed here are real software engineering problems. For example:

The function was written, but the event wasn't triggered.

The interface was there, but the routing order was wrong, and the parameters were parsed incorrectly.

This shows that what it's doing is no longer just piecing together pages. It's starting to enter the real engineering field.
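The routing-order bug mentioned above is a classic of this engineering field. In any first-match-wins router, registering a broad pattern before a more specific one silently swallows the specific route. The sketch below is a toy router written for illustration, not code from the experiment; real web frameworks have their own matching rules.

```python
# Toy first-match-wins router, illustrating the routing-order bug class.
# Purely illustrative; not code from Anthropic's experiment.

routes = []  # list of (prefix, handler); first match wins

def route(prefix):
    def register(handler):
        routes.append((prefix, handler))
        return handler
    return register

def dispatch(path):
    for prefix, handler in routes:
        if path.startswith(prefix):
            return handler(path)
    return "404"

@route("/items/")          # BUG: the catch-all is registered first...
def get_item(path):
    return f"item {path.rsplit('/', 1)[-1]}"

@route("/items/export")    # ...so this specific route is unreachable
def export_items(path):
    return "export"

print(dispatch("/items/export"))  # -> item export  (the export route never fires)
```

Everything looks present, the server never crashes, and yet one feature silently returns the wrong thing. Bugs of this shape are exactly what only a click-through acceptance pass, rather than a compile check, will catch.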

Another example is even more exaggerated.

Claude spent less than 4 hours and about $124.70 to build a DAW, a digital audio workstation, that runs in the browser.

It has an arrangement view, a mixer, a transport control, and a real-time waveform preview.

It also has a built-in AI agent that can directly understand natural language music instructions.

You tell it the rhythm, key, melody, drum track, and reverb, and it keeps building on the piece.

What's more crucial is that the Evaluator didn't let it off easily.

The very fact that problems were found and sent back is what proves the system had formed a real closed loop:

It's not just about finishing, but also being sent back for revisions. It's not over until it passes the acceptance.

This is the most difficult and valuable part of software development.

The first version is never difficult. The difficult part is the 8th, 9th, and 10th versions.

The real watershed: for the first time, AI revised repeatedly until delivery

What the industry should be most vigilant about this time from Anthropic is not that it made Claude a stronger programmer.