
$280 per task, 1,000 engineers teach Claude to write better code

新智元2026-06-08 08:41
Anthropic's own engineers have long since largely stopped writing code themselves. Instead, they are paying around 1,000 external engineers $280 per task to personally guide Claude Code in writing high-quality code. At the end of the day, it is still humans who are nurturing cutting-edge AI models.

Recently, a report put the "secret to progress" of Claude Code on the table.

Business Insider reported that Anthropic has a special project to improve Claude Code, which is being refined through the feedback of about 1,000 software engineers.

The project, codenamed "Marlin", is run by the data company Snorkel AI.

As early as January this year, Boris Cherny, the person in charge of Claude Code, revealed that he hadn't written a single line of code by hand for more than two months. He let Claude submit 22 pull requests in one day and 27 the day before, all written by the model.

Some reports also said that most of Anthropic's internal code is generated by AI.

The interesting part is right here.

On one hand, Anthropic's core engineers have already handed over a large amount of coding work to the model; on the other hand, it is paying about 1,000 external engineers to teach Claude Code what "good code" is.

$280 per hour

What exactly is being bought?

According to Business Insider, the external engineers hired for the Marlin project all have a background in software engineering. Their work sounds very much like a real code review.

The process is roughly as follows. First, select a GitHub code repository from a list of thousands of repositories. Then create a PR, which is the step where developers submit code changes. Next, write a prompt to clarify the task.

The model will generate two sets of code, and what these external engineers need to do next is an A/B test: compare the two outputs and select the better one.
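One way to picture what each task produces is a pairwise preference record. The sketch below is illustrative only: the class name `PairwiseJudgment` and every field in it are hypothetical, and nothing here reflects the actual data format used by Snorkel or Anthropic.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class PairwiseJudgment:
    """One A/B comparison: a task prompt, two model outputs, and the expert's pick."""
    repo: str                       # repository the task was drawn from
    prompt: str                     # task description written by the engineer
    output_a: str                   # first candidate change from the model
    output_b: str                   # second candidate change from the model
    preferred: Literal["a", "b"]    # which output the engineer judged better
    rationale: str                  # short written justification

    def chosen(self) -> str:
        return self.output_a if self.preferred == "a" else self.output_b

    def rejected(self) -> str:
        return self.output_b if self.preferred == "a" else self.output_a


judgment = PairwiseJudgment(
    repo="example/repo",
    prompt="Refactor metadata handling without changing behavior.",
    output_a="diff A",
    output_b="diff B",
    preferred="b",
    rationale="B keeps the public interface intact and adds tests.",
)
print(judgment.chosen(), "preferred over", judgment.rejected())
```

Records of this shape, a chosen output paired with a rejected one, are the usual raw material for preference-based training, which is why the comparison step matters more than the code itself.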

Each task pays $280 and takes about an hour. Some tasks require several rounds of back-and-forth with Snorkel's review layer.

Outputs are judged on the correctness, security, reliability, and maintainability expected of production-level code.

Here are two real examples.

In one task, the external engineer asked the model to refactor the way the system processes execution metadata, with the goal of making the code clearer and more maintainable without changing its functionality.

In another task, the external engineer worked on a security fix for the open-source machine learning platform MLflow, targeting a command injection vulnerability that could occur when it downloads Python packages while loading models. The task brief was explicit: the fix must block command injection without accidentally breaking legitimate pip (Python package manager) options.
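To make the MLflow requirement concrete, here is a minimal sketch of the general technique, not MLflow's actual patch. It assumes two things that are not in the report: a hypothetical allowlist of pip options (`ALLOWED_OPTIONS`) and a hypothetical helper named `build_pip_command`. The idea is to build an argument list that never passes through a shell, and to reject any option that is not explicitly allowed.

```python
import sys
from typing import List

# Hypothetical allowlist: the options a deployment considers legitimate.
ALLOWED_OPTIONS = {"--index-url", "--extra-index-url", "--no-deps", "--pre"}


def build_pip_command(requirements: List[str]) -> List[str]:
    """Build an argv list for `pip install` from untrusted requirement strings."""
    argv = [sys.executable, "-m", "pip", "install"]
    for item in requirements:
        if item.startswith("-"):
            # Compare only the option name, so `--index-url=https://...` is accepted.
            name = item.split("=", 1)[0]
            if name not in ALLOWED_OPTIONS:
                raise ValueError(f"pip option not allowed: {name}")
        # Each item stays a single argv element; no shell ever parses it.
        argv.append(item)
    return argv


# Legitimate specifiers and allowed options pass through unchanged.
print(build_pip_command(["numpy>=1.26", "--no-deps"]))
```

The resulting list would be handed to `subprocess.run(argv, check=True)` without `shell=True`, so characters such as `;` or `|` inside a requirement string are never interpreted as shell syntax; pip simply receives them as one malformed argument. The allowlist is what keeps legitimate options working while refusing everything else.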

These tasks demand far more than data annotation. The work is closer to asking a senior engineer to transfer the "this one is better" judgment in their head to the model.

Obviously, what Anthropic is buying is not the code itself, but the judgment of senior programmers on how to write code more safely and cleanly.

Why does it have to be engineers?

Why does Anthropic go to such great lengths? Because Claude Code is no longer just a chat box for writing code.

Anthropic officially defines it as a project-level AI agent. It can read an entire codebase, make cross-file plans, directly execute modifications, run tests, and then iterate on its own based on the failed results.

Anthropic's official website defines Claude Code as an agent that can read codebases, make cross-file changes, run tests, and deliver committed code.

This means it will actually modify files and run tasks, and have access to the entire code project.

Anthropic itself is well aware of the importance of this. Therefore, it repeatedly discusses the issues of Claude Code's permissions, sandboxing, and approval fatigue in its engineering blog.

By default, high-risk file modifications or command executions require user approval. To reduce the approval fatigue caused by repeated authorizations, Anthropic has also introduced sandboxing to allow Claude Code to run more safely within the preset file system and network boundaries.

When an AI can run commands and modify live code, the cost of a mistake is completely different. The training goal has changed accordingly: from "writing correctly" to "writing safely, reliably, and maintainably".

Ordinary code corpora cannot train these capabilities. They used to be hidden in the code reviews of senior engineers and were passed down from person to person. Now, Anthropic wants to turn them into purchasable data by recruiting human programming experts.

Snorkel

The underestimated "data arms dealer"

The real protagonist of the whole thing is Snorkel.

This company emerged from the Stanford AI Lab in 2019, betting on only one direction: what really determines the success or failure of machine learning is data, not the model or computing power.

Two of Snorkel's key founders are Alex Ratner and his Stanford advisor Chris Ré, who together are regarded as the company's academic core.

Alex Ratner, co-founder and CEO of Snorkel AI

In 2015, Snorkel was just an "afternoon project" that Ratner started as a doctoral student. The idea: instead of paying people to label data one item at a time, use programs and rules for "weak supervision", so that a model can learn without item-by-item manual annotation.

With this approach, Snorkel has accumulated more than 60 papers, and its open-source tools have been used by Google and Intel. It wasn't until 2019 that it was officially spun off into a company.

Chris Ré, co-founder of Snorkel AI and Stanford professor

Ratner's advisor Chris Ré is also a remarkable figure.

He is a Stanford professor, a MacArthur Fellow, and a serial entrepreneur. The projects he participated in were acquired by Apple, and he also founded SambaNova, which was once valued at $5 billion.

What's most interesting is the transformation of this company.

What Snorkel wanted to solve back then was the long-standing problem of "manual annotation being slow, expensive, and unstable". At that time, about 80% of the time in AI development was spent on manual data annotation. Therefore, Snorkel's initial dream was to liberate people from annotation as much as possible.

But in the era of frontier models, the scarcest and most valuable resource is once again people, this time the taste and judgment of experts such as doctors, lawyers, and senior engineers. A company that started out trying to "use fewer people" now earns its most profitable business by organizing an expensive army of experts to train frontier AI. The Marlin project is just one example.

Its workflow also happens to meet the requirements of the Marlin project.

Snorkel's official website describes this workflow as follows: first specify the tasks, scoring criteria, and verifiers that define "what is good", then run the expert review pipeline, in which the author, multiple reviewers, and a final adjudicator each check the work in turn, leaving a record at every step.

According to Snorkel's official website, when reviewers disagree on a score, the disagreement is resolved through adjudication, and changes to the scoring criteria are logged. Every change can be traced back to who made it, when, and on what basis.

It also packages the evaluation environment and the data together, so that the same batch of tasks can be run repeatedly on different model versions to obtain reproducible, comparable scores. For the scores to stay clean and comparable, reviewers must not be influenced by knowing which version they are looking at. That is why these external engineers are not told which version they are evaluating.
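The review-then-adjudicate flow can be sketched in a few lines. This is a toy model under stated assumptions, not Snorkel's implementation: the `ReviewRecord` class, the `resolve` function, and the `max_spread` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class ReviewRecord:
    task_id: str
    reviewer_scores: Dict[str, float]          # reviewer id -> score
    adjudicator_score: Optional[float] = None  # filled in only on escalation
    log: List[str] = field(default_factory=list)


def resolve(record: ReviewRecord, max_spread: float = 1.0) -> float:
    """Return the final score, escalating to the adjudicator on disagreement."""
    scores = list(record.reviewer_scores.values())
    spread = max(scores) - min(scores)
    if spread <= max_spread:
        final = mean(scores)
        record.log.append(f"consensus: spread={spread:.1f}, final={final:.2f}")
        return final
    if record.adjudicator_score is None:
        raise ValueError("reviewers disagree and no adjudicator score is recorded")
    record.log.append(
        f"adjudicated: spread={spread:.1f}, final={record.adjudicator_score:.2f}"
    )
    return record.adjudicator_score


agreed = ReviewRecord("task-1", {"r1": 4.0, "r2": 5.0})
disputed = ReviewRecord("task-2", {"r1": 1.0, "r2": 5.0}, adjudicator_score=4.0)
print(resolve(agreed), resolve(disputed))
```

Appending to `log` on every path is the point of the sketch: each final score carries a trace of how it was reached, which mirrors the traceability the workflow description emphasizes.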

The pay rates also say a lot.

For a publicly posted legal contract role at Snorkel, each high-quality task pays $10 to $100, while the software engineering tasks in the Marlin project pay $280 per task and take about an hour. That is about two and a half times the hourly rate of its peers (Scale AI and Mercor pay engineers $110 per hour). Top experts can earn more than $3,000 a week.
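The comparison is simple arithmetic on the reported figures. The only assumption in the sketch below is that a task takes exactly one hour, where the report says "about an hour".

```python
import math

TASK_PAY_USD = 280        # reported pay per Marlin task
TASK_HOURS = 1.0          # assumption: "about an hour" treated as exactly one
PEER_HOURLY_USD = 110     # reported hourly rate at Scale AI and Mercor
WEEKLY_TARGET_USD = 3000  # reported weekly earnings of top experts

effective_hourly = TASK_PAY_USD / TASK_HOURS
ratio = effective_hourly / PEER_HOURLY_USD
tasks_for_target = math.ceil(WEEKLY_TARGET_USD / TASK_PAY_USD)

print(f"rate ratio: {ratio:.2f}")                        # rate ratio: 2.55
print(f"tasks to reach $3,000: {tasks_for_target}")      # tasks to reach $3,000: 11
```

On these numbers, earning more than $3,000 in a week means completing about 11 tasks, or a little over two per working day.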

The feedback from these external engineers recruited by Snorkel is really expensive.

Its client list includes Google, Mistral, and Anthropic. In May 2025, Snorkel completed its Series D financing and was valued at $1.3 billion.

Kate Jensen, the head of revenue at Anthropic, said that to fully unleash the potential of Claude, new evaluation methods that introduce domain experts and human feedback are needed. Anthropic will continue to cooperate with companies like Snorkel.

Companies like Snorkel, Scale, and Mercor were once regarded as "annotation platforms". Now they have become the invisible supply chain behind cutting-edge model companies.

It is such an invisible army of experts scattered around the world that feeds the smartest AI.

Several giants

Are competing for the same kind of data

It's not just Anthropic that is buying real engineering capabilities. Several major players are participating in this competition, but with different strategies.

Cursor takes the path of product data.

Its official policy states that when users enable privacy mode, their code is never used for training by Cursor or by third parties; only when privacy mode is turned off may it use codebase data, prompts, editing behavior, and code snippets to improve AI features and train models.

Cursor's Tab model produces more than 1 billion edited characters per day, and its request volume has grown about 100-fold since the initial version. Composer goes further: it is trained through reinforcement learning (RL), which lets the model learn to call editing, search, and other tools across a large number of coding task environments so that it can handle longer-horizon engineering tasks.

The latest Composer 2.5 focuses on long-horizon tasks that require hundreds of operations.

Elon Musk's approach is capital tie-ups and acquisition options.

In February this year, xAI was merged into SpaceX. At the end of April, SpaceX obtained the right to acquire Cursor's parent company Anysphere for $60 billion this year, or pay $1 billion first for in-depth cooperation. What Musk values is the real developer behavior data of the most active developers in the world in Cursor's hands.

On May 25, Musk announced on X that the new generation of the base model Grok V9-Medium has been trained, with 1.5T parameters, three times that of the current production model. He specifically pointed out that this is the result before additional training with Cursor data. After adding it, "the programming ability will be much stronger", and the model is expected to be released in mid-June.

In this way, V9 will be the first Grok that has "eaten" real developer behavior data systematically.

OpenAI's later Codex also took this path. Codex, released in 2025, is driven by codex-1. OpenAI said it is trained through reinforcement learning on real coding tasks, aiming to write code in a human-like style and in line with PR habits, and can run tests repeatedly until they pass; each task runs in an isolated sandbox pre-installed with your codebase.

Now Codex has been upgraded to OpenAI's agentic coding platform, driven by its cutting-edge coding model. According to Axios, the weekly user count has exceeded 5 million.

What they are competing for is actually the same thing: process data, but through different paths.

Anthropic already has a model but lacks feedback from real development sites. So it pays about 1,000 engineers to break down the software engineering process into learnable data.

Cursor has a product and real user behavior, as well as self-developed programming models such as Tab and Composer. But compared with OpenAI and Anthropic, it lacks a general base model and large-scale training computing power.

Musk also lacks data, so he is simply trying to spend tens of billions of dollars to buy a product entry point that continuously generates developer behavior data.

OpenAI lacks neither the model nor the product. So it builds its own sandbox and lets the model learn from trial and error, testing, correction, and iteration in real coding tasks through reinforcement learning.

Although these companies have different strategies, they all use data that is getting closer to the real engineering site to train their AI programming models.

The real moat

Is human taste and judgment

A paper called SWE-chat collected, for the first time, a large number of real agent coding conversations: 6,000 sessions, more than 63,000 user prompts, and 355,000 tool calls.

It came up with a sobering figure: only 44% of the code produced by the agent finally made it into the user's submission. More than half of it was deleted, modified, or thrown out by humans.

According to the SWE-chat test: vibe coding already accounts for 41% of the conversations, but only 44% of the code written by the agent finally makes it into the submission; users infer the model output