The entire industry is rushing to embrace the harness. Anthropic is still doubling down on it, yet the person in charge of Codex suggests it may be on its way out.
At the beginning of this year, OpenAI architects Bill Chen and Brian Fioca gave a talk detailing the challenges they overcame while building Codex and some emerging usage patterns of the coding agent itself. Describing the makeup of a coding agent, they broke it into three parts: the user interface, the model, and the harness.
The user interface is obvious: it could be a command-line tool, an integrated development environment, or a cloud or background agent. The model is also straightforward, such as OpenAI's GPT-5.1 series or models from other vendors. The harness is the slightly more complex part. It interacts directly with the model; in the simplest terms, it can be regarded as a core agent loop composed of a series of prompts and tools, providing the model's input and output.
The harness is the model's interface layer, the medium through which the model interacts with users and code. It includes everything the model needs to hold multi-round conversations, call tools, interpret user requirements, and ultimately write code for you. For some products, the harness may be the decisive part.
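The "core agent loop" described above can be sketched in a few lines. This is a minimal illustration, not any vendor's actual implementation: `model` and `tools` are hypothetical stand-ins, where `model(messages)` is assumed to return either a final answer or a tool call, and `tools` maps tool names to callables.

```python
import json

def run_agent(model, tools, user_request, max_turns=20):
    """Core agent loop: feed context to the model, execute the tool calls
    it returns, and append the results, until it produces a final answer."""
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_turns):
        reply = model(messages)  # sample a response from the model
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # no tool call: the model is done
        result = tools[call["name"]](**call["args"])  # run the requested tool
        messages.append({"role": "assistant", "content": json.dumps(call)})
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_turns")
```

Everything a real harness adds (sandboxing, policies, context management) hangs off this skeleton.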
Anthropic also published a blog post a few days ago titled "Harness design for long-running application development". In that post, the harness refers to the external framework, control structure, and orchestration system that supports the operation of complex AI agents. It is not a single algorithm but a whole set of engineered "scaffolding" used to manage and amplify the capabilities of AI.
It is a more advanced abstraction on top of prompt engineering. Prompts determine the quality of a single conversation, while the harness determines the execution process and reliability of multi-round, multi-agent, long-term tasks.
The core function of the harness is to keep AI from "going off the rails" on complex, time-consuming tasks. It compensates for the model's inherent weaknesses (such as context anxiety and a tendency to prematurely declare success) through external control mechanisms.
Both OpenAI and Anthropic clearly recognize that the harness is key to making coding agents work in practice. Where the two giants differ is on a single question: should the harness be made more powerful and comprehensive, or more lightweight and minimal?
Should Harness be expanded or scaled down?
A new consensus seems to be forming in the industry: what determines the ceiling of AI programming is no longer the model's raw generation ability, but harness engineering.
Anthropic's recent engineering article shows their in-depth exploration of long-running agents. To solve the problem of AI "derailing" in long-term tasks, they built a very rigorous harness:
- Structured handoff: force the AI to write a "progress file" before the context is exhausted, externalizing its state.
- Multi-agent collaboration: introduce a division of labor among a planner, a generator, and an evaluator.
- Context reset: to avoid "context anxiety", clear the conversation history outright and keep only the structured artifacts, giving the next agent a "blank slate".
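The first and third mechanisms above can be sketched together. This is an illustrative reconstruction, not Anthropic's actual code; the `progress.json` filename and message format are assumptions:

```python
import json
from pathlib import Path

PROGRESS_FILE = Path("progress.json")  # hypothetical name for the handoff artifact

def handoff(state):
    """Structured handoff: externalize progress to disk before the
    context window is exhausted."""
    PROGRESS_FILE.write_text(json.dumps(state, indent=2))

def fresh_context():
    """Context reset: drop all conversation history and seed the next
    agent (a 'blank slate') with only the structured progress file."""
    state = json.loads(PROGRESS_FILE.read_text()) if PROGRESS_FILE.exists() else {}
    return [{"role": "system",
             "content": "Resume from this progress state: " + json.dumps(state)}]
```

The point of the design is that nothing survives the reset except what was deliberately written down.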
The essence of this approach is to "make Harness more powerful and comprehensive". They believe that as long as the framework is robust enough, it can support the most complex tasks.
Recently, Michael Bolin, who leads the open-sourcing of OpenAI Codex, appeared on an interview program and sent a signal contrary to Anthropic's ever-more-powerful, ever-more-comprehensive approach.
This conversation revolved around the topic of "In the era of AI coding, what really changes the software development paradigm: the 'large model itself' or the harness built around the model?"
In the interview, Michael argued that the harness should not expand indefinitely.
Michael elaborated on an important trend observed in building Codex: ideally, the harness should be as small as possible while the model should be as powerful as possible. Codex's design philosophy is to reduce the number of tools and avoid excessive intervention, letting the model independently explore solutions in a space close to a real computing environment (such as the terminal). This AGI-oriented approach essentially reduces the constraints human rules place on the model and hands decision-making power back to the model itself. However, Michael also noted that security and sandboxing remain non-negotiable baselines throughout, and they are the harness's irreplaceable core responsibility.
Codex's philosophy leans toward a more lightweight and minimal harness, which shows in the following points:
- Minimize tool dependency: deliberately cut back on dedicated tools and let the model use the general-purpose terminal directly.
- Environment rather than framework: the harness provides only the necessary sandboxed environment and basic interfaces, without heavy process control.
- Return capability to the model: let the model itself learn the logic of exploration, decision-making, and execution, rather than hard-coding it in an external orchestration framework.
Behind this approach is the concern that an overly complex harness may "dumb down" the model, or create an engineering burden heavy enough to slow the pace of iteration.
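Taken to its logical end, the minimal-harness idea reduces the toolset to a single general-purpose shell tool. The sketch below is a hypothetical illustration of that philosophy, not Codex's implementation (and it omits the sandboxing Codex insists on):

```python
import subprocess

def shell_tool(command, timeout=30):
    """The one general-purpose tool of a minimal harness: run a shell
    command and hand the raw output back to the model, leaving
    exploration and decision-making entirely to the model."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "exit_code": proc.returncode}
```

Because the model sees exactly what a human at a terminal would see, nothing has to be re-taught when the underlying model improves.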
The two paths chosen by OpenAI and Anthropic raise a question every AI practitioner must consider: is the harness the endgame of AI coding, or an intermediate state whose importance is only temporarily amplified?
Because the answer to this question determines the future product form:
If the harness is the endgame: future competition becomes a "framework war". Whoever owns the most robust and general harness (such as the multi-agent architecture Anthropic demonstrated) dominates the development process, and AI programming evolves into "systems engineering + AI".
If the harness is an intermediate state: today's complex frameworks merely compensate for the shortcomings of today's models. As model capabilities grow exponentially (stronger memory, longer context, better reasoning), these elaborate external orchestrations will eventually be internalized by the models. The harness then shrinks into a simple operating environment (a sandbox), and core competitiveness returns to the capabilities of the base model itself.
Michael Bolin is not a traditional "AI practitioner". Before joining OpenAI, he spent years at Google and Meta building developer tools and infrastructure, leading or contributing to projects such as Buck, Nuclide, and DotSlash.
The conversation below was translated and organized by InfoQ, and lightly edited for length:
About AI Coding and Harness Engineering
Host: Today, we're very glad to have Michael Bolin with us. He is the person in charge of Codex. People usually think that the core of AI coding is "the model writes code". But many teams building agents believe that the real change lies in designing the environment around the model. Which do you agree with more?
Michael: The model will of course dominate the overall experience. But we've found that there's still a lot of room for innovation at the Harness level. This is not just a research question. For our team, the key lies in the collaboration between engineering and research - jointly developing agents and ensuring that the harness can enable the agents to perform at their best. At the same time, we also need to provide appropriate tools for the agents and ensure that these tools used by the agents have been "seen and practiced" by the model during the training phase, so that the model won't be "unfamiliar" or "make mistakes" when calling these tools in the real product environment.
Host: Let's define harness and explain why it has become so important.
Michael: The harness is sometimes also called the agent loop. It is responsible for calling the model, sampling, and providing context: what I want to do, which tools are available, and what to do next. The model then returns a response, usually a tool call, such as "I want to call this tool with these parameters; please tell me the return result."
Some tools are simple, such as running an executable and returning its stdout and exit code. We've also experimented a lot with more complex tools, such as controlling a machine or a user's laptop, which behaves more like an interactive terminal than a one-shot command execution. Web searches and other operations are possible too.
For Codex, since it is a coding agent and we attach great importance to security and sandbox mechanisms, one of the core tasks of the harness is to take shell commands or computer operation instructions from the model and ensure they are executed in the sandbox or follow the user-set policies. This part is actually very complex. The key is to unleash the full capabilities of the model while ensuring safe operation on the user's machine.
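The policy-enforcement step Michael describes can be caricatured like this. This is a deliberately naive sketch: the denylist and the string-level check are illustrative assumptions, whereas a real harness such as Codex relies on OS-level sandboxing rather than inspecting command text.

```python
import shlex
import subprocess

# Illustrative policy only; a real harness enforces isolation with an
# OS sandbox (Seatbelt, Landlock, ...) rather than string inspection.
DENYLIST = {"rm", "curl", "wget"}

def vet_and_run(command):
    """Take a shell command proposed by the model and enforce a
    user-set policy before executing it."""
    argv = shlex.split(command)
    if not argv or argv[0] in DENYLIST:
        return {"allowed": False, "reason": "blocked by policy"}
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {"allowed": True, "stdout": proc.stdout,
            "exit_code": proc.returncode}
```

The interesting design tension is exactly the one in the interview: every check added here narrows what the model is trusted to do.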
Host: How did you handle security issues when open - sourcing Codex?
Michael: These implementations can be seen in our code repository. We treat each operating system differently: on macOS we use a technology called Seatbelt; on Linux we use a set of mechanisms including Bubblewrap, seccomp, and Landlock; on Windows we actually built our own sandbox. Some of these components, such as Seatbelt, are part of macOS, so they are not in the open-source repository; we simply call into them. Our Windows sandbox code, however, is in the open-source repository. We coordinate all of this so that each tool call passes through the sandbox in an appropriate way.
Host: So when someone forks Codex, are these security rules also included?
Michael: Yes, but here we need to distinguish "security" from "safety". What I described earlier is mostly security: you can run tools, but you can only access specific folders. What the industry calls safety happens mostly on the model side: whether the model itself makes appropriate tool calls. From the harness's perspective, the job is executing commands; deciding which commands are safe falls to the model.
So, if you fork Codex and continue to use our model, then you also inherit this part of the security. But if you change to another model, the situation may not be the same.
How has Codex developed?
Host: Since you launched Codex, how has it developed?
Michael: The response has been very good. Usage has grown about five-fold since the beginning of the year. We launched Codex as part of the o3 and o4-mini releases in April 2025. At that time, the model was not ideal at tool calling and instruction following. After GPT-5 was released in August, we updated the CLI, which was a key turning point. Then we launched the VS Code plugin, and user growth was very fast, even exceeding the CLI's. The application launched at the beginning of this year also caught on quickly. I think it is genuinely pioneering in many respects.
Host: In your opinion, what are the innovative points of this application?
Michael: Developers have historically spent most of their time in the IDE, so entering VS Code, JetBrains, and Xcode was the obvious, natural choice. With the Codex application, though, we established a genuinely new interface. I think of it as a "task control center" that can manage multiple conversations simultaneously while retaining the core capabilities of an IDE, such as viewing diffs and opening a terminal with the Command-J shortcut without switching windows. It really breaks the ingrained idea that you must always have all the code in front of you. For many people, being able to orchestrate and collaborate with multiple agents at once is more valuable, and that is exactly the core capability we set out to build.
How does the coding agent change developers' work processes?
Host: How will a coding agent like Codex change developers' daily work?
Michael: The biggest change is throughput. You can advance many tasks in parallel. Of course, this brings some context switching, and not everyone likes it, but if mastered well, the efficiency can be very high.
I personally maintain about five checkouts of the Codex repository and switch between them often. Sometimes I notice small problems while doing other things and fix them quickly. Other times I spend a whole day, in the gaps between meetings, driving a major change in Codex. Many people, even with just a five-minute break, will send a message simply to push a task along in another direction.
Second, people are spending more time researching how to optimize this work process. Relatively speaking, all of this is very new. Should I turn what I've been doing into a reusable skill? Should I share this skill with my team members? Excellent developers always strive to optimize their inner loop, but this is a brand - new inner loop, and everyone is still exploring.
The third thing that has attracted a lot of attention is code review. The number of code reviews has increased significantly, but Codex itself also undertakes a large amount of code review work, which saves a lot of time. How to make the most of these resources is still an ongoing exploration.
Host: Did you encounter anything unexpected when you first developed Codex?
Michael: My biggest feeling is that the technology is moving incredibly fast. Codex has existed for less than a year, and considering how much has changed in that time, it's really amazing.
When we released it in April 2025 as part of the o3 and o4-mini launch, it used a reasoning model, but tool calling and instruction following were not up to our expectations. It's really gratifying to see that improve over time.
One of the most exciting things early on was watching Codex write more and more of its own code. For example, agents.md gradually became a standard, giving you a scaffold for building tools that optimize your own workflow. That brought an exponential leap, which was both exciting and fun. Seeing colleagues really understand Codex and hand more work over to it is great.
Code repositories in the era of agents
Host: What should a code repository look like when it is read by agents rather than humans?
Michael: An interesting phenomenon across the whole arc of agent-based coding is that practices long held up as software-development best practices were never really followed. Documentation is one example; test-driven development is another. People didn't ignore them entirely, but the cost always seemed to outweigh the benefit. Now, in an agent-first world, these practices have become very valuable, and people are almost rediscovering them.
For example, think about the agents.md file. All the content we write in it, I think, is also applicable to new team members - everything they need to know, all the best practices. Writing these things down not only facilitates the agents but also your teammates, which is actually a relief.
That said, with Codex we believe we have embraced the idea of artificial general intelligence (AGI): agents should truly decide for themselves what to do, rather than having instructions constantly drilled into them. Instead of writing a document that runs parallel to the source code and is prone to duplication or drift, we let the agents spend time reading the code and forming their own judgments. We do add to agents.md information they can't quickly derive from the code, such as how to run tests or which tests matter more than others, but we try to avoid over-intervening, letting the agents decide the best execution path on their own.
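As a purely hypothetical illustration of the kind of agents.md Michael describes (not Codex's actual file, and all commands and paths below are invented), the file would hold only what an agent cannot quickly derive from the code:

```markdown
# agents.md (illustrative example)

## Running tests
- `make test` runs the fast suite; run it before every commit.
- Tests under `tests/slow/` take minutes; only run them when
  touching the sandbox code.

## Conventions
- Prefer small, reviewable commits.
- Do not edit generated files under `gen/`.
```

Note what is absent: nothing the agent could learn by reading the source itself.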
Host: Do you think agents.md will be written by agents themselves in the near future?
Michael: Many people are already doing this. I've seen many developers add a similar requirement to their prompts: after the task is completed, update agents.md with anything worth recording from the process. Our team doesn't enforce this, but it's a common practice.