Understanding Harness Engineering in One Article: Tracing, Through 14 Engineering Articles, the Shell That Keeps AI on Track
In the first quarter of 2026, the dominant buzzword in the large-model application layer was, without question, "Harness".
In March of this year, LangChain published an empirical article titled "The Anatomy of an Agent Harness" that ignited both anxiety and fervor across the industry. The report cited a striking comparison: keeping the same large language model but swapping in a more sophisticated Harness architecture raised its pass rate on Terminal Bench 2.0 (an authoritative benchmark of AI programming ability) from 52.8% to 66.5%.
In this experiment, not a single byte of the underlying model's weights changed, and the compute engine stayed exactly the same. Swapping in a sophisticated "shell" alone lifted the ranking from outside the top thirty into the top five.
Since then, countless startups have rushed to package shells of their own. Harness has come to look like alchemy that turns stone into gold, the most presentable talking point and "moat" that application-layer companies bring to meetings with investors.
Amid the frenzy, however, the boundaries of the concept have been stretched and blurred beyond recognition.
What exactly counts as the shell, and what lies outside it? In pursuit of comprehensiveness, many introductory articles have lumped the rise of CLI (command-line) tools, the proliferation of markdown files, and even the recently popular external Skill packages into the Harness basket. In a sense this is fair: they are all engineering choices that help an Agent run better under the general logic of agent infrastructure.
But to truly grasp the hidden thread and main axis of Harness's technical evolution, we still need to trace the history of how the concept emerged.
Moreover, if you keep a close eye on the Anthropic team, the first to systematize Harness, you will notice something as of today: while the whole industry frantically lays bricks, they have quietly begun tearing down walls.
With the new version of Opus, they began removing, without hesitation, control components they had painstakingly built before.
Frantic addition on one side, decisive removal on the other. This fractured industry frenzy exists largely because most people have never truly read the engineering write-ups that charted the pitfalls of the past fifteen months.
Everyone saw the windfall of doubled final scores; almost no one saw the desperate bugs that forced those complex mechanisms into existence.
Today, we will crack this black box open. Follow the hard-won literature of these fifteen months, and see every real blueprint of this "constraint engineering" (Harness engineering).
01 The First Layer of Harness: From Notepad to Management System
Harness is actually not hard to explain. Just imagine the Agent as a car.
The model is the engine: high horsepower, high revs, roaring the moment you step on the accelerator. The interactive program that carries it is the wheels, and your Prompt is the steering wheel: turn it and the engine takes you forward. But an engine, a steering wheel, and wheels are not yet a car. You can't drive an engine down the road. You need a gearbox to translate power into smooth motion in the direction you chose, a dashboard to tell you how far you've traveled, and brakes to tell it when to stop. Everything that ties these together (how tasks get broken down so they run smoothly, how progress is recorded, how completion is judged) is the Harness, the shell.
The shell didn't come out of nowhere. It has a prehistory.
Large models are born with only one kind of memory: the context window. When the window fills up, earlier content gets squeezed out.
For short tasks, this is not really a problem. In December 2024, Anthropic published an engineering blog titled "Building effective agents", whose core advice was a single sentence: start with the simplest solution, and add complexity only when necessary. At the time, most Agent tasks were short sprints finished within minutes. The model's short-term memory was sufficient, and a well-written System Prompt (the "role manual" pre-loaded into the model) was enough to drive it to work.
But everyone wants the Agent to do bigger jobs.
In the first half of 2025, as models' reasoning ability improved, the tasks they could theoretically execute grew longer. But context became a real bottleneck. Although nominal context windows are now quite long (say, 1 million tokens), the effective attention span is far smaller, and even at full length the window cannot hold every detail of a long-horizon task. Humans pick out the key points when they remember; the model mostly cannot. In complex work, its memory is almost that of a goldfish.
To solve the problem of a painfully small effective context that left tasks unfinished, one of the earliest paths was to externalize memory. In March 2023, AutoGPT handed the model a blank notebook: permission to call write_to_file and read_file tools, and the job of managing its own memory. The carrier was a plain .txt file with no structural constraints; the model could write and delete whatever it wanted.
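AutoGPT's externalized memory boils down to two file tools exposed to the model. A minimal sketch (the tool names follow the article; the workspace layout and return messages are illustrative assumptions, not AutoGPT's actual implementation):

```python
import os

WORKSPACE = "agent_workspace"  # directory the agent is allowed to write to

def write_to_file(filename: str, text: str) -> str:
    """Tool: persist a note outside the context window."""
    os.makedirs(WORKSPACE, exist_ok=True)
    with open(os.path.join(WORKSPACE, filename), "w", encoding="utf-8") as f:
        f.write(text)
    return f"Wrote {len(text)} characters to {filename}"

def read_file(filename: str) -> str:
    """Tool: pull a note back into the context when needed."""
    path = os.path.join(WORKSPACE, filename)
    if not os.path.exists(path):
        return f"Error: {filename} does not exist"
    with open(path, encoding="utf-8") as f:
        return f.read()

# The model manages its own memory: no schema, no validation, no oversight.
write_to_file("notes.txt", "TODO: step 1 done, start step 2")
```

Note what is absent: nothing forces the model to read the note back, and nothing checks that what it wrote is true. That gap is exactly where the rest of this story begins.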
Left unmanaged, though, the model naturally wrote garbage. In March 2024, Devin upgraded the notebook into a structured Planner panel: the model's task plan was forced out into a visible progress list, with each step carrying an explicit status mark.
In February 2025, Claude Code was born, productizing the experience Anthropic had accumulated internally on SWE-bench. The combination of CLAUDE.md (a project-level instruction file) plus a scratchpad became the most widely imitated paradigm in the industry.
Yet even with such an externalized memory system, the context could still run out.
For this reason, in September 2025, Anthropic's applications team published an article titled "Effective context engineering for AI agents". Against the context problem, it proposed solutions along three directions, two of which were tricks for letting long-horizon tasks still finish inside a single context: improving efficiency, and compression.
The first trick was to improve context efficiency, that is, change how the context is written. First, the System Prompt should not be "write a paragraph and call it done"; it should be maintained like code, with version control, A/B testing, and prompt modules dynamically assembled by task type. Next, rewrite the tool descriptions, because a muddled or wrong tool description is both inefficient and a waste of context. They found that the model reads a tool description exactly the way it reads the system prompt: a tool's naming, parameter descriptions, and return-value format all directly shape the agent's decision quality. A badly written description is like handing a goldfish a confusingly labeled map. Finally, use external storage (RAG), fetching what is needed when it is needed instead of stuffing everything in at once.
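The "maintain prompts like code" idea can be sketched as a small module assembler. This is a hedged illustration only: the module names and wording below are invented for the example, not Anthropic's actual prompts.

```python
# Prompt fragments kept under version control, one module per concern.
# (Module names and contents are illustrative assumptions.)
PROMPT_MODULES = {
    "base": "You are a coding agent working inside a sandboxed repository.",
    "debugging": "Reproduce the failure with a test before changing any code.",
    "refactoring": "Preserve behavior; run the full test suite after each change.",
}

def assemble_system_prompt(task_type: str) -> str:
    """Dynamically assemble the system prompt from modules by task type."""
    parts = [PROMPT_MODULES["base"]]
    extra = PROMPT_MODULES.get(task_type)
    if extra is not None:
        parts.append(extra)
    return "\n\n".join(parts)
```

Because each module is a separate versioned string, A/B testing a prompt change means swapping one module, not rewriting a monolithic paragraph.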
The second trick was context compression and eviction. When the conversation grows too long, compact it: condense the earlier history into a summary, freeing token space for subsequent tool-call results. To prevent overflow, Anthropic simply set a sliding-window strategy, keeping only the raw text of the most recent N turns and replacing older ones with summaries. Meanwhile, the agent maintains a structured work-notes area in the context, updated at every step, so that information isn't "washed away" in long conversations. Finally, delete useless content from tool returns outright, so it doesn't become dead weight in the context.
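The sliding-window strategy above can be sketched roughly as follows. The summarizer is stubbed out here; in a real harness that step would be a model call, and the message format is a generic chat-style assumption.

```python
def summarize(messages):
    """Stub summarizer; a real harness would call the model here."""
    return f"[summary of {len(messages)} earlier messages]"

def compact_history(messages, keep_recent=4):
    """Keep the last `keep_recent` turns verbatim; fold the rest into a summary."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_msg = {
        "role": "system",
        "content": "Earlier conversation, compacted: " + summarize(older),
    }
    return [summary_msg] + recent
```

The design choice worth noticing: recent turns survive verbatim (they carry the live task state), while only the stale tail is lossy-compressed.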
This is Context Engineering: it manages information. It answers where to store information, how to retrieve it, and how to select it. It does not manage the process: whether the goldfish actually opens the notepad once it has one, whether it does what the notepad says after reading it, and whether anyone inspects the work once it's done.
No one clearly saw this distinction at the time, because Anthropic itself fell into the same pit.
In November 2025, they disclosed the experience in the article "Effective harnesses for long-running agents". Back in May 2025, Anthropic had wanted Claude to write a complete web application from scratch: not fixing a bug, but building an entire product. Such a task runs for hours. Even with externalized memory, the context window simply couldn't hold the whole process, and every new run started with the previous round's memory wiped, like engineers working shifts with no handover documents.
At first, they built a first-version work framework along Context Engineering lines, in two steps. First, send in an Agent to analyze the requirements, break them into more than 200 feature points, and generate a structured list. Then send in a coding Agent: do one feature per round, commit on completion, update the progress file, and leave it for the next round of itself.
The notepad was issued, memory was externalized, and the best practices of Context Engineering were followed. It sounded reasonable.
But in actual operation, it was a complete failure.
They found four failure modes.
The first was premature submission. After finishing three features, the Agent declared the "project complete"; seeing a pile of existing code, it assumed the work was done.
The second was environmental blindness. The Agent really was writing code, but a bug in the environment meant the code couldn't run, and the Agent had no idea.
The third was false completion marks. Features were ticked off as done while actually broken: the Agent ran unit tests after a change, saw them pass, and never noticed that nothing worked end to end.
The fourth was amnesiac-intern syndrome. Every new run (Session) burned a pile of tokens re-exploring the project structure, like a new intern asking over and over "which folder is the code in?"
So they realized that the Context Engineering notepad only solved "nowhere to store". The goldfish's problems went far beyond storage: sometimes it didn't open the notepad; even when it did, it often didn't do what was written there; and it had no capacity for self-verification.
The problem was not the notepad. The problem was that no one forced the goldfish to flip through it and follow it, and no one verified whether what the goldfish wrote was true.
This cognitive leap shifted Anthropic's response entirely, from "build a better notepad" to "build a complete management system around strict adherence to the work process".
Against false completion marks and premature submission, Anthropic realized it couldn't rely on Markdown externalization alone: the Agent cannot be both athlete and referee. At the start of a project, a dedicated "initializer Agent" generated the complete feature list as structured JSON (a machine-readable data format). The design allowed the "coding Agent" doing the actual work to change only a single field, marking a feature "pass" or "fail" in a strict, fixed flow. It could not delete features or edit descriptions, only flip status. And by the JSON's rules, the Agent could set a status to passing only after it had itself actually run and passed the tests, not because the feature "seemed about done".
In this setup, the question-setter's JSON became a physical anti-cheating lock: strong validation of this data pinned the progress bar down. The Markdown file still exists, but mainly to provide signposts rather than to enforce process. (This, incidentally, is one reason Skill compliance is poor today.)
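A minimal sketch of such a "physical lock" on the harness side. The field names and allowed statuses below are hypothetical (the real schema is not public); the idea is that the harness diffs the coding Agent's proposed feature list against the stored one and rejects anything other than a legal status flip.

```python
# Illustrative schema: each feature has an id, a description, and a status.
ALLOWED_STATUSES = {"pending", "passing", "failing"}

def validate_update(old_features, new_features):
    """Accept the update only if the sole change is a legal 'status' flip."""
    if len(old_features) != len(new_features):
        raise ValueError("features may not be added or removed")
    for old, new in zip(old_features, new_features):
        if set(old) != set(new):
            raise ValueError("fields may not be added or removed")
        for key, value in old.items():
            if key == "status":
                if new[key] not in ALLOWED_STATUSES:
                    raise ValueError(f"illegal status {new[key]!r}")
            elif new[key] != value:
                raise ValueError(f"field {key!r} is read-only for the coding agent")
    return new_features
```

Because the check runs in harness code rather than in a prompt, the Agent cannot talk its way past it: a "simplified" description or a deleted feature fails validation mechanically.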
Against the amnesiac intern, a three-step awakening ceremony was enforced at the start of every Session: run pwd (confirm the current directory), read git log (review the code-change history), and read progress.txt (see the next task). Like factory workers at shift change flipping through the handover book, the Agent's memory lives not in its own head but in the Git history and the progress file. If you don't trust the Agent to remember, then store the memory outside its body systematically, and force it to clock in, read the handover book, and confirm its workstation every time it starts work.
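The three-step ceremony can be sketched as a harness-side bootstrap that assembles a report before the Agent is allowed to work. The file names follow the article; the report structure and fallback messages are assumptions of this sketch.

```python
import os
import subprocess

def session_bootstrap(progress_path="progress.txt"):
    """Three-step 'awakening ceremony' run at the start of every session."""
    report = {}
    # Step 1: pwd -- confirm the current working directory.
    report["cwd"] = os.getcwd()
    # Step 2: git log -- review recent history (empty if not in a repo).
    try:
        report["git_log"] = subprocess.run(
            ["git", "log", "--oneline", "-5"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (subprocess.CalledProcessError, FileNotFoundError):
        report["git_log"] = "(no git history available)"
    # Step 3: progress.txt -- read the note left by the previous session.
    if os.path.exists(progress_path):
        with open(progress_path, encoding="utf-8") as f:
            report["progress"] = f.read()
    else:
        report["progress"] = "(no progress file: treat as a fresh project)"
    return report
```

The report is then injected into the Agent's first message, so "remembering" is a forced read, not something the model may or may not decide to do.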
The effect was immediate. The Agent could now run continuously for hours: one task per round, commit on completion, externalize state to the progress file; the next round reads the latest progress.txt and knows what to do.
Anthropic also added a harder backstop: archive every code change through Git. Once the model paints itself into a corner, git revert rolls the repository straight back to the last clean, runnable state, and the model is woken up again. Don't expect the goldfish to undo its own mistakes; give it a time machine.
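The "time machine" amounts to committing every change and reverting when the agent gets stuck. A rough sketch driving git from Python (assumes git is installed; the file names, messages, and inline identity are illustrative, not Anthropic's actual tooling):

```python
import pathlib
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command in `repo` with an inline identity; return stdout."""
    cmd = ["git", "-c", "user.email=harness@example.com",
           "-c", "user.name=harness", *args]
    return subprocess.run(cmd, cwd=repo, capture_output=True,
                          text=True, check=True).stdout

def rollback_last_change(repo):
    """'Time machine': undo the most recent commit without rewriting history."""
    git(repo, "revert", "--no-edit", "HEAD")

# Demo: commit a good state, commit a bad one, then roll back.
repo = tempfile.mkdtemp()
git(repo, "init")
main = pathlib.Path(repo, "main.py")
main.write_text("print('works')\n")
git(repo, "add", "main.py")
git(repo, "commit", "-m", "good state")
main.write_text("raise RuntimeError('agent went down a dead end')\n")
git(repo, "add", "main.py")
git(repo, "commit", "-m", "bad state")
rollback_last_change(repo)  # working tree is back to the runnable version
```

Using revert rather than reset keeps the dead end in history, so the next session can see, via git log, which path already failed.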
When the message history fills the context window, the Harness wipes the goldfish's mind entirely: it starts a brand-new Agent and passes the previous round's state and the next task through a structured handover file. Anthropic calls this Context Reset: not compressing the memory, but swapping in a fresh goldfish with nothing except a written handover note. This is more radical than plain summary compression (Compaction), because Anthropic found that even with a compacted history, the model still grew anxious and lost coherence in extremely long contexts. Only a full wipe and a blank sheet let it refocus its attention.
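Context Reset can be sketched in two halves: the harness serializes a handover file, then builds the new Agent's context from nothing but the system prompt and that file. The field names and message format below are illustrative assumptions.

```python
import json

def write_handover(path, completed, next_task, notes):
    """Old session's last act: leave a structured note for its successor."""
    state = {"completed": completed, "next_task": next_task, "notes": notes}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)

def fresh_context(path, system_prompt):
    """New session starts with ONLY the prompt and the handover note."""
    with open(path, encoding="utf-8") as f:
        handover = f.read()
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": "Handover from previous session:\n" + handover},
    ]
```

Unlike compaction, nothing from the old transcript survives except what was deliberately written down, which is exactly the point: the new goldfish gets a short, clean brief instead of a long, anxious history.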
By this point, Anthropic's management system was fairly complete: the JSON physical lock handled false marks, the three-step awakening handled amnesia, the Git archive handled undoing, and Context Reset handled brain capacity. But this system only governed the process: the goldfish must clock in, open the notepad, and work through the list.
It did not govern something else: whether the information in front of the goldfish was correct and up to date.
If the signposts written in the notepad are outdated or incomplete, then no matter how strict the process, it only makes the goldfish run more diligently along the wrong map.
So what then? Besides tightly controlling the process, there is another way: tightly control the notepad and its repository.
OpenAI built on exactly this logic.
In the article "Harness engineering: leveraging Codex in an agent-first world" (February 2026), they described an experiment begun in August 2025 from an empty repository. Three engineers wrote not a single line of code. All of it (application logic, tests, deployment configuration, documentation, monitoring tools) was generated by the Codex Agent.
What did the humans do? Design the Agent's working environment. In their own words: "humans steer, and the Agent executes".
In five months: one million lines of code, fifteen hundred PRs (pull requests), and zero lines typed by hand. The team later grew to seven people, and throughput kept rising.
What they arrived at pointed in the same direction as Anthropic, but with a stricter formulation: Repo-as-truth.
From the perspective of the Agent, what it cannot access during operation does not exist. The discussions on Slack do not exist