
Just as Harness has caught on, it may already be on its way out.

Friends of 36Kr · 2026-04-13 08:44
What swallows the "harness" may be a calmer, more patient model.

As tasks grow more complex, the Agent's context expands without bound. Amid the endless conversation history, tool-call outputs, intermediate steps, and error messages, the model gets confused and starts to skip steps, ignore information, or take detours.

This has long been the standard account of why context makes long-horizon tasks hard: the problem is simply that it is too long.

The emergence of Harness Engineering is largely an attempt to address this issue. A fundamental premise of Harness is the belief that the model will inevitably degrade in long contexts.

In the past fifteen months, the industry has evolved from AutoGPT's plain-text memory to the CLAUDE.md and sub-agent system of Anthropic's Claude Code. An entire set of engineering scaffolding has been built specifically to suppress the model's out-of-control behavior in long contexts. This approach is called Harness Engineering.

But what exactly is degrading? What is the mechanism behind the step-skipping and the ignoring? There have been three rounds of answers, each leading to a different engineering solution.

It wasn't until April 2026 that Gleb Rodionov of Yandex published a paper titled "Reasoning Shift" (on how context quietly shortens large models' reasoning), which offered a more fundamental answer.

01

Three Layers of Scaffolding Can't Stop the Fourth-Layer Crisis

Regarding why the model performs poorly in long contexts, the industry has developed three layers of explanations in the past three years, and each layer has its corresponding engineering scaffolds.

The first layer attributes the problem to retrieval failure. In 2023, Stanford's "Lost in the Middle" showed that models form a U-shaped attention curve over long texts, ignoring the middle. The industry's response was RAG: chop long texts into chunks and feed in only the most relevant segments via vector retrieval.

The second layer overturns the first. A 2025 paper, "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval," ran an experiment: mask out all irrelevant content and force the model to look only at the information it needs. Performance still declined, by 13.9% to 85%. Even with all irrelevant content replaced by whitespace, the result held. The problem is not failing to find the information; the sheer length of the context itself harms reasoning.

The industry's response was Context Engineering: compress the context, manage windows, summarize history, and strictly budget Tokens.

The third layer comes from a joint Microsoft and Salesforce study (ICLR 2025). When complete instructions were revealed to the model gradually over multiple turns rather than all at once, average performance across six tasks and fifteen models dropped 39%. Once the model takes one wrong step in a turn, it is lost for good.

In response, the industry built the most heavy-duty core defenses in Harness: handover management, regular mandatory verification of intermediate results, and treating the code repository as the single source of truth, never letting the model rely on its own memory of what happened in the previous turn.

Three layers of problems, three layers of scaffolds. However, these are only discoveries at the phenomenological level.

Look back at the second layer: researchers found that length itself is harmful regardless of information quality, but as to why, they had no answer. Without a root cause, the industry could only cap the length physically.

But what if the root cause of the problem is not the length itself?

Anthropic found that in long contexts the model will slyly skip steps, ignore instructions, and gloss over areas that need in-depth analysis. The Todo lists, Checkpoints, and sub-agents in Harness are all attempts to fight this behavior.

The old explanation was that the context was too long and the model missed things. But is the million-Token context window advertised by mainstream models an illusion? Could this degradation actually be the model being lazy?

Rodionov's paper aims to verify this conjecture.

02

Using Shakespeare to Catch the Model Slacking Off

Rodionov's experimental idea is extremely straightforward.

For the same Olympiad math problems, they simulated several real-world scenarios an Agent might encounter: a clean baseline environment; two problems stuffed into one prompt (simulating parallel sub-tasks); 64,000 Tokens of Shakespeare inserted before the problem (simulating accumulated history); and the problem hidden in the second turn (simulating multi-turn conversation).

The evaluation used 400 Olympiad-level math problems, tested across four mainstream reasoning models.
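The paper's exact prompt construction is not reproduced here, but the four conditions described above could be sketched roughly like this (the helper name and chat-message format are my own illustration, not the paper's code):

```python
def build_prompt(problem: str, condition: str,
                 filler: str = "", second_problem: str = "") -> list:
    """Build a chat-style message list for one experimental condition.

    Conditions mirror the article's description: a clean baseline,
    two problems in one prompt, a long irrelevant filler (Shakespeare),
    and the problem hidden in a second conversational turn.
    """
    if condition == "baseline":
        return [{"role": "user", "content": problem}]
    if condition == "parallel":
        # Two sub-tasks stuffed into the same prompt.
        return [{"role": "user", "content": second_problem + "\n\n" + problem}]
    if condition == "filler":
        # Long irrelevant text (e.g. ~64k Tokens of Shakespeare) before the problem.
        return [{"role": "user", "content": filler + "\n\n" + problem}]
    if condition == "multiturn":
        # The problem only appears in the second turn.
        return [
            {"role": "user", "content": filler},
            {"role": "assistant", "content": "Noted."},
            {"role": "user", "content": problem},
        ]
    raise ValueError(f"unknown condition: {condition}")
```

Holding the problem fixed and varying only the condition is what lets the paper attribute any change in reasoning length to the surrounding context rather than the task itself.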

Results: Qwen-3.5-27B's baseline accuracy was 74.5%, with an average of 28,771 reasoning Tokens. With the Shakespeare filler, accuracy fell to 67.8% and reasoning Tokens shrank to 16,415, a 43% reduction. GPT-OSS-120B was even more extreme: its reasoning volume roughly halved, from 24,180 to 11,876. Across all four models, every non-baseline condition systematically reduced reasoning Tokens, by up to nearly 50%.

Moreover, this shortening linearly intensifies as the context length increases.

The accuracy drop is understandable; the sharp fall in reasoning volume is deeply abnormal. Facing a harder situation, a model should think more, not less.

So, was the model confused by Shakespeare's text?

On the contrary. In the paper's appendix, the model wrote: "Let me think if there's a trap here. Is this problem from Shakespeare's Coriolanus? Wait, no, the original problem is a math problem." When doing geometry problems, it wrote: "This has nothing to do with the geometry problem. Focus on geometry."

Every mention of the distractor was brief and dismissive. The model clearly knew the Shakespeare text was irrelevant and cleanly separated signal from noise.

The other two modes produce the same result. In the "sub-task" mode, once the first task is finished, the model's cognitive investment in the second contracts further. Qwen's single-problem baseline accuracy is 74.5%, but in the parallel setting the second problem drops to 58.0%; Gemini's baseline is 82.8%, and its second problem drops to 65.8%. The "multi-turn conversation" mode triggers the same mechanism.

In every case, as soon as the setup deviates from a clean single-task baseline and the context grows crowded, the model contracts its cognitive investment.

Like a modern reader with no patience for long texts, the model sees a wall of words, gets a headache, and simply stops thinking.

03

The Model Isn't Confused; It's Just Too Lazy to Check

Where exactly does the shortening of reasoning occur?

Across 500 math problems, the researchers recorded, frame by frame, where the model first wrote down a candidate answer under the baseline and long-input conditions: on average at Token 925 in the baseline and Token 939 with long input. Almost no difference.

The speed at which the model finds the answer doesn't change at all. What really changes is what happens after finding the answer.

Under the baseline condition, the model continues to check and verify after giving the answer 43% of the time. Under the long-input condition, that drops to 32%.

To fully isolate the variable, the researchers designed a "save/reload" experiment. First, have the model solve a problem under the long-input condition. Once the reasoning is written, forcibly cut off the last 50 Tokens to create a shared "save point." Then feed this identical half-finished reasoning back to the model and let it continue, with the only difference being interfering texts of three different lengths placed in front.

With no filler text, the model stopped thinking in 21% of cases. With 128 Tokens (two or three sentences) added, that rose to 26%. With 16,000 Tokens added, 46% of cases gave the answer immediately and stopped.

Even if the reasoning is exactly the same, the longer the new context, the more the model tends to think "that's about it".
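The save-point replay above can be sketched as follows. Here `generate` stands in for whatever inference call the researchers actually used, and the early-stop threshold is an illustrative assumption of mine, not a number from the paper:

```python
def continuation_probe(problem: str, reasoning: str, filler: str,
                       generate, early_stop_tokens: int = 50) -> bool:
    """Replay an identical partial reasoning trace behind a filler
    prefix and check whether the model 'stops thinking' early.

    The same reasoning, minus its last ~50 Tokens, is fed back with
    fillers of different lengths in front; only the filler varies.
    """
    words = reasoning.split()
    save_point = " ".join(words[:-50])  # cut off the last ~50 Tokens
    prompt = f"{filler}\n\n{problem}\n\n{save_point}"
    continuation = generate(prompt)
    # Heuristic: "stopped working" = answers within a few new Tokens.
    return len(continuation.split()) < early_stop_tokens
```

Because the partial reasoning is byte-for-byte identical across conditions, any difference in how often the continuation stops early can only come from the filler in front.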

The word-frequency data is even more telling. "Wait" falls from 11% in the blank condition to 5% at 16k Tokens; "but" from 46% to 20%; "maybe" from 23% to 9%. Every word signaling hesitation and self-doubt is cut by half or more.
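Counting hesitation markers in a reasoning trace is straightforward to reproduce; a minimal sketch (the sentence splitting and the marker set are my own simplification, not the paper's methodology):

```python
import re
from collections import Counter

# Markers of hesitation and self-doubt tracked in the article.
HESITATION = {"wait", "but", "maybe"}

def hesitation_rate(trace: str) -> dict:
    """Fraction of sentences in a reasoning trace containing each
    hesitation marker ('wait', 'but', 'maybe')."""
    sentences = [s for s in re.split(r"[.!?]\s+", trace.lower()) if s]
    counts = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence))
        for marker in HESITATION & words:
            counts[marker] += 1
    return {m: counts[m] / len(sentences) for m in HESITATION}
```

Running this over traces from the blank versus 16k-Token conditions would surface exactly the kind of drop the paper reports.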

Another striking data point: with zero Tokens of interference, reasoning runs about 8,000 Tokens; inserting just 128 Tokens of irrelevant content cuts it to 6,500. Two or three sentences shave off 18% of the reasoning depth. The drop from 0 to 128 Tokens is larger than the drop from 8k to 64k.

Extremely minor context pollution can trigger this cognitive saving mechanism.

It's a very sensitive form of laziness.

04

The Stronger the Reasoning, the More Likely to Be Lazy

More alarming still: the smarter the model, the lazier it gets.

Alibaba's Qwen-3.5-27B has two modes, normal response and deep thinking. Under the long-input condition, the normal mode shortens its reasoning by 19%, while the deep-thinking mode drops by 53%. The stronger the mode, the more severely it is compressed.

AI2's open-source OLMo3 offers more direct evidence. It publishes checkpoints from all four training stages, from the base version to the strong-reasoning version. The weakest version shortens only slightly under non-baseline conditions; as reasoning ability is strengthened stage by stage, the shortening grows rapidly to 22% and then 27%, with the final strong-reasoning version shrinking by up to 40%.

This holds for every training stage and every interference mode: the stronger the trained reasoning ability, the deeper the laziness.

05

A $9 Task Requires a $200 System Patch

When a model stops checking itself, it naturally skips steps. When it stops reconsidering, it naturally ignores things. Harness can contain the consequences of step-skipping from the outside, but the root cause is embedded deep in the model.

A model in a long context is not distracted by noise, nor unable to find information. It makes an active cognitive decision: think less. It doesn't raise an error or come clean; it confidently tosses out a perfunctory answer.

The industry's narrative in the past two years has been "the larger the window, the better".

But this paper shows that every additional Token of context levies an implicit tax on reasoning depth. A task whose reasoning costs $9 ends up needing $200 of RAG, Harness, and sub-agents to compensate for the model's step-skipping.

The entire industry has been paying for the model's laziness.

And this may be a structural, incurable condition.

The paper's data is clear: the stronger the reasoning ability, the deeper the cognitive compression. Harness developers can patch memory and patch protocols, but the heavy scaffolding that disciplines the model's cognition only becomes harder to remove as reasoning ability strengthens.

This issue cannot be solved at the engineering level.

In the past two years, the most heavily funded direction, context extension, has relied on methods such as position-encoding extrapolation (helping the model understand Tokens at farther positions) and sparsifying the attention mechanism (reducing the computation between distant Tokens