Harness: When models are no longer the bottleneck, the problems of AI are just beginning.
Recently, a term originally used mainly in engineering contexts has begun to appear frequently in Chinese tech circles: Harness.
Like many technological concepts, it has spread faster than it has been understood.
Its meaning varies with context: some treat it as an engineering approach to implementing agents, others as a kind of AI runtime, and still others as merely an extension of prompt engineering.
An issue that hasn't been clearly explained: AI can do it, but not stably
In the past two years, the progress of AI has been almost entirely driven by model capabilities.
Stronger reasoning, longer context, and more complex multi-step execution have pushed agents close to "usable." Yet in real-world systems, a problem keeps recurring: tasks succeed once but are hard to reproduce; results come close to correct but drift under boundary conditions; and even as AI capabilities grow, execution becomes less predictable.
In these cases, the problem often does not come from the model itself but from insufficient constraints in the system. A new consensus has thus begun to form: the model determines the ceiling of capability, while the system determines whether results can be reproduced.
It is also in this context that a term originally used in engineering contexts has begun to be repeatedly mentioned:
Harness.
Some have succeeded before the discussion even started
Before the term "Harness" was coined, there was no unified concept for this layer of capability. It appeared in different forms: coding agents, deep research agents, multi-agent orchestrators. In engineering practice these capabilities operate independently, but once they enter complex task scenarios, problems emerge: execution paths become uncontrollable, results are hard to reproduce, and the system fails to converge.
This is less a limitation of model capability than a consequence of missing system constraints.
According to 36Kr, an AI systems team spanning China and the United States has long focused on turning model capabilities into stable, executable systems. Its core members come from MIT, CMU, and Meta's large-model team. The founder, Luke Wang, conducted NLP research at the MIT Media Lab under the guidance of Twitter's then Chief Data Scientist, and his work has long centered on combining language models with the system execution layer. The MIT Media Lab is one of the most influential interdisciplinary laboratories in the world and has long explored new computing paradigms: the Scratch programming-education tool, the field of affective computing, and the development path of wearable devices all originated there. Through continuous engineering practice, Luke Wang gradually reached a conclusion:
The problem with AI is no longer about capabilities, but about convergence.
Based on this judgment, the team's approach changed: instead of continuing to optimize the model or the prompts, they work backward from the system layer:
How to structurally constrain AI behavior.
Rather than "researching Harness," it is more accurate to say the problem forced them here, step by step, through repeated failures.
An internal project achieved unexpected results
About a year ago, the team began confronting a recurring problem head-on through an internal project (Mynora.ai): agent execution on complex tasks cannot converge stably.
They built a smart-contract Coding Agent focused on code security and system stability, and tested the problem in the hardest scenarios: complex tasks, long-chain execution, high-risk environments. The goal was singular: verify whether a system can "constrain AI" rather than merely "guide AI".
Within a month of launch, the project spread quickly through the North American developer community and claimed a position in a very specific scenario: at the ETHGlobal New York hackathon, nearly 50% of the teams chose it for smart contract development.
In the same period, it topped Product Hunt's weekly list in October 2025 and ranked second on the monthly developer-tools list. Its stability at the underlying code-execution layer, especially for systems languages such as Rust, already exceeded that of comparable products at the time, including Cursor.
This shows that it has begun to be used as the default tool in high - constraint scenarios.
Stability is "grown"
For such systems, "having done" and "having understood" are often two completely different things.
Over nearly a year of practice and continuous iteration, Luke's team gradually accumulated a body of engineering experience in stably converging agent behavior. This is not a capability that can be designed up front; it comes mostly from repeated failures, corrections, and long-term observation of system behavior.
In the process, the team's focus also shifted. They are no longer optimizing a particular type of task but continually probing the system's stability boundaries across scenarios. From software execution to more complex interactions and device environments, the forms of constraints, the execution paths, and the feedback mechanisms are being rewritten layer by layer. Through such repeated trial and error, a consensus gradually became clear:
Stability is not designed; it is "grown" through continuous trial and error.
This kind of capability is hard to achieve through a single design; it is more like a system capability that forms gradually. Harness engineering is therefore in a peculiar state: on one hand, progress is extremely fast; on the other, it is always accompanied by uncertainty.
In Luke's words: This direction is both disturbing and fascinating.
Why do rules fail?
The first reaction of many teams is to add rules: through system prompts, instructions, or constraint documents, they try to guide the agent's behavior onto the expected track.
However, they soon find that the rules are systematically violated. The reason: rules are "understood" rather than "executed".
In a probabilistic system, an agent can understand a rule, repeat it, even know "what it should do", yet none of that means it will stably execute that way.
This also means that the rules themselves cannot constitute real constraints.
From "rules" to "environment"
The key transformation of Harness lies in: Instead of making AI remember the rules, make the wrong paths impossible to occur.
In continuous engineering practice, one change became obvious: the team no longer tries to make the agent remember rules; instead, through system design, certain wrong paths become structurally impossible. The form of constraints began to change:
· From text to system
· From instructions to environment
· From "prohibition" to "impossible to occur"
Under this framework, constraints no longer rely on understanding; they are embodied directly in the execution structure. Where a prompt tells an agent what not to do, a Harness makes it impossible to do.
As a result, some systems exhibit agents that seem "smarter". At the engineering level, the more accurate explanation is that the system's execution environment has become more deterministic.
A new system layer is emerging
Harness is not a concept that appeared out of nowhere. It is more like a recombination of several mature engineering systems after the arrival of LLMs: sandboxes, runtime control, type systems, distributed constraints, tool API design...
These capabilities have always existed, scattered across different fields. Only with the emergence of agents did they begin to point at the same problem. In traditional software, behavior is deterministic; in agent systems, behavior is probabilistic.
When behavior becomes a probability distribution, a trend has gradually emerged in engineering practice: constraints must move into the structural layer.
Harness is essentially this layer of structure.
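One way a type system enters this layer: the model's output is an arbitrary string, but a parsing boundary can force it back into a closed, typed set of actions before anything executes. This is a minimal sketch under assumed names (`Action`, `parse_tool_call`); real harnesses use richer schemas, but the principle is the same.

```python
import json
from enum import Enum

class Action(Enum):
    # The closed set of behaviors the harness permits. Anything else
    # the model emits simply fails to parse into an Action.
    READ = "read"
    WRITE = "write"
    LIST = "list"

def parse_tool_call(raw: str) -> tuple:
    """Turn a model-emitted JSON tool call into a typed action.

    The model's output is a probability distribution over strings;
    this boundary converts it into a deterministic, closed type
    before the runtime ever sees it.
    """
    payload = json.loads(raw)
    action = Action(payload["action"])  # raises ValueError for unknown actions
    path = str(payload["path"])
    return (action, path)
```

An unknown action like `"delete"` is rejected at the parse boundary, not by a rule the agent is asked to remember.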
Why now?
The concentrated discussion of Harness is no accident. It reads like a stage signal: the bottleneck of AI is shifting from the model to the system. The question used to be whether the model could do it; now it is whether the system can make it do so reliably, every time. In this process, value has begun to shift:
· From the model layer to the system layer
· From capability competition to stability competition
Model capabilities are gradually converging; system capabilities are beginning to diverge. Harness is therefore seen as the watershed of this change.
A system grown from failures
These understandings do not come from theoretical deductions, but more from repeated engineering practices.
A typical example: even when it is explicitly specified that Python must be executed via uv run, the agent may still bypass the constraint through python3, subprocess, or PATH mechanisms.
Through this process, the team came to realize that written rules are not the same as constraints in the system. So the constraints were "solidified" layer by layer: from text prompts to execution interception to runtime control, the system moved from "suggestions" to "structures", until, at a certain stage, errors were no longer prohibited but impossible.
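The "execution interception" step above can be sketched as a gate that inspects each shell command before the runtime executes it. This is an illustrative minimal check, not the team's actual implementation: it only catches a top-level interpreter call, whereas a real harness would also have to cover subprocess spawns and PATH resolution, as the uv run example shows.

```python
import shlex

# Interpreter binaries the harness refuses to launch directly; the
# sanctioned entry point is "uv run". The exact set is illustrative.
BLOCKED_BINARIES = {"python", "python2", "python3"}

def check_command(cmd: str) -> str:
    """Gate a shell command at the execution layer.

    A written rule ("always use uv run") can be ignored by the agent;
    this check runs before execution, so a bare python3 invocation
    never reaches the shell regardless of what the model intended.
    """
    tokens = shlex.split(cmd)
    if not tokens:
        raise ValueError("empty command")
    binary = tokens[0].rsplit("/", 1)[-1]  # strip any path prefix
    if binary in BLOCKED_BINARIES:
        raise PermissionError(f"direct interpreter call blocked: {binary}")
    return cmd
```

The point of the sketch is where the check lives: not in the prompt, but between the agent's decision and its effect.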
From "can do" to "can always do"
As agents enter more real-world environments, the question changes. The focus is no longer "what else can it do" but "can it always do it". As task chains grow longer, environments grow more complex, and run times extend, the model is no longer the bottleneck; the system is.
The real challenge has also shifted to: how to continuously complete correct execution without interruption.
Conclusion: The watershed has emerged
If the core of AI over the past two years was capability, the coming watershed no longer belongs to the model. What really makes the difference is who can build a system that lets AI execute stably.
And Harness is becoming the name of this layer.