Why is Harness the next battleground for AI?

Deep Flow Research Institute, 2026-05-27 08:14
The Harness is the systems engineering that turns the "engine" into a "complete vehicle."

"Agents aren't hard; the Harness is hard."

In February 2026, Ryan Lopopolo, an engineer at OpenAI, used this sentence to sum up the project he had just completed, and at the time few people grasped what he meant. Leading a team of fewer than 10 people, he spent five months getting Codex to write over one million lines of code without typing a single line by hand. He calls the system that lets the model work reliably "Harness Engineering."

According to public information, Codex had around 1.6 million weekly active users at the beginning of March; by May, the figure had exceeded 4 million.

Beyond model upgrades, Codex's Harness capabilities have also won it many users. For example, some developers' tests found that Claude Code consumes about 3 to 4 times as many tokens as Codex on the same task. The gap is not entirely due to the model; it is also related to Harness design: Codex tends to break a task into subtasks and run them in parallel, each with its own independent context so that they do not interfere with one another.
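The decomposition pattern described above can be sketched in a few lines. This is only an illustration of "parallel subtasks, each with an isolated context"; it is not Codex's actual implementation, and `run_subtask` and `run_parallel` are hypothetical names with a placeholder where a model call would go.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtask(subtask: str) -> dict:
    """Run one subtask with its own fresh context (hypothetical stand-in)."""
    # Each subtask builds a new context list; nothing is shared across subtasks.
    context = [f"task: {subtask}"]
    # Placeholder for a model call; a real harness would send `context` to a model.
    context.append(f"result: done {subtask}")
    return {"subtask": subtask, "context": context}


def run_parallel(subtasks: list[str]) -> list[dict]:
    """Run subtasks concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_subtask, subtasks))


if __name__ == "__main__":
    for outcome in run_parallel(["parse config", "write tests"]):
        print(outcome["subtask"], len(outcome["context"]))
```

Because each subtask carries only its own context, the tokens spent on one subtask do not grow with the history of the others, which is one plausible reason such a design would consume fewer tokens.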

Nowadays, the AI community has widely recognized the formula "Agent = Model + Harness." If an Agent is a car, the large model is the engine that provides power. Without the engine, nothing can be done. But if a bare engine is placed on the ground, you can't drive it on the road. The Harness is the system engineering that turns the "engine" into a "complete vehicle."

Just this month, DeepSeek posted two job openings: Harness Product Manager and Harness R&D Engineer. Chen Deli, a senior researcher at DeepSeek, said on social media that the hires are meant to form a Harness team, and the direction is to "benchmark against Claude Code and develop DeepSeek Code Harness." This company, known for its breakthroughs in the model layer, is also placing its next bet on the Harness.

In the past few years, model capabilities were a scarce resource. But as model capabilities become infrastructure, a lead is getting harder to maintain. The shelf life of the most powerful models keeps getting shorter, and the Harness layer outside the model is becoming increasingly important.

Model capabilities are still fundamental, but the Harness has become the key battleground in AI competition.

I. Harness Shuffles the Three-Layer Structure of the Industry

That the Harness has begun to optimize the model in reverse is only an early sign of a reshuffle across the whole AI industry structure.

In the past few years, the AI industry was assumed by default to have a three-layer structure: the infrastructure layer, the model layer, and the application layer. Each layer had its own responsibilities, and the distribution of value was relatively clear. But now, the Harness is starting to change how this pie is divided.

Model companies are the first to feel that part of their "monetization rights" has been taken away.

In the past, model companies both trained models and decided how the models were used. They sold APIs and Playgrounds, and the monetization of model capabilities was entirely in their own hands. If the model was strong, it could be sold at a high price, and the logic was simple.

With the emergence of the Harness, this logic has loosened. Before DeepSeek officially decided to enter the Harness field, a "DeepSeek version of Claude Code" (named "DeepSeek-TUI") was already very popular in the developer community, and it currently has over 30,000 stars. The reason is that the same version of DeepSeek performs better when running in a carefully tuned coding Harness, while its capabilities are greatly reduced when running in a rough "shell."

The model itself doesn't change, but the Harness affects how much of the model's capability is actually realized. If the capabilities painstakingly trained by model companies are handed over to others' Harnesses, the final pricing power may end up in others' hands. The model company becomes something like a supplier: it earns one less layer of profit, and the quality of its product is decided by the distribution channel.

The changes in the application layer are more subtle and gradual. In the past, the moat of many application companies was their understanding of the business. This "understanding" was hidden in the judgments of product managers, in the interaction details polished over the years, and in the continuously iterated functional logic. But now, these things are starting to move into the Harness. For example, SaaS giant Salesforce has codified the standard actions of sales lead tracking, and Claude Code has embedded the standard process of code review. What used to be understood and accumulated by people now lives in the Harness layer.

In May this year, the veteran customer service SaaS company Intercom even changed its name to Fin, replacing its 15-year-old brand with the name of its own AI Agent product, and started to rebuild around the Harness. Application companies that haven't started to pay attention to the Harness may find, when they look back in a few years, that their business moats have been quietly hollowed out. Once business understanding is codified into executable Agent actions by the Harness, the ownership of this understanding follows the Harness, not the people.

Going further up, the infrastructure layer can't stay out of it either, because demand in the computing power market will be redefined in reverse.

In the past, the product planning of companies like NVIDIA was largely driven by large-scale, stable-load model training. But as the Harness spreads, Agent inference is becoming the new dominant force in the computing power market. Agent workloads involve long call chains, repeated model calls, tool use, and retained memory. Their inference loads are long-running and unpredictable, which requires different scheduling methods, memory architectures, and network topologies. NVIDIA's Vera Rubin platform, released in 2026, is specifically built for the era of intelligent agents and large-scale inference. The Harness is starting to influence the next-generation product form of the chip layer in reverse.

Taken together, these changes mean that the distribution of value across every layer of the AI industry chain needs to be renegotiated.

II. Harness Naturally Grows in Scenarios

There is also a differentiation occurring within the Harness itself. The root of this differentiation lies in a fundamental characteristic of the Harness.

Ryan Lopopolo's team initially thought that they just needed to connect the model to the Harness, but they later found that the Harness is not a plug-and-play plugin. The Harness is not designed once and then left alone; it must be refined through failures in real scenarios. Without real scenarios to correct it, the Harness will become rigid.

This is why the Harness naturally grows in scenarios. And since the business scenarios of different companies vary greatly, the Harness will also differentiate.

The code scenario is the first to be verified and the fastest to make the Harness work. Every trajectory that the Harness runs in the code scenario comes with a feedback signal, and the model can learn from it. This is why Anthropic and OpenAI both chose the code scenario for their first Harness battle.

But the world outside of code has no compiler and is far more complex. In non-code scenarios such as customer service responses, after-sales service, and risk control judgments, there is no automated objective standard that can instantly determine right or wrong. Without a natural validator, feedback signals rely either on manual annotation and review, which is costly and slow to iterate, or on real business results, which requires being close enough to the business and running for long enough. The players who do well in the Harness will be those closest to real business feedback.

In the long run, the model will definitely become stronger. The problems that the Harness currently faces, such as failure retry and context truncation, which require special engineering to handle, may be solved by the model itself in the future. But the part of the Harness that grows in real business scenarios and is refined through real failures cannot be replaced by even the strongest model. The strengthening of the model will eliminate the engineering layer of the Harness, but it can't eliminate the scenario layer of the Harness.
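The "engineering layer" mentioned above, failure retry and context truncation, can be illustrated with a minimal sketch. The function names, the character-based budget, and the retry policy are assumptions for illustration only, not any vendor's actual Harness code.

```python
import time
from typing import Callable


def truncate_context(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose total length fits the budget."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest and stop once the budget would be exceeded.
    for msg in reversed(messages):
        if used + len(msg) > budget:
            break
        kept.append(msg)
        used += len(msg)
    return list(reversed(kept))


def call_with_retry(step: Callable[[], str], max_attempts: int = 3,
                    delay: float = 0.0) -> str:
    """Call `step` until it succeeds or the attempt limit is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as err:  # a real harness would catch narrower errors
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay)
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error
```

Logic of this kind is generic, which is why the article expects stronger models to absorb it; the scenario-specific feedback is what such a sketch cannot capture.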

Players with real business feedback have started to show advantages in the Harness.

For example, SaaS giant Salesforce has decades of accumulated customer behavior data, sales funnel feedback, and service ticket records in the CRM scenario. Data from the latest fiscal year shows that the company's Agentforce is already charging by "Agent conversations." Its ARR has reached 800 million US dollars, with an annual growth rate of 169% and a cumulative total of over 29,000 transactions, which means it has achieved commercial monetization.

Tencent WorkBuddy, currently the Agent with the highest daily active users in China, is also an early bettor on the Harness. It took only a week from the team's decision to adopt the claw mode to the full-scale launch. It could move so fast because WorkBuddy's Harness was built within Tencent early on. Before facing the market, WorkBuddy was used by over 2,000 internal employees. Employees entrusted it with daily work such as meeting minutes, cross-departmental collaboration, email drafting, and document generation, and every use and piece of feedback was fed back into the Harness, making it better.

However, this doesn't mean that each company is defining and manufacturing completely isolated Agent products that can only do one thing. In future AI competition, when the model needs to enter the deep waters of different industries' businesses, it must be put into different Harnesses for refinement.

These refinements not only represent differences in Agent route selection but also the reshaping of each enterprise's moat. Different scenarios such as code, collaborative office, and e-commerce transactions give rise to completely different Harnesses. Since the feedback signals in non-code scenarios are extremely difficult to replicate across industries, a Harness refined in one scenario cannot be directly applied to another scenario. Therefore, players with unique business closed loops will build barriers in their own fields, and it's difficult for outsiders to break this lead by simply stacking computing power or model scale.

III. The Dispute between Standardization and the Agent Ecosystem

When Agents are refined in different Harnesses and develop different rules and styles of behavior, they ultimately need to "communicate with each other."

If each company uses private protocols and private calling methods, the entire Agent ecosystem will fall into the kind of chaos seen when software could not interoperate in the PC era and when browsers implemented HTML differently in the Internet era. Therefore, Agent competition will inevitably rise from scenario-layer engineering to the level of protocols and standards, which is the fundamental contest over large-scale interconnection of Agents.

The standardization competition of Agents has already begun. Anthropic launched the MCP (Model Context Protocol) at the end of 2024, abstracting how the model accesses tools and obtains context into an industry protocol; Google launched the A2A (Agent2Agent) protocol in April 2025, enabling multiple Agents to collaborate across vendors.

After all, when Agents start to interconnect on a large scale, the protocol network formed by early entrants will become an entry barrier for later entrants. Whoever spreads the protocol first, connects the ecosystem, and retains developers will obtain a platform position similar to Android or iOS at this level.

In China, Tencent, Alibaba, and ByteDance are all following up to avoid falling behind the de facto standards. Tencent Cloud's intelligent agent development platform fully supports MCP and has launched an MCP plugin marketplace; Alibaba's Bailian platform has integrated MCP; ByteDance's Trae and Coze are also fully compatible with MCP.

Protocol standardization does far more than solve the problem of interconnection itself. Protocols also determine whether users can use Agents safely and with confidence, and ultimately whether large-scale commercial deployment can be achieved.

When an Agent can place orders, make payments, and sign contracts on your behalf, how can the risks in the process be controlled? In May this year, the China Academy of Information and Communications Technology, in collaboration with Tencent, Huawei, ZTE, the three major operators, and the Hong Kong University of Science and Technology, Shenzhen, jointly released the ATH protocol, which starts to address these issues. The core idea of this protocol is to determine the permission boundaries through a three-way handshake between the user, the Agent, and the service. The effective permissions are the intersection of what the three parties grant, and if any party is absent, the request does not pass.
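The intersection rule can be made concrete with a small sketch. It illustrates only the idea as described in this article; it is not the ATH protocol's actual message format or API, and the function name and permission strings are hypothetical.

```python
from typing import Iterable, Optional


def effective_permissions(user: Optional[Iterable[str]],
                          agent: Optional[Iterable[str]],
                          service: Optional[Iterable[str]]) -> frozenset:
    """Return the permissions all three parties agree on.

    A party that is absent is represented by None (an assumption of this
    sketch); in that case nothing is permitted.
    """
    grants = (user, agent, service)
    if any(grant is None for grant in grants):
        return frozenset()
    # The effective permission set is the intersection of the three grants.
    return frozenset(user) & frozenset(agent) & frozenset(service)


if __name__ == "__main__":
    print(sorted(effective_permissions(
        {"read_orders", "pay"},       # what the user authorizes
        {"read_orders", "pay", "sign"},  # what the Agent requests
        {"read_orders"},              # what the service offers
    )))
```

Taking the intersection means no single party can widen the Agent's authority on its own: a permission must be granted by the user, requested by the Agent, and offered by the service before it takes effect.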

Alongside the protocol contest, Agent collaboration infrastructure is also being built.

When ten Agents need to collaborate, standard protocols alone are not enough. Scheduling between multiple Agents, shared memory, permission boundaries, context routing, and security sandboxes cannot be completely solved at the protocol level and require a set of underlying infrastructure to support them.

There is currently no consensus on what this layer of infrastructure will ultimately look like. One possibility is that it will be further integrated into existing terminals, which would first obtain screen, system computing power, and hardware permissions, and then call the Agents; another possibility is that an independent intelligent agent ecosystem will evolve, similar to Windows in the PC era or Android in the mobile era.

A third path is to grow within an existing super-ecosystem, which is also where WeChat Agent currently has the most room for imagination. Tencent executives have mentioned the direction of WeChat Agent on multiple public occasions. Although there is no official product form yet, with 1.4 billion WeChat users, 4.5 million mini-programs, and business scenarios ranging from payment to government affairs, WeChat is already a ready-made Agent collaboration network. The Agent doesn't need to "set up a new venue"; it can plug into real business that is already running and simply follow this network.

Protocols define how Agents communicate with each other, and infrastructure is responsible for enabling Agents to run stably. It can be seen that the leading AI companies are currently considering both of these things to prepare for seizing the competitive advantage in the Agent era.

Conclusion

In the past, when evaluating the competitiveness of an AI company, people were used to looking at how strong its model was, how high its score on the leaderboard was, and how much money it burned. But these questions can only tell you whether there is an "engine" and "how well the engine is made."

However, the industry has now realized that this set of evaluation methods is neither comprehensive nor practical enough. For a complete vehicle to run on the road, it also needs a safe and usable "vehicle system." That model-layer companies like OpenAI and DeepSeek are also building up the Harness capabilities required by Agents reveals a new way to evaluate AI competition: Can the Harness optimize the company's own model in reverse? Is there feedback from real business scenarios? Can the company gain a position in the Agent standardization contest? Is there a foundation to support the collaboration of multiple Agents?

The model is still the foundation, but as the influence of the Harness expands, each AI company not only needs to answer "how strong my model is" but also figure out where it wants to stand in the new AI landscape stirred up by the Harness.

The AI era is changing rapidly, and the Harness may just be a beginning. In a few years, it may have a new name, and its specific form may also evolve. But between the model and the scenario, there always needs to be an intermediate layer that connects the model, embeds in the business, and accumulates feedback.

What this layer accumulates is not only engineering capabilities but also business understanding, feedback data, and the daily mutual calibration between a company and real users. At present, there is no shortcut for this; it can only be accumulated over a long enough time and refined in a large enough real business.

This article is from the WeChat official account "Deep Flow Research Institute", author: Xiao Ying, published by 36Kr with authorization.