
HSG Converses with the Founder of LangChain: In 2026, AI Will Bid Farewell to Dialog Boxes and Usher in the First Year of Long-Horizon Agents

海外独角兽 · 2026-01-28 09:00
The coding agent may be the prototype of all AI employees.

The article "Sequoia Capital in 2026: This is AGI" asserts that AGI is the ability to "Figure things out".

If the past era of AI was the era of "Talkers", then 2026 will mark the beginning of the era of "Doers". The core carriers of this transformation are Long Horizon Agents. These agents are no longer limited to immediate, in-context responses; instead, they show expert-level traits such as autonomous planning, long-running operation, and goal orientation. From coding to Excel automation, agent capabilities that first emerged in specific vertical domains are now spreading to all kinds of complex task flows.

As the founder of LangChain, Harrison Chase has been at the forefront of this transformation. This article compiles his recent podcast interview with Sonya Huang and Pat Grady of Sequoia Capital. As a pioneer of agent infrastructure, Harrison explains why agents are approaching their "third inflection point" of explosive growth.

Key insights:

The value of Long Horizon Agents lies in providing high-quality first drafts for complex tasks.

For agents to break through, an opinionated software "harness" built around the model is required. File system access will become a standard feature of all agents.

A general-purpose agent might just be a coding agent.

Traces are becoming the new "Source of Truth".

Compared to general-purpose models, an agent that has been refined over a long period and has internalized specific task patterns and background memory will build a strong moat.

The ideal agent interaction combines asynchronous management and synchronous collaboration.

01.

The Explosion of Long-Horizon Agents

Sonya Huang: What's your view on Long Horizon Agents? Regarding Sequoia's latest article, which points do you agree with, and which do you disagree with?

Harrison Chase: I agree that they are finally starting to work effectively. The core concept of an agent has always been to run an LLM in a loop and allow it to make autonomous decisions. AutoGPT is a prime example. It captured people's imaginations because the LLM in the loop could autonomously decide what to do next.

The problem was that the models at that time were not good enough, and the surrounding scaffolding and harness were also subpar. Now that the models have improved, and we've learned what a good harness looks like over the past few years, they're starting to work. We first saw this in the coding field, which was the quickest to take off and is now spreading to other areas.

AutoGPT is an open-source autonomous AI agent framework that gained significant popularity in 2023 (the earliest classic implementation of "letting GPT think, plan, and execute on its own"). It accomplishes complex multi-step tasks by having the LLM repeatedly self-prompt (in a loop of think → plan → act → observe).
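To make the "LLM in a loop" idea concrete, here is a minimal Python sketch of the think → plan → act → observe cycle. It is illustrative only: `call_llm`, the action format, and the tool registry are assumptions, not AutoGPT's or LangChain's actual API.

```python
# Minimal sketch of the "LLM in a loop" pattern (think → plan → act → observe).
# `call_llm`, the action format, and the tool registry are illustrative assumptions.

def run_agent(goal, call_llm, tools, max_steps=25):
    """Let the model decide the next action each turn until it declares it is done."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        # Think/plan: the model sees the whole history and picks the next action.
        action = call_llm(history, tool_schemas=list(tools))
        if action["type"] == "final_answer":
            return action["content"]
        # Act, then observe: run the chosen tool and feed the result back in.
        observation = tools[action["tool"]](**action["args"])
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "tool", "content": str(observation)})
    return "Stopped after reaching the step limit."
```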

Scaffolding refers to the auxiliary code structure or framework built around the language model. It is used to guide the model's output, manage processes, or handle input and output but lacks complex autonomous planning capabilities.

A harness is a software environment that wraps the model, manages context, handles file I/O, and executes tool calls. It usually includes preset planning tools, environment interaction capabilities, and best practices, aiming to enable the model to execute complex tasks more stably and efficiently.

Although you still need to give instructions to the agent and provide appropriate tools, it can operate for longer periods. So, the term "Long Horizon" is very fitting.

Sonya Huang: What are your favorite examples of Long Horizon Agents?

Harrison Chase: Coding has by far the most examples, and it's also where I use them the most. Besides that, AI SREs are a great example. For instance, Traversal, a Sequoia portfolio company, has an AI SRE that can handle long-running tasks and conduct in-depth log analysis. Research is also a good scenario because the end result is often a first draft.

AI SREs, or AI Site Reliability Engineers, are agents that use artificial intelligence to automatically monitor, diagnose, and fix software system failures. They can handle tasks such as log analysis and system maintenance.

Traversal is a startup focused on building AI SREs, aiming to use AI to autonomously solve complex software engineering and operational problems.

The issue with agents is that they can't achieve 99.9% reliability, but they can do a large amount of work and operate over longer time spans. Scenarios that require long-term operation and the production of a first draft for a task are the killer applications for Long Horizon Agents.

Coding is a typical example. You usually submit a PR (Pull Request) instead of directly pushing to the production environment, unless you're in a "vibe coding" situation. This aspect is getting better. The same goes for AI SREs; the results are usually submitted for human review. The same is true for generating reports; no one sends them directly to all their followers without first reviewing and making changes.

We've seen many such applications in the financial sector, where there are huge potential opportunities. Previously, agents only provided first-line responses. Now, there are new cases like Klarna, which takes a human-machine collaboration approach. When the first-line AI can't handle a situation and needs to transfer to a human, instead of simply dumping the problem, a Long Horizon Agent running in the background generates a summary report of the situation before handing it over to the human.

Klarna is a payment company that offers "buy now, pay later" services.

So, the core use cases revolve around the concept of first drafts.

02.

From General Frameworks to Harness Architectures

Sonya Huang: Regarding the question of "Why now", to what extent is it because the models themselves have become powerful enough, and how much is due to the clever engineering design of the harness? Before delving deeper, could you briefly define the differences between a harness and a model in your view, as well as the specific components of an agent?

Harrison Chase: Sure, I need to introduce the concept of a "framework" first. In the early days, we defined LangChain as an agent framework. But as we entered the era of "Deep Agents", I prefer to call it an agent harness.

Deep Agents are the next-generation autonomous agent architecture launched by LangChain. Based on LangGraph, they have built-in planning, file system, and sub-agent generation capabilities.

LangGraph is a low-level, controllable graph-based workflow framework developed by the LangChain team.

People often ask about the differences among the three:

Model: Obviously, it refers to LLMs, which take tokens as input and output tokens.

Framework: It is an abstraction layer built around the model, making it easy to switch models, add tools, vector stores, and memory. It is unopinionated, and its value lies in abstraction.

Harness: It is more like a ready-to-use solution. When talking about Deep Agents, the harness comes with a built-in planning tool by default. It is highly opinionated, believing that this is the right way to do things.
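The framework/harness distinction can be made concrete with a small sketch. The names below are hypothetical, not LangChain's API; the point is that a harness ships opinionated defaults (planning tool, file system access, a compression policy, a long system prompt) and only asks you for the task-specific pieces, whereas a framework would leave every one of those choices to you.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a harness is "an agent loop plus opinionated defaults".
# A framework would leave all of these choices open; the harness picks them for you.

@dataclass
class HarnessConfig:
    planning_tool_enabled: bool = True            # built-in plan/todo tool by default
    file_system_root: str = "./workspace"         # agents read and write files by default
    compress_above_tokens: int = 150_000          # compress context past this budget
    system_prompt: str = "Plan first, write intermediate results to files, then act."
    extra_tools: list = field(default_factory=list)  # the task-specific part you supply

# Using the harness means accepting its opinions and only filling in the task bits.
config = HarnessConfig(extra_tools=["run_tests", "open_pr"])
```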

The harness also has to handle compression. Long Horizon Agents run for a long time; although context windows have grown, they are still limited, so at some point the context must be compressed. The question is how, and there is a lot of cutting-edge research in this area. Another key set of capabilities we provide to agents is file system interaction, whether through direct reads and writes or through Bash scripts.
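One simple way such compression can work, sketched in Python: when the history grows past a budget, fold the oldest messages into a summary and keep the recent turns verbatim. `summarize` stands in for another model call; real harnesses use more elaborate policies.

```python
def compress_history(history, summarize, max_chars=100_000, keep_recent=20):
    """Fold older messages into a running summary; keep the recent turns verbatim.

    `history` is a list of {"role": ..., "content": ...} dicts, and `summarize`
    is assumed to be a function (e.g. another LLM call) that turns a list of
    messages into a short text summary.
    """
    total = sum(len(m["content"]) for m in history)
    if total <= max_chars or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarize(old)  # e.g. "Decisions so far: ..., files touched: ..."
    return [{"role": "system", "content": f"Summary of earlier work:\n{summary}"}] + recent
```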

It's actually difficult to attribute the success solely to the harness or the model because today's models are also trained on a large amount of such data (code, CLI). It's a co-evolution. If we went back two years, I don't think we could have predicted that a file system-based harness would be the ultimate solution because the models at that time weren't sufficiently trained for these scenarios.

So, it's a combination of multiple factors: the models have indeed become stronger, especially the reasoning models; at the same time, we've figured out a series of primitives related to compression, planning, and file system tools. It's the combination of these two that has brought about the breakthrough.

Sonya Huang: I remember in our first podcast, you described LangGraph as the cognitive architecture of an agent. Is this the right way to understand a harness?

Harrison Chase: Exactly. We build Deep Agents on top of LangGraph. It's a specific instance of LangGraph, but it's very opinionated and more general-purpose.

Early on, we discussed general vs. specialized architectures. Now, the trend is that the specificities that were previously written into LangGraph to constrain the model are being transferred to tools and instructions. The complexity hasn't disappeared; it's just been translated into natural language. Therefore, prompting, editing prompts, and even automatically optimizing prompts have become core, while the structure of the harness itself remains relatively fixed.

Sonya Huang: What's the most difficult part to overcome at the harness level? Do you think individual companies can really build a moat in harness engineering? Who do you admire in this regard?

Harrison Chase: To be honest, the companies that are currently doing the best in harness engineering are all coding companies. This is where the technology is taking off. For example, Claude Code has become so popular largely because of its harness design.

Pat Grady: Does this mean that harnesses are better built by the base model providers themselves rather than third-party startups?

Harrison Chase: It's hard to say. Another company I want to mention is Factory, and also Amp. They are both coding companies and have very good harnesses.

Factory is a company focused on building full-stack AI software engineers. It can automatically complete the development of a full SaaS application from requirements to deployment, emphasizing the agent's autonomy and production-grade code quality. Its agent named Droid ranks first on the Terminal-Bench 2.0 leaderboard.

Amp Code is a company focused on the next-generation AI coding experience. It provides extremely intelligent code completion, editing, and generation capabilities and excels in code understanding and multi-file editing.

There are pros and cons to this. Some parts of the harness are deeply tied to the model, or rather to the model family. For example, Anthropic's models have been fine-tuned on certain specific tools, while OpenAI's models have been fine-tuned on others. Just as different models need different prompts, harnesses now need to be tuned for different model families. There are still commonalities, though, such as being built on the file system.

This is actually a very interesting phenomenon. Almost every AI coding company is now building its own harness. In current mainstream coding benchmarks, such as Terminal-Bench 2.0, the agents with harnesses are listed separately from the models. The performance of the same model can vary greatly. Claude Code isn't always at the top of the leaderboard. This shows that model providers aren't necessarily inherently better at this. As long as third-party developers have a deep understanding of the model principles, they can achieve significant performance improvements at the harness level.

https://www.tbench.ai/leaderboard/terminal-bench/2.0

Terminal-Bench 2.0 is currently the most rigorous open-source benchmark for AI agent terminal/command-line capabilities, maintained by the tbench.ai team. It includes 89 carefully selected real-world terminal tasks (involving code compilation, model training, server setup, and complex multi-step operations in fields such as biology, security, and gaming) and is specifically designed to rigorously evaluate the end-to-end problem-solving capabilities, tool usage, and long-term autonomy of AI agents in a real sandbox terminal environment.

Sonya Huang: What do you think are the keys to making a harness work efficiently? What are the top players on the leaderboard doing right?

Harrison Chase: First, it's important to have a deep understanding of the model's training data. OpenAI's models have been trained extensively on Bash command lines, while Anthropic seems to have specifically trained its models on file editing tools. Adapting to the model's characteristics is crucial.

Second, compression. When the task cycle lengthens and the context window is full, deciding what to keep and what to discard is a huge challenge and the core value of the harness. Third, the use of skills, MCPs, and sub-agents. Currently, the models themselves haven't internalized many sub-agent capabilities; it's mainly up to the harness to schedule them. For example, in our harness, when the main model calls a sub-agent, it needs to pass on complete information and instruct the sub-agent when to output the final result.
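As a sketch of that hand-off, assuming the `run_agent` loop shown earlier: the sub-agent is exposed to the main agent as a tool, it receives a self-contained brief (it does not share the parent's history), and it is explicitly told to return a complete report rather than pointing back at its own intermediate work. The names are hypothetical.

```python
def make_subagent_tool(call_llm, tools):
    """Wrap an isolated agent run as a tool the main agent can call."""
    def delegate(brief: str) -> str:
        # The sub-agent starts with a fresh history, so the brief must contain
        # everything it needs; it cannot "see above" into the parent's context.
        instructions = (
            f"{brief}\n\n"
            "When you are finished, reply with a complete, self-contained summary "
            "of what you found and did. The caller cannot see your intermediate steps."
        )
        return run_agent(instructions, call_llm, tools)
    return delegate
```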

We've seen some failed cases: the sub-agent does a lot of work but only replies with "see above", and the main model doesn't receive any useful information. Effective prompting to coordinate the components is crucial. If you look at current open-source harnesses, the system prompt can be hundreds of lines long, all to solve the coordination problem.

Pat Grady: Let's talk about the evolution path. You've been at the forefront of the infrastructure that enables models to be applied in the real world.

If we simplify the inflection points of the past five years, the first was pre-training (ChatGPT); the second was reasoning (OpenAI o1); and recently, with Claude Code and models like Opus 4.5, we've reached the third inflection point for Long Horizon Agents.

In the world you've built, are the inflection points different? From the cognitive architecture to the framework and then to the agent harness, what are the core transitions?

Harrison Chase: I think it can be divided into three eras:

The first stage was the early days when LangChain was just starting. The models were simple text-in/text-out, without even a chat mode, tool calling, or reasoning capabilities. All people could do was simple prompting or chaining calls.

The second stage was when model labs started introducing tool calling, trying to make the models think and plan. Although the results weren't as good as today, it was enough to make decisions. At this time, custom cognitive architectures became popular. You needed to explicitly write code to ask the model, "What should we do now?" and then follow the branches. It was more like building scaffolding around the model.

The third inflection point occurred around June or July 2025. There was a concentrated explosion of Claude Code, Deep Research, and Manus. The underlying architecture is the same: running an LLM in a loop. But they cleverly used context engineering, including everything related to compression, sub-agents, and skills. The core algorithm remained the same, but the context engineering changed. This made us realize that "this is different from before", and we started working on Deep Agents.
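For contrast with the loop shown earlier, here is a sketch of that second-stage style: the developer writes the branches explicitly, and the model is only asked to pick one. `classify` stands in for a constrained LLM call that returns one of the allowed labels; nothing here is a specific framework's API.

```python
def handle_request(request, classify, handlers, fallback):
    """Second-stage 'cognitive architecture': control flow lives in hand-written code."""
    # Ask the model "what should we do now?" but only let it choose among fixed branches.
    route = classify(request, options=list(handlers))
    handler = handlers.get(route, fallback)   # hard-coded fallback if the label is unknown
    return handler(request)

# In the third stage, this routing largely disappears: the loop decides what to do
# next on its own, and the structure moves into prompts and tools instead of branches.
```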

For the coding community, this feeling was especially strong with the release of Opus 4.5 or after people used Claude Code intensively during the winter break. There was a huge shift in December. People realized that when they threw a difficult problem at a Long Horizon Agent, it could actually solve it. At that moment, the models were good enough, and we officially entered the harness era from the scaffolding era.

03.

Is the Coding Agent the Endgame for General AI?

Pat Grady: How will things develop in the future?

Harrison Chase: I'm also curious. But I'm certain that the minimalist and general core concept of an agent, "running an LLM in a loop and letting it self-orchestrate, deciding what to bring into the context", has finally been realized.

In the future, there will be less manual scaffolding. Currently, operations like compression still rely heavily on the manual design of the harness author. Anthropic is trying to make the model decide when to compress autonomously. Although it's not widespread yet, this might be the direction.

Another important aspect is memory. In long-term tasks, memory is essentially long-term context engineering. The core algorithm is simple: run the LLM in a loop. The next point of competition will be around context engineering techniques, either by leaving the engineering to the LLM, as Anthropic is doing, or by introducing new types of context data. My biggest question now is that most of the successful harnesses are currently focused on coding. Even for non-programming tasks, you could argue that "writing code" is an excellent general-purpose approach.

Pat Grady: That's exactly what I want to ask. Are coding agents just a subcategory, or should all agents essentially be coding agents? After all, the job of an agent is to make the computer do work, and code is the best way to give it instructions.

Harrison Chase: That's a big question. I firmly believe that when building a Long Horizon Agent