Discuss the future of software engineering with Anthropic
God Translation Bureau is a compilation team under 36Kr that focuses on technology, business, the workplace, and everyday life, introducing new technologies, new ideas, and new trends from abroad.
Editor's note: When AI review is more reliable than human review, programmers are becoming "agent shepherds". The era of compounding returns in software development has arrived, and the moats of SaaS are collapsing. This article is a compilation.
Recently, I co-hosted a roundtable on the future of software engineering with Sivesh and Ash Prabaker from Anthropic. Also present were engineering leaders from Stripe, NVIDIA, Microsoft, Google DeepMind, xAI, Apple, and Scale AI, as well as the legendary Peter Steinberger of OpenClaw/OpenAI.
The Origin of Claude Code
The meeting began with a retelling of the origin story of Claude Code, most of which has been told in public interviews. The project started at the end of 2024 as a simple terminal UI, and it was very rough at first. The core development principle was to design for the capabilities the model was expected to have 6 to 12 months later, rather than be limited by its level at the time. Its popularity was spontaneous: it was a project driven by individual contributors (ICs), and it reached large-scale adoption through the value it demonstrated rather than through top-down mandates.
The Theory of Recursive Improvement
A main thread running through the discussion was "closed-loop" development. One participant described their company's process: bug reports are automatically triaged by agents, classified by severity, checked against the evaluation set (eval set), and a fix PR (pull request) is opened directly, with almost no human intervention. There was consensus in the room that this closed loop is the real source of compounding returns: better coding tools improve the model, and a better model in turn improves the coding tools. Several leaders noted that this positive feedback is why their companies are prioritizing coding tasks.
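The triage loop described above can be sketched as a small state machine. This is an illustrative sketch only, under assumed names (`triage_severity`, `handle`, the severity keywords) that do not correspond to any real company's pipeline; a real system would call a model for classification rather than match keywords.

```python
from dataclasses import dataclass

@dataclass
class BugReport:
    title: str
    log_excerpt: str

def triage_severity(report: BugReport) -> str:
    """Stand-in for an agent call that classifies severity (toy keyword match)."""
    keywords = {"data loss": "critical", "crash": "high", "typo": "low"}
    for kw, sev in keywords.items():
        if kw in report.log_excerpt.lower():
            return sev
    return "medium"

def handle(report: BugReport, eval_passes: bool) -> str:
    """Decide the next step for a triaged bug report."""
    sev = triage_severity(report)
    if sev == "critical":
        return "escalate-to-human"   # destructive paths keep human review
    if not eval_passes:
        return "needs-new-eval"      # no regression coverage for this area yet
    return "open-fix-pr"             # agent drafts the repair PR

print(handle(BugReport("crash on save", "Crash in save path"), eval_passes=True))  # → open-fix-pr
```

The point of the sketch is the gating order: severity first, eval coverage second, and only then an automated PR, which mirrors the "almost no human intervention, except where it matters" shape of the loop.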
The Transformation of Workflow
The participants compared notes on the changes underway in their respective engineering practices:
Test-first has become the default. Many said they now define test cases first and then let agents build around those cases; they see this as the only sane way to handle a large volume of PRs.
A two-tier evaluation system. One participant outlined their team's approach: regression evals, which must hold a 100% pass rate on every PR, and frontier evals for new capabilities. Others in the room endorsed this model.
Don't force adoption. This resonated strongly. One participant said they drive usage through competitions, hackathons, and informal rewards rather than top-down mandates. In his view, forced adoption breeds resistance, while letting employees see the results of early adopters spreads the tool naturally.
Code review is in flux. One participant admitted that developers at their company often click approve within minutes because the AI review layer does such a good job. Asked where this ultimately leads, they said frankly that mandatory human review will eventually become inefficient, and hinted that in some repositories (repos) they may already have crossed that threshold. The room agreed, with a trace of unease.
The return of comments. Several participants noted an interesting cultural reversal: at first, engineers hated the long comments agents generate, but the consensus has shifted to keeping them, because the next agent to touch the code finds them very useful. One summary: "We now write code for humans to read and for AI to read."
Life in the terminal. One participant described his personal workflow: make a plan, verify the plan, execute it through an agent, and move on to the next task, without reading the generated code line by line at any point. This sparked a debate about when that approach is safe and when it is dangerous.
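The two-tier evaluation scheme above is easy to express as a CI merge gate. The following is a minimal sketch under assumed names (`gate`, the blocking/non-blocking split); the participants did not describe an implementation, only the policy that regression evals must stay at 100% while frontier evals track progress on new capabilities.

```python
def gate(regression_results: list[bool], frontier_results: list[bool]) -> dict:
    """Merge gate: regression evals block the PR; frontier evals only inform."""
    regression_pass = all(regression_results)  # must stay at 100% on every PR
    frontier_rate = (sum(frontier_results) / len(frontier_results)
                     if frontier_results else 0.0)
    return {
        "merge_allowed": regression_pass,       # hard gate
        "frontier_pass_rate": frontier_rate,    # progress signal, not a gate
    }

result = gate([True] * 50, [True, False, True, False])
print(result)
```

The design choice worth noting is the asymmetry: a single regression failure blocks the merge, while frontier results are reported but never block, so new-capability work can land incrementally.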
Areas Still Requiring Rigorous Review
Not all code should be treated equally. The participants broadly agreed that any code involving destructive operations (such as data loss or privilege escalation) or core infrastructure deserves more intensive human review, while internal prototypes need not meet the same bar as public-facing code. Exactly where the line falls varies from company to company.
The Bottleneck: Long-Horizon Tasks
There was consensus that long-horizon tasks are the real frontier challenge. One participant noted that while product engineering has begun to grow exponentially, closing the loop on more complex scientific research workflows has not succeeded. The open questions everyone faces: what do you assign to an agent that runs for four or five hours? How do you observe it? How do you keep a human in the loop without staring at it constantly? No one has a good answer yet.
Infrastructure and Sandboxing
The discussion turned to the industry's shifting attitude toward sandboxing: initially toward sandboxes for security, then away from them for convenience, and now back with more nuance (remote coding agents, per-session sandboxes). Practical pain points raised included the compute cost of long-running sessions, permission management, and enterprise deployment.
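A per-session sandbox in the sense discussed above can be illustrated with a toy sketch: each agent session gets a throwaway working directory and a stripped-down environment, discarded when the session ends. This is an assumption-laden illustration, not anyone's production setup; real deployments would use containers or VMs rather than mere directory and environment isolation.

```python
import os
import shutil
import subprocess
import tempfile

def run_in_session_sandbox(cmd: list[str]) -> str:
    """Run one command in a throwaway per-session directory with a minimal env."""
    workdir = tempfile.mkdtemp(prefix="agent-session-")
    try:
        # Only PATH survives; HOME points into the sandbox (toy isolation only).
        minimal_env = {"PATH": os.environ.get("PATH", ""), "HOME": workdir}
        result = subprocess.run(cmd, cwd=workdir, env=minimal_env,
                                capture_output=True, text=True, timeout=60)
        return result.stdout.strip()
    finally:
        # Session over: the sandbox and anything written into it are discarded.
        shutil.rmtree(workdir, ignore_errors=True)

print(run_in_session_sandbox(["pwd"]))  # prints the throwaway session directory
```

The per-session lifetime is the point: nothing leaks between runs, which is also why the compute cost of long-running sessions came up as a pain point.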
Observability and On-Call
One participant described an early internal prototype: an agent with access to logs, source control, and the chat system that can handle incident triage and debugging. The system is not yet production-grade, but it has already reduced the on-call burden. Several participants noted an interesting side effect: engineers without an infrastructure background can now do infrastructure work because the agent fills their knowledge gaps.
Context Management
Someone asked how to manage context at scale when thousands of people are changing code every minute. The honest answer from the room: no one has figured this out. One participant admitted their approach is basically piecemeal: letting the agent read ephemeral chat records through MCP (Model Context Protocol) and relying on a strong writing culture, with no formal documented process.
The meeting mentioned a study showing that pre-loaded Markdown context files are sometimes less effective than letting the agent explore the codebase from first principles. The counterargument is that this may reflect stale context or AI-generated content. The consensus: context files written by humans help, while AI-generated or stale files can hurt. Humans must supply the core insights.
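One crude way to operationalize the "stale context" concern above is a freshness check: flag a context file whose last edit predates the newest change in the code it describes by more than some grace period. Everything here is an illustrative assumption (the threshold, the mtime heuristic); no such check was described at the meeting.

```python
import os

GRACE_SECONDS = 30 * 24 * 3600  # assumed 30-day grace period

def is_stale(context_mtime: float, newest_source_mtime: float) -> bool:
    """True if the context file lags the newest source change beyond the grace period."""
    return newest_source_mtime - context_mtime > GRACE_SECONDS

def newest_mtime(paths: list[str]) -> float:
    """Most recent modification time among the given source files."""
    return max(os.path.getmtime(p) for p in paths)
```

A check like this would only catch one failure mode (age); it says nothing about whether the content was human-written, which the room considered the more important signal.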
Talent Recruitment
On recruitment, the most striking view came from a participant who now values not raw engineering skill but the willingness to constantly experiment with the latest technology. Their best performers are those who deeply understand the model's limitations and know when to trust its output and when to intervene manually. Another participant noted that thanks to AI-assisted cross-domain collaboration, their core infrastructure team has stayed lean, with product engineers contributing beyond their traditional areas.
SaaS Under Pressure
This part of the discussion was lively. The participants shared the kinds of tools they have replaced internally:
Incident management - someone said their team dropped the vendor's tool because it was too complex for how they actually work.
Authentication layer - one participant claimed to have migrated their authentication system several times within six months, each migration taking hours rather than weeks.
Project tracking - someone is building a custom UI on top of coding agents to manage engineering tasks, and hinted that this entire category may be the next to go.
Internal micro-tools - short-link generators and similar utilities were cited by several people as "easy wins".
Everyone noticed a pattern: so far, the replaced tools are all developer tools, because engineers have both the authority and the speed to act in those areas. Business-facing software (such as CRM) is stickier. One view: incumbent business tools survive not because they are good but because no one has shipped a compelling AI-native alternative yet; so far there are only incremental plugins.
A counterargument from the room: the opportunity-cost logic ("we should focus on what we do best") may always hold, which means model labs may never prioritize building SaaS alternatives over improving the model.
Problems Caused by "Everything Is Possible"
A startup founder in the room offered the opposite view: because AI makes everything possible, setting priorities has become harder. Six months ago, rebuilding a tool in-house was obviously not worth it; now it takes one night. With so much that could be done, the team is overstretched. No one had a better answer than defining clear "lanes" and treating individuals as owners of "micro-companies" within the organization.
Code Quality
When someone asked about code quality standards, the answer was that the definition is changing. "Good code" used to be human-centered: simple, maintainable, easy to contribute to. Now it must also be readable to AI. The practical view in the room: strong regression evals and test-first discipline matter more than the aesthetics of "clean code".
Design Taste and Slop
The "purple gradient style" drew a burst of laughter - everyone recognized this signature AI-generated UI aesthetic. Someone pointed out a dilemma: if you update the model's taste preferences, everyone will use them, and the new aesthetic soon becomes the next generation of AI slop. Others noted that some models actively steer users toward specific frameworks, which amounts to a lock-in effect.
Convergence Risk
One participant worried that if everyone codes with the same model and accepts the same advice, the whole industry will converge on homogeneous tools and patterns. The counterargument: this risk was greater with earlier models, which were far stronger on popular web stacks than on legacy systems or niche languages - and that gap is now narrowing. Modernizing legacy code is considered an area of rapid progress.
Background Agents
There was broad agreement that the direction of travel is asynchronous background agents: running in remote sandboxes, monitorable from a phone, and able to run for hours or even days. Someone mentioned that multi-hour autonomous runs only recently became routine for them; before that, they were experimental.
Model vs. Framework (Harness)
Asked how much of the recent improvement comes from model weights and how much from the external harness, one participant's view was that both matter, but on different timescales: the big leaps come from model progress, and the harness's design philosophy should be "don't get in the model's way". They described a minimalist prototype - essentially the current generation of models with a system prompt and Bash access - whose performance was surprisingly good, something impossible with models a few generations ago.
Regulated Industries
Someone with a fintech background asked about deployment in regulated environments. The read in the room: the most successful AI startups in regulated industries (legal tech being the example) are still essentially human-in-the-loop document-dialogue products. No one has made the leap to autonomous agents in regulated workflows. The bar here is asymmetric - as with autonomous driving, AI must perform far better than humans to be accepted. Better interpretability and structured audit trails are seen as the keys to unlocking this space.
Multi - Agent Orchestration
The answer was surprisingly low-tech: Git worktrees and ten terminal tabs. Third parties are building more sophisticated orchestration tools, but no one present claimed to have solved the problem.
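The worktree pattern above is worth spelling out: each agent gets its own checkout and branch so parallel agents never collide on the same working directory. The sketch below only builds the `git worktree` commands (the branch and path naming scheme is an illustrative assumption); in practice each command would be run in its own terminal tab, one per agent.

```python
def worktree_cmd(agent_id: int, base_branch: str = "main") -> list[str]:
    """Build the command that gives agent N its own checkout and branch."""
    return ["git", "worktree", "add",
            "-b", f"agent/{agent_id}",      # new branch for this agent
            f"../agent-{agent_id}",         # sibling directory as its checkout
            base_branch]

def spawn_worktrees(n_agents: int) -> list[list[str]]:
    """One worktree command per agent; in a real repo each would be executed
    with subprocess.run(cmd, check=True), then an agent launched in each dir."""
    return [worktree_cmd(i) for i in range(n_agents)]

print(spawn_worktrees(2)[0])
```

The appeal of this pattern is that isolation comes from Git itself: merging the agents' work back is ordinary branch merging, with no orchestration layer to maintain.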
The Irony of Digital Transformation
Someone observed that getting engineers to adopt AI tools - finding champions, overcoming resistance, managing change - is exactly the digital transformation problem other industries have faced for years. Applied to the engineers who built the transformation tools themselves, the irony was lost on no one.
The Development Trajectory of Programming Languages
The final question: will agents start writing code closer to the metal, bypassing abstraction layers that exist for human convenience? The view was yes, eventually, but only where the model judges it beneficial for performance - not because low-level code is simpler for the model. Current models still benefit from well-structured, well-commented, human-readable code. Someone noted a trend of startups migrating to Rust, partly because AI has flattened the learning curve.
Key Points
The recursive loop is real. Better coding tools produce better models, which in turn produce better coding tools. Several participants said this is why their companies prioritize coding.
The bottleneck has shifted from writing code to managing long-horizon tasks and deploying agents in regulated environments.
Developer tools are being replaced first. Business - side software with network effects remains stable for now.
The role of humans is shifting from writing and reviewing to planning, evaluating, and guiding - and the best performers are those who stay at the frontier of the technology.
Enterprise adoption is limited by permission management, sandboxing, and regulatory caution, not by model capability.
When millions of people use the same model to make the same choices, the mediocrity and convergence of content are real concerns.
Context management remains unsolved. Human-written context helps, while stale or AI-generated context can hurt.
Translator: boxi.