
What kind of distributed infrastructure is needed in the Agent era?

Geekbang Technology InfoQ | 2026-05-10 10:53

The era of Agent applications is just around the corner.

Since the recent explosion of large-model technology, Agents have received extensive attention. Entering 2026, with the phenomenal popularity of OpenClaw, Agents have broken out of their niche and entered the broader public view. Meanwhile, whereas Agents were previously used mostly in demos or relatively customized scenarios, with the emergence and gradual maturing of technologies such as Agent Skills over the past year, today's Agents can handle far more real-world scenarios. The era of Agents as a mainstream application form may well be approaching.

The generational difference of Agent applications: non-determinism

Before Agent applications emerged, application-oriented computer programs, from the earliest stand-alone applications to today's widely used cloud-native microservices, were essentially developed by humans for specific scenarios. Their logic was written by hand and was strongly deterministic. In the Agent era, however, the concrete logic an Agent executes is no longer controlled by human programming; it is generated by a large model. Neither the business owners, nor the application developers and operators, nor even the developers of the Agent framework and of the large model itself can accurately predict the model's output. Agent execution is therefore fundamentally non-deterministic.

However, a large amount of existing infrastructure is still built for the deterministic applications of the cloud-native and earlier eras and cannot adequately meet the operational requirements of Agent applications. This is likely to become a major obstacle to Agents reaching true large-scale enterprise deployment. At the same time, it is a significant innovation opportunity for infrastructure engineers in the Agent era.

The unique operational characteristics and challenges brought by the non-determinism of Agents

High dynamics: Agent logic is fully dynamic and cannot be predicted in advance

Traditional applications are generally developed by humans for specific business scenarios, so in most cases they are static. As long as developers and operators understand the program's code logic well enough, they can essentially predict how the application will execute, and no matter when or where these programs run, their execution logic is essentially the same. Take cloud-native microservices as an example: every instance of a microservice processes each request with almost identical logic, and developers and operators know this well. By packaging the microservice logic into a single image, many container instances of the same specification can be deployed via K8s to support large-scale enterprise applications.

In the Agent era, however, the situation has changed completely. As shown in the figure below, an Agent's execution logic is driven by a large model. Faced with users' arbitrary natural-language questions, the model may produce completely different outputs each time, which in turn drives the Agent to call various external tools and even execute code dynamically generated by the model for the current request. This continues until the model believes the user's question has been solved, so the way an Agent processes each request may be entirely different.

For example, some simple requests may execute quickly and require few resources, while complex requests may require multiple rounds of interaction, tool calls, execution of AI-generated code, and so on. Some of the latest Agent techniques even launch new sub-Agents at runtime, all of which demands more time and compute. Operators of Agent applications therefore cannot predict in advance how complex the processing of a given request will be: how many rounds of interaction with the model are needed, which external tools will be called, or whether AI-generated code will be executed dynamically.
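The loop described above can be sketched as follows. All names here are hypothetical illustrations, not any real framework's API: a stand-in for the model call non-deterministically picks the next step, so the number of iterations, the tools invoked, and whether generated code runs all vary from request to request.

```python
import random

def call_llm(messages):
    # Placeholder for a real large-model call: the next step is
    # non-deterministic and cannot be predicted in advance.
    return random.choice([
        {"action": "tool", "name": "search"},
        {"action": "code", "source": "result = 1 + 1"},
        {"action": "finish", "answer": "done"},
    ])

def run_agent(user_input, max_steps=20):
    messages = [{"role": "user", "content": user_input}]
    for step in range(1, max_steps + 1):
        decision = call_llm(messages)
        if decision["action"] == "finish":
            return decision["answer"], step      # step count varies per request
        if decision["action"] == "tool":
            messages.append({"role": "tool", "content": decision["name"]})
        else:
            # Dynamically execute model-generated code: the security
            # risk discussed in the next section.
            scope = {}
            exec(decision["source"], scope)
            messages.append({"role": "tool", "content": str(scope["result"])})
    return None, max_steps
```

Two runs of `run_agent` with the same input may take entirely different paths and consume entirely different amounts of time and compute, which is exactly why static resource planning breaks down.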

In short, previous applications were simple and static, while Agent applications are complex and dynamic.

This raises a very practical first problem: how should resources be allocated to Agent applications? In the container-microservice era, developers and operators could configure identical resources for each container based on their understanding of the code and practical experience. In the Agent era, estimating how many resources an Agent needs has become genuinely difficult: allocating too few resources may cause failures or degrade service quality, while blindly allocating large resources to every instance obviously wastes enormous amounts of capacity.

Insecurity: tools and AI-generated code are untrustworthy

Another characteristic of Agents is that their execution logic may be insecure. While running, an Agent needs to execute code generated by large models or call certain external tools, and executing this AI-generated code and these tools can introduce real security risks. Traditional containers provide relatively weak isolation; once malicious code runs, problems such as container escape can occur.

An obvious solution is to replace traditional containers with more secure containers or virtual machines while still integrating with traditional container scheduling frameworks such as K8s through standard container interfaces, so that Agents run on existing container infrastructure with stronger isolation. Many of the security sandboxes currently offered for Agents in the industry do indeed adopt these technologies.

However, this may still not be enough. In the example in the figure below, the Agent's own logic is mixed into the same secure container/virtual machine as AI-generated code and other risky tool calls. Even if the secure container/VM isolates attacks against the host, risky code may still access and steal important private information inside the container/VM, such as the credentials used to access the large model. Security risks therefore cannot be fully eliminated in real Agent application scenarios.

A more reasonable approach is that whenever an Agent needs to execute AI-generated code or make a risky tool call, that work is dynamically scheduled onto another clean, secure container/virtual machine, as shown in the figure below, completely isolating it from the Agent's main body and thus eliminating the risk.

However, this requires that the infrastructure do more than statically deploy each container application at deployment time: during application runtime it must also dynamically schedule and launch new secure container/VM instances and execute particular code tasks on demand. This task-level dynamic scheduling and execution capability does not exist in the traditional K8s container-microservice stack.
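To illustrate the isolation idea only, the sketch below (all names hypothetical) runs model-generated code in a separate child process with model credentials stripped from its environment. A child process on the same host is merely a stand-in for dispatching the task to a freshly scheduled secure container/VM; it does not by itself provide the isolation discussed above.

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: int = 10) -> str:
    """Run model-generated code outside the Agent's main process.

    The child process stands in for a clean secure container/VM; crucially,
    credentials (here, any hypothetical LLM_* environment variables) never
    enter the child's environment, so the generated code cannot read them.
    """
    child_env = {k: v for k, v in os.environ.items()
                 if not k.startswith("LLM_")}        # strip model credentials
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], env=child_env,
                              capture_output=True, text=True, timeout=timeout)
        return proc.stdout
    finally:
        os.unlink(path)
```

In a real deployment the dispatch target would be a new secure container/VM instance launched on demand by the cluster scheduler, which is precisely the task-level capability the paragraph above calls for.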

Long-term sessions: how to keep session state consistent over long-running execution

In the past, cloud-native microservices generally advocated statelessness, for ease of operations and for horizontal elastic scaling of instance counts. The business logic of many applications is indeed relatively simple: much of the business data already lives in a database, and processing a request only requires modifying the database according to the request parameters, so the execution logic itself is genuinely stateless.

Agents, however, are inherently stateful. In a multi-turn dialogue, for example, a user's consecutive inputs must always be processed by the same Agent instance to keep the context consistent, so that the Agent can continue handling requests correctly.
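This same-instance requirement is typically met with session-affinity routing: every request carrying the same session id is sent to the same instance. A minimal hash-based sketch with hypothetical names (production systems usually use consistent hashing so that membership changes move as few sessions as possible):

```python
import hashlib

def route(session_id: str, instances: list) -> str:
    # Deterministically map a session id to one instance so that every
    # turn of the same dialogue lands on the same Agent instance.
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return instances[int(digest, 16) % len(instances)]
```

As long as the instance list is stable, every turn of a session is routed identically, preserving the in-memory dialogue context.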

At the same time, Agents keep evolving toward more complex tasks, so processing a single request takes longer and involves many external tool calls. If an instance fails mid-request in production, several Agent loop iterations may already have executed and some external tool calls may already have taken effect. In such a failure, simply restarting the instance and re-executing the request, as in the microservice world, may take a completely different execution branch because of the Agent's non-determinism. Different tools get invoked, the Agent ends up making mutually inconsistent external calls, and the final result is an execution error unacceptable to the business, a potentially fatal problem for enterprise production applications.

For example, as shown in the figure above, suppose a ticket-booking Agent has already called a tool to book a flight for a given itinerary when a machine failure occurs, before the request finishes processing. After the Agent recovers and reprocesses the request, its non-determinism changes the actual execution path and it books a high-speed rail ticket for the same itinerary instead. Such a failure obviously causes real business losses. And in real enterprise production environments, given enough running time, machine failures in the cluster are a certainty.
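One common way to make recovery consistent despite non-determinism is to journal every completed tool call durably (and, in practice, the model's responses too), so a recovered instance replays recorded results instead of re-executing side effects. The sketch below is purely illustrative and not any particular system's API:

```python
import json
import os

class ToolJournal:
    """Durable log of completed tool calls for consistent replay after failure."""

    def __init__(self, path: str):
        self.path = path
        self.log = []
        if os.path.exists(path):               # recovering: load prior calls
            with open(path) as f:
                self.log = [json.loads(line) for line in f]
        self.cursor = 0

    def call(self, name, args, fn):
        if self.cursor < len(self.log):        # replay: return the recorded
            entry = self.log[self.cursor]      # result, do NOT redo the effect
            assert entry["name"] == name, "replay diverged from the journal"
            self.cursor += 1
            return entry["result"]
        result = fn(**args)                    # first run: execute, then persist
        entry = {"name": name, "result": result}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")
        self.log.append(entry)
        self.cursor += 1
        return result
```

In the flight-booking example, the booking recorded before the crash is replayed on recovery rather than re-executed, so a second inconsistent booking cannot occur; note that for the replayed run to follow the same branch at all, the model's outputs must be journaled the same way.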

In summary, the high dynamics, insecurity, and long-term sessions that follow from the non-determinism of Agents pose a huge challenge to the existing infrastructure stack represented by K8s container microservices; deploying Agents at scale directly on that stack is genuinely difficult. So what kind of distributed infrastructure does the Agent era need?

What kind of distributed infrastructure is needed in the Agent era?

Traditional distributed infrastructure such as K8s is very good at managing cluster resources in the form of containers and allocating them to applications. K8s neither knows nor cares what application logic runs inside those distributed containers or whether the resources in them are fully utilized; that is left to its users. Likewise, K8s leaves it to users to decide how many resources a container needs and is only responsible for delivering containers of the specifications users request. This was not a big problem in the deterministic cloud-native microservice era, but in the non-deterministic Agent era it naturally runs into all the challenges described above.

Given the high dynamics, insecurity, and long-term sessions introduced by Agent non-determinism, what the Agent era needs is, in essence, not merely the ability to manually plan deterministic application logic into many identical containers for independent deployment. It requires a more flexible and powerful distributed system: one that maintains an Agent's session state correctly over long-running execution after it is scheduled and launched; that can dynamically launch new subtasks at runtime to run potentially risky code, sub-Agents, and so on, with key context data shared and transferred between them; and in which Agents and the subtasks/sub-Agents they launch use cluster resources efficiently and dynamically according to actual runtime needs, without users having to, or being able to, specify them in advance.

Does this sound familiar? It is much like running programs on a single-machine OS. Programs run for long periods as processes, continually reading and modifying in-memory state; they dynamically launch subprocesses according to their own logic, transferring data and collaborating through RPC, shared memory, and so on; and all processes use the machine's resources according to their actual needs, without users specifying them in advance.

The only difference is that for enterprise-level Agent applications we now need to run Agents on a cluster. In essence, then, we need a distributed system over the cluster with the flexible, dynamic scheduling and elastic resource utilization of a single-machine OS, supporting long-running stateful execution. And because it is distributed, it must also support automatic recovery from failures while guaranteeing state consistency after recovery.

Is there a distributed system in the industry that meets the operation requirements of Agents?

Related work in the industry

The answer is yes. Below are several related industry efforts the author considers relevant, for readers' reference.

openYuanrong

To the author's current knowledge, the most suitable open-source system is openYuanrong[1].

The core design concept of openYuanrong is to build a distributed kernel analogous to a single-machine OS and use it to uniformly support all kinds of distributed application workloads. This fits the typical problems of the Agent scenarios described above very well.

Support for high dynamics of Agents

Running Agents on openYuanrong naturally provides automatic elasticity of Agent instances, without operators having to worry about how to configure container resources. openYuanrong adopts typical Serverless auto-scaling techniques and can dynamically adjust the number of Agent instances with the request load, even scaling down to zero when there are no requests. Beyond this horizontal elasticity, openYuanrong also offers vertical elasticity, dynamically adjusting each instance's container specification to the Agent's actual resource needs. This absorbs fluctuations in request load while letting each instance use resources dynamically and efficiently, eliminating the problem of how to size resources for Agent applications.

In addition, openYuanrong has important dynamic scheduling capabilities: Agents can dynamically launch new subtasks/sub-Agents at runtime, even launching many concurrently for distributed parallel processing. This suits some of the latest Agent patterns, such as Agent swarms.
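As a single-machine analogy for this fan-out (explicitly not openYuanrong's actual API), the sketch below launches several hypothetical sub-agents concurrently and gathers their results; a distributed scheduler would do the same across cluster nodes instead of a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(subtask: str) -> str:
    # Stand-in for a dynamically launched sub-Agent handling one subtask.
    return "handled:" + subtask

def fan_out(subtasks: list) -> list:
    # Launch subtasks concurrently and collect results in order: the same
    # shape as an Agent swarm fanning work out across a cluster.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(sub_agent, subtasks))
```

The point of task-level scheduling is that this fan-out happens at runtime, driven by the Agent's own logic, rather than being fixed at deployment time.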

Solving the insecurity problem of Agents

openYuanrong supports multi-tenancy and security isolation. Working with the underlying K8s and similar layers, it can schedule instances into different containers as needed. In Agent scenarios, combined with its dynamic scheduling, openYuanrong can schedule genuinely risky code, such as AI-generated code, into an independent secure container, completely isolated from the container running the Agent's main body, preventing the leaks of secrets such as large-model access credentials that sharing a container would risk.

Support for long-term sessions of Agents

openYuanrong supports scheduling and long-running execution of stateful instances, meeting the Agent's own state-access needs. In multi-turn session scenarios, it supports session-affinity request routing to keep context consistent across turns. Furthermore, through openYuanrong's data system, Agents can back up their state in real time across the distributed cluster, so that even after a failure the recovered instance's state remains consistent, enabling semantically consistent resumption from the point of interruption and a correct final result.

Beyond matching these Agent characteristics, openYuanrong also provides capabilities such as heterogeneous compute support, scheduling Agents, large-model inference services, Agentic RL, and other workloads in the same cluster so they collaborate efficiently and fully share the cluster's various compute resources.

Ray

Like openYuanrong, Ray[2] is one of the few systems in the industry with mature task-level dynamic distributed scheduling, so it can meet the need to dynamically launch subtasks during Agent execution. Ray's Actors are also stateful, satisfying the requirement for long-running stateful Agent execution.

However, Ray has historically been used more for offline distributed computing. Supporting online serving workloads may require additional work in areas such as request ingress, and Ray still has notable gaps in security isolation, multi-tenancy, elasticity, and so on, which make it hard today to fully solve Agent security and efficient resource utilization. It may therefore not be suitable for directly supporting large-scale enterprise-level online Agent applications.

Anthropic Managed Agents

While writing this article, the author also noticed a recent Anthropic article on Managed Agents[3]. Beyond concepts Anthropic proposed earlier, such as Harness and Tool, it explicitly introduces new concepts such as Session and Sandbox and argues that these should be decoupled from one another to better address concerns such as fault tolerance and security.

Although the perspective differs slightly, the ideas of separating Sandbox from Harness, Many Brains, and Many Hands align closely with the views in this article. For example, separating Sandbox from Harness is exactly to solve the