Once agents become more credible, will enterprises gain a new batch of useful "digital employees"?
As AI technology evolves from "toolization" toward "autonomization," agents are becoming an important form in which enterprises apply large models. So how can we optimize agents to make them more trustworthy and easier to use, and ultimately turn them into excellent "digital employees" for enterprises?
Recently, InfoQ's live-streamed program "Geek Gathering," produced in collaboration with AICon, invited Mark Wei, a senior application support analyst at RBC, to host. Together with Wang Yunfeng, CTO of Smzdm Technology; Lu Fei, senior technical director of the Large-scale Device Business Group at SenseTime; and Wu Haoyu, senior technical director of Minglue Technology Group, they discussed how to enhance the "trustworthiness" of enterprise agents on the eve of the AICon Global Artificial Intelligence Development and Application Conference 2025 in Beijing.
Highlights from the discussion include:
- In the future, the traditional software interface may no longer exist; the UI may disappear entirely, with agents interacting with systems directly.
- The value of protocols lies in enabling every role in the ecosystem, including vendors that provide brains, data, tools, or execution capabilities, to speak the same language. This lets everyone focus on their own specialty instead of spending large amounts of time on adaptation work.
- Many models tend to forget the middle portion of long texts, so finding a needle in that haystack is unrealistic. Although many claim extremely long context windows, the truly effective portion is much smaller.
- The real difficulty lies in unifying and aligning the business model with the technical model. Parameter alignment at the purely technical level is the least of the worries; after all, everyone is writing programs.
The following content is based on the live-stream transcript and has been abridged by InfoQ.
Defining the Technical Boundaries of Agents
Mark Wei: Many people think that an agent is just a chatbot with a few plugins. But from a technical architecture perspective, when the system goal changes from "dialogue" to "action", what do you think is the biggest qualitative change in the technology stack?
Wang Yunfeng: I think a chatbot is just a form of interaction. In the past, interaction on the Internet relied mainly on clicks. Later, when AI gained the ability to hold a dialogue, people began interacting with it through a dialogue box. Then end-to-end voice emerged, giving AI stronger multimodal capabilities and making dialogue more natural and powerful. The rise of agents has since further expanded its operable scope.
A chatbot is just an interface, and the key logic lies in the large model behind it. We often compare the large model to "a brain", while the traditional chatbot only allows users to extract knowledge from it through simple interactions. However, a smart brain needs a peripheral system, just as a person needs eyes, a nose, and a sense of touch to perceive the world and hands and feet to perform operations.
The complete process includes: the model receives a task, determines the action to be taken, perceives the outside world, receives feedback, and continuously adjusts the plan based on the feedback. This is significantly different from the previous simple chatbot model. Its technical complexity and requirements for the ecosystem are much higher than those of the dialogue system.
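A minimal sketch of this receive-act-perceive-adjust loop appears below. `call_llm` and `execute_tool` are hypothetical stand-ins for a real model API and real tools, not any particular framework's interface.

```python
def call_llm(history):
    # Stand-in: a real implementation would call a large model here.
    return {"type": "final_answer", "content": "done"}

def execute_tool(name, args):
    # Stand-in: a real implementation would run the named tool.
    return f"result of {name}({args})"

def run_agent(task, max_steps=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(history)             # decide the next action
        if decision["type"] == "final_answer":   # task judged complete
            return decision["content"]
        observation = execute_tool(decision["tool"], decision["args"])
        history.append({"role": "assistant", "content": str(decision)})
        history.append({"role": "tool", "content": observation})  # feedback
    return "step budget exhausted"

print(run_agent("summarize yesterday's sales"))
```

The loop is the whole qualitative change: feedback flows back into the model's next decision automatically rather than through a human.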
Lu Fei: The core difference between AI aimed at dialogue and AI aimed at action is that the former focuses on the process, while the latter focuses on the result. Many programmers will remember what GitHub Copilot was like when it first came out, before agent mode existed. Back then it only offered code completion, which was essentially a chatbot. The typical workflow was: the model completed the code; if the result was unsatisfactory, the programmer gave feedback and asked it to revise; when execution reported an error, the programmer pasted the error back and let the model keep revising, repeating this until the code ran successfully.
Later, agent mode emerged. In essence it is still the same thing, but the agent can execute the entire process automatically. Previously the planning was done by humans, who had to keep the task steps in their heads, decide when to switch to testing, and manage a series of context-related tasks such as returning to coding based on the results.
The emergence of agents has incorporated these processes into the system: the agent plans on its own, calls tools, and manages the context. The core is that the model has stronger memory and context-management capabilities, transferring the short-term, medium-term, and long-term memory and state-switching work that humans used to maintain into the agent itself.
Therefore, an agent can work continuously in a cycle for dozens of minutes or even days, and always know what it has done, what it is doing, and what it will do next. This reflects a significant qualitative change in the current agent technology stack.
Wu Haoyu: This is also related to the recent popularity of the Doubao phone. The phone jointly launched by Doubao and Nubia can be directly controlled by AI, but platforms such as WeChat and Taobao later prohibited it from logging in. I tried it for a while, and it is really powerful. It is no longer in the form of dialogue but can gradually complete operations on the phone according to your instructions.
This means that AI has the ability to plan and execute tasks, which is also very important in the enterprise scenario. For example, when we ask AI to judge the popularity of a certain topic, it should not just simply search and answer, but should plan the whole set of steps, including retrieving, aggregating relevant words and posts, performing sentiment clustering, and generating reports. This is completely different from the past, when it only did Q&A or simple text analysis, and the requirements for planning and scheduling capabilities have increased significantly.
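As a rough illustration of the planned, multi-step pipeline described here (retrieve, aggregate terms, cluster by sentiment, report), the sketch below uses toy stand-ins for every component; none of the function names come from a real product.

```python
from collections import Counter

def retrieve_posts(topic):
    # Toy retrieval: a real step would query Xiaohongshu/Weibo data.
    return [f"{topic} is great", f"{topic} is terrible", f"love {topic}"]

def aggregate_terms(posts):
    # Aggregate related words across posts.
    return Counter(w for p in posts for w in p.split()).most_common(5)

def cluster_by_sentiment(posts):
    # Toy clustering: a real step would use a sentiment model.
    positive = [p for p in posts if "great" in p or "love" in p]
    return {"positive": positive,
            "negative": [p for p in posts if p not in positive]}

def build_report(topic):
    posts = retrieve_posts(topic)              # step 1: retrieve
    terms = aggregate_terms(posts)             # step 2: aggregate terms
    clusters = cluster_by_sentiment(posts)     # step 3: sentiment clustering
    return {"topic": topic, "top_terms": terms,          # step 4: report
            "positive": len(clusters["positive"]),
            "negative": len(clusters["negative"])}

print(build_report("the new phone"))
```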
Secondly, once the system has the "ability to act," its permissions and responsibilities expand with it. AI can access the phone's photo album and chat records; in the enterprise, it may access work software and databases. This means the system must be traceable and open to intervention, with clear security boundaries set, otherwise its behavior becomes uncontrollable. Therefore, in the agent architecture we add a large number of monitoring and verification mechanisms as well as a manual closed loop, which is also one of the biggest differences from the earlier chatbot model.
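A minimal sketch of what "traceable and intervenable" can mean in practice: every tool call lands in an audit log, and calls in sensitive scopes must pass a human approval hook before executing. The scope names and executor below are illustrative assumptions, not any product's API.

```python
import time

SENSITIVE_SCOPES = {"database.write", "hr.records", "finance.transfer"}
AUDIT_LOG = []  # every call is recorded here, blocked or not

def execute_tool(tool, args):
    return f"executed {tool}"   # stand-in for the real executor

def call_tool_guarded(tool, scope, args, approve=lambda entry: False):
    entry = {"ts": time.time(), "tool": tool, "scope": scope, "args": args}
    AUDIT_LOG.append(entry)                        # traceable after the fact
    if scope in SENSITIVE_SCOPES and not approve(entry):
        entry["blocked"] = True                    # intervenable before harm
        return "BLOCKED: awaiting human approval"
    return execute_tool(tool, args)

print(call_tool_guarded("update_salary", "hr.records", {"employee_id": 42}))
```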
Mark Wei: Agents today often "become stupid" or "get stuck." Looking at the three aspects of computing power supply, data supply, and protocol interaction, which is the current weak link? Is inference too slow to keep up with the thinking, or does short context memory limit the reasoning?
Lu Fei: More precisely, it is the shortage of cost-effective computing power. Looking at agents that have actually been deployed, the problem often lies not in the absence of computing power but in the trade-off between cost and effect. In many application scenarios, flagship models are not used; smaller models such as 30B or even 7B are selected instead. Although these models may support a context window of 100K or even 200K, in actual use the context is still limited to 32K or smaller, essentially to reduce costs. Similarly, we also limit the number of rounds of in-depth thinking an agent performs. For example, in Cursor, enabling Max mode with the best model to generate a feature can exhaust the monthly quota in twenty minutes. If more cost-effective computing power becomes available, today's top models and algorithms will be able to exert their full capabilities across a much wider range of scenarios.
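One way to implement the cost cap Lu describes is to trim the conversation to a fixed token budget before each model call, keeping the system prompt and the most recent turns. The sketch below is illustrative; a real system would use a proper tokenizer rather than whitespace counting, and the budget is an assumed value.

```python
MAX_TOKENS = 32_000   # cost budget, even if the model supports 200K

def n_tokens(message):
    return len(message["content"].split())   # crude stand-in for a tokenizer

def trim_context(messages):
    """Keep the system prompt plus the most recent turns within budget."""
    system, rest = messages[:1], messages[1:]   # assumes system prompt first
    kept, total = [], n_tokens(system[0])
    for msg in reversed(rest):                  # newest turns matter most
        if total + n_tokens(msg) > MAX_TOKENS:
            break
        kept.append(msg)
        total += n_tokens(msg)
    return system + list(reversed(kept))
```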
Wu Haoyu: The data quality of the context is also extremely important. Even if the context is long, if the information quality is low and noisy, the task plans and results the model produces will still fall short. When we do public opinion analysis, we often work with posts from platforms such as Xiaohongshu and Weibo, but the information density of such content is generally low. If we throw ten thousand posts directly at the large model for summarization, the viewpoints and facts we get back will be either incomplete or skewed. So we usually preprocess the data before handing it to the model to generate a summary or report.
In addition, an agent's context often comes from many previous rounds of interaction, and some of that information is useful while some of it is just failed attempts. Context compression technology already exists, but most of it is passive: compression kicks in only when the window limit is near. In fact, we need to compress more frequently so the retained information is denser, more real, and more reliable, which in turn improves the agent's planning. So for an agent to run well, it must be fed credible, high-density, high-quality data.
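A sketch of the "compress early" idea: rather than waiting until the window is nearly full, summarize older turns once usage crosses a much lower threshold. The threshold, the number of turns kept verbatim, and the `summarize` stub are all assumed values for illustration, not a specific product's behavior.

```python
WINDOW = 32_000
COMPRESS_AT = 0.5   # compress at 50% usage instead of near the limit

def summarize(messages):
    # Stand-in for a real model-generated summary.
    return f"{len(messages)} earlier turns condensed"

def maybe_compress(messages):
    used = sum(len(m["content"].split()) for m in messages)
    if used < COMPRESS_AT * WINDOW:
        return messages                            # plenty of room, do nothing
    old, recent = messages[:-6], messages[-6:]     # keep last turns verbatim
    summary = {"role": "system",
               "content": "Summary of earlier work: " + summarize(old)}
    return [summary] + recent
```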
Wang Yunfeng: As models grow stronger and context windows grow larger, what really determines an agent's effectiveness is often not the model but the quality of the enterprise's private data. The model can be swapped: if one doesn't work, we can pick a better one or improve results through fine-tuning or distillation. But data preprocessing is far harder than selecting or fine-tuning a model. We must admit that unprocessed data is completely unusable, and building high-quality data is very difficult.
Even as the context window grows, if the input is too long, the model's probability of error still rises. For example, in scenarios that require understanding user needs, ten thousand pieces of content are far from enough to form a sound judgment; a reasonable sampling scale is one hundred thousand or even one million pieces. But once the data volume reaches the million level, the result of feeding it directly into the model is almost unusable. And as the processing chain grows longer, even if each link is 90% usable, after ten links the overall usability falls to roughly 35% (0.9^10 ≈ 0.35), an unacceptable level.
Completing an agent requires multiple links and modules working together. The brain is just one module; a large amount of data is also needed, including both enterprise private data and professional information in areas such as finance, law, and privacy protection. In the future, each type of capability or module may be provided by a different vendor: one group of vendors focuses on large models to make the "intelligent brain" stronger, while data providers ensure the data is real, reliable, and information-dense, with some of it coming from real-time perception.
In such a multi-party collaborative system, the protocol becomes critically important. No single vendor can do all the work; the ecosystem necessarily involves many participants. If every call to external data or tools required fresh adaptation, efficiency would plummet. The value of the protocol therefore lies in enabling every role in the ecosystem, including vendors that provide brains, data, tools, or execution capabilities, to speak the same language, letting everyone focus on their own specialty instead of spending large amounts of time on adaptation work.
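MCP, discussed next, is built on JSON-RPC 2.0, which is what makes this "same language" concrete: a client asks any compliant server to run a tool with the same message shape, regardless of vendor. The tool name and arguments below are invented for illustration.

```python
import json

# One JSON-RPC 2.0 request, the wire format MCP standardizes. Any
# compliant server understands "tools/call"; the tool itself is made up.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_sales_db",                       # hypothetical tool
        "arguments": {"region": "north", "quarter": "2025Q3"},
    },
}
print(json.dumps(request, indent=2))
```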
Mark Wei: If the future is a world of multi-agent collaboration, agents need a standard for communicating. Mr. Wang has been advocating MCP (Model Context Protocol). What do Mr. Lu and Mr. Wu think? In 2026, will the inter-agent connection protocol become open-source and unified, or will it be dominated by the large vendors?
Lu Fei: The future will certainly be a world of multi-agent collaboration, and the relationships between agents will be far more complex than today, with many-to-many, open-ended interaction. A unified agent interaction protocol is therefore especially important. I firmly believe the protocol will ultimately become open-source and unified, moving toward a neutral, open, self-governing state, and likely very quickly.
We can look back at the history of the Internet. TCP/IP and OSI, for example, competed for more than a decade. Eventually TCP/IP was handed to the IETF for maintenance, forming a neutral governance model; during that period, both hardware makers and software developers faced the dilemma of adapting to two protocol stacks at once. HTTP went through a similar process, taking a long time to become open and self-governing. More recently, projects such as Kubernetes and gRPC reached neutral governance in about two or three years. The same is happening with MCP: as I recall, Anthropic recently donated the MCP protocol to the AIF, with OpenAI, Google, and Microsoft all among its members, only about a year after MCP was first released.
In today's technological environment, the major vendors fully understand that embracing open source, building the ecosystem together, and avoiding protocol wars gives developers and enterprises stable expectations, letting them build on the MCP ecosystem with confidence, without fear of being locked in by a single vendor.
Wu Haoyu: The major vendors have shown strong support for MCP. Although it has only been around for about a year, it has demonstrated clear advantages, and almost all vendors have connected to it, making it the de facto standard for multi-agent communication. Anthropic has also launched new capabilities on top of MCP. For example, standard MCP requires sequential calls with a wait for each response, while Anthropic's recent PDC approach combines multiple MCP calls into one through code. Our test results match Anthropic's official conclusion: this method can reduce context length by 80% or more.
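We can't verify the "PDC" name from the transcript, but the general pattern of combining several MCP calls through code works like this: the model emits one script that chains the calls locally, so intermediate results never pass through the model's context and only the final output returns. A hedged sketch with an invented client and tool names:

```python
class FakeMCP:
    # Stand-in for a real MCP client session; both tools are invented.
    def call(self, name, args):
        if name == "search_posts":
            return [{"text": "love it", "likes": 250},
                    {"text": "meh", "likes": 3}]
        return f"summary of {len(args['posts'])} popular posts"

mcp = FakeMCP()

def combined_call(topic):
    posts = mcp.call("search_posts", {"query": topic})   # intermediate result
    hot = [p for p in posts if p["likes"] > 100]          # filtered locally,
    return mcp.call("summarize", {"posts": hot})          # never hits context

print(combined_call("new phone"))
```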
So even if the underlying protocol is unified, the upper-layer ecosystem will keep innovating. Especially in the era of large models, many protocols and ecosystems never seen before may emerge. If the underlying protocol is stable and reliable, the upper layers have more room to innovate. For application vendors, exploring new capabilities and features on top of the protocol is not only an opportunity but also helps us serve enterprises and users better.
Dissecting the Agent Architecture Layer by Layer
Mark Wei: Mr. Lu, the biggest obstacle to deploying agents in enterprises is often cost. The way agents operate requires maintaining an extremely long context, which consumes large amounts of GPU memory and bandwidth. At the large-scale device level, do you have any special optimizations for the "long-range inference tasks" of agents?
Lu Fei: As a single agent task runs longer and longer, the context expands significantly, which not only hurts performance but also drives up cost. To solve the context problem in long-range inference, there are currently several methods. The most basic is context compression, including summarization and structured compression. Another is persisting long-term memory: storing high-value, information-dense content in external storage, such as a vector DB or a knowledge graph, so that high-value information carries across sessions. These all fall under "context engineering" and are essentially means of raising information density.
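A minimal sketch of the long-term memory idea: persist high-value facts to an external store and retrieve them across sessions by similarity. A real system would use an embedding model and a vector database; here a toy bag-of-words overlap stands in for embeddings, and all data is invented.

```python
from collections import Counter

MEMORY = []   # list of (text, pseudo-embedding) pairs

def embed(text):
    return Counter(text.lower().split())   # stand-in for a real embedding

def remember(fact):
    MEMORY.append((fact, embed(fact)))     # persist high-value information

def recall(query, k=3):
    q = embed(query)
    scored = [(sum((vec & q).values()), text) for text, vec in MEMORY]
    return [t for score, t in sorted(scored, reverse=True)[:k] if score > 0]

remember("Customer Acme prefers quarterly billing.")
remember("The Q3 report uses the new revenue schema.")
print(recall("how does acme like to be billed"))
```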
In addition, we also optimize the KV Cache. For example, we use CPU memory or even SSD for hierarchical storage to improve system throughput. At the same time, we can perform different levels of quantization on the KV Cache. However