
Embracing the Middle-Layer Force in the AGI Era: Opportunities and Challenges of AI Middleware

Geekbang (极客邦科技) InfoQ · 2025-08-05 17:51

Development Trends of Large Models: From Capability Leap to Ecosystem Openness

In recent years, the most remarkable progress in the field of artificial intelligence has been the rapid development of large models. With their astonishing learning and generalization abilities, these models are profoundly changing our perception of AI and propelling the dream of AGI (Artificial General Intelligence) from science fiction to reality. The development of large models shows two core trends: the continuous leap in model capabilities and the increasing openness of the model ecosystem.

1.1 Continuous Leap in Model Capabilities

The improvement of large-model capabilities was not achieved overnight; it has been realized gradually through continuous iteration and technological innovation. From early text generation to today's multimodal understanding and reasoning, large models are showing an intelligence level increasingly close to that of humans. Taking OpenAI's ChatGPT series as an example, its evolution path clearly depicts this capability leap:

Significant Enhancement in Language Ability: From GPT-3.5 to GPT-4, the model made a qualitative leap in language understanding, generation, and logical reasoning. GPT-4 far surpasses GPT-3.5 in handling complex problems, generating high-quality text, and performing multilingual translation. This enables large models to understand context more deeply and generate more coherent and accurate content.

Breakthrough in Multimodal Capabilities: With the release of GPT-4o, large models are no longer limited to text and begin to natively support input and output of any combination of text, audio, images, and video. The model can process and understand multiple types of data simultaneously: a user can, for example, converse in text while uploading images or audio, and the model handles all of it together, achieving a more natural and richer interactive experience. This multimodal capability greatly expands the application boundaries of AI, enabling it to better perceive and understand the real world.

Deepening of Reasoning Ability: OpenAI's o1 model further emphasizes the reasoning ability of large models. Trained with reinforcement learning, o1 can "think" before answering and generate an internal chain of thought, allowing it to perform more complex reasoning tasks and excel especially in programming and mathematical reasoning. This marks a shift from "fast thinking" based on knowledge memorization to "slow thinking" with in-depth logical analysis, enabling models to solve more challenging problems.

Expansion of Tool-Use Ability: With the launch of the o3 model, large models begin to autonomously call and integrate tools. The model can not only understand a problem but also independently select and use external tools (such as web search, code executors, and data-analysis tools) to solve it. This ability lets AI Agents interact with the environment at a deeper level and realize more complex task automation.

In addition to the OpenAI series, other leading large models show strong capabilities in their respective fields. Google's Gemini model is known for its powerful multimodal reasoning, capable of simultaneously understanding and processing text, images, and audio, and excelling at complex coding and analyzing large-scale databases. Anthropic's Claude Sonnet 4 performs excellently in programming and reasoning and is considered one of the top programming assistants today. The continuous emergence and improvement of these models make the dream of AGI no longer distant. Unlike Narrow AI systems that focus on specific fields (such as IBM Watson, Deep Blue, and Google's AlphaGo), AI Agents backed by LLMs (Large Language Models) have more generalized understanding, reasoning, and planning abilities, can solve more general problems, and are expected to keep evolving toward general intelligence.

1.2 Increasing Openness of the Model Ecosystem

Parallel to the development of proprietary models (such as OpenAI's closed-source models) is the booming rise of open-source large models.

The Rise of the Open-Source Wave: Starting from Meta's release of the LLaMA series, many teams at home and abroad have successively launched high-quality open-source models such as Qwen, DeepSeek, Kimi, and Mistral, so large-model technology is no longer exclusive to a few technology giants. These open-source models not only provide powerful base capabilities but also allow developers to freely access, use, and fine-tune them, greatly lowering the threshold for AI development.

The Catch-Up of Open-Source Model Capabilities: Notably, some open-source models are rapidly approaching, and on certain tasks even surpassing, proprietary models. For example, open-source models such as DeepSeek R1 and Kimi K2 show remarkable performance in reasoning and code generation. This trend means high-quality AI capability is no longer the exclusive resource of a few giants; every industry can obtain powerful model capabilities at low cost.

This trend is driving AI applications into a period of explosive growth. Just as Linux released a wave of innovation by breaking the operating-system monopoly, the open large-model ecosystem is giving birth to a rich variety of intelligent applications and injecting strong impetus into the intelligent transformation of industries.

Evolution of AI Applications: From Chatbots to Organizational-Level Intelligent Agents

2.1 Evolution Path of AI Applications

The rapid development of large-model capabilities has directly driven a profound transformation in the form of AI applications. OpenAI outlined a path to AGI in an internal meeting (as shown in Figure 1), which provides an insightful framework for understanding the evolution of AI applications [1][2]:

Level 1: AI with conversational language capabilities: At this stage, AI mainly takes the form of chatbots, capable of fluent text conversation, understanding and responding to user instructions. The early ChatGPT is a typical representative of this stage.

Level 2: AI with human-level problem-solving abilities: At this stage, AI begins to show stronger reasoning abilities and can solve complex mathematical and logical problems. These systems are no longer just information-retrieval tools but "reasoners" capable of in-depth thinking and analysis. DeepSeek R1 is a typical representative of this stage.

Level 3: Systems that can take actions on behalf of users: AI at this stage is called an "agent". It can not only think but also interact with the external environment by calling tools and complete tasks autonomously. For example, through tools such as code executors and browsers, AI can perform a much wider range of operations. Recently popular agents such as Manus and Claude Code fit the definition of this stage.

Level 4: AI that can aid in invention and discovery: AI at this level can perform deeper creative work and assist humans in scientific research, discovery of new materials, etc.

Level 5: AI that can perform the work of an entire organization: This is the ultimate goal of AGI. AI can operate like a complete organization, autonomously complete various business processes, and achieve full-scale intelligence.

Image source: https://www.linkedin.com/posts/gusmclennan_openai-agi-aiprogress-activity-7238696300790038530-rmjk/

Currently, the development of AI applications is progressing steadily along this trend. From the initial ChatGPT chatbot to the subsequent ability to conduct online searches, then to in - depth research through thinking and multi - round retrieval, and the recent emergence of various Agent applications, all confirm this evolution path.

2.2 Explosion of AI Agents

In the past six months, the field of AI Agents has shown explosive growth, with a large number of general-purpose and vertical-field intelligent agents emerging:

General Agents: Such as Manus, Genspark, and ChatGPT Agent. They aim to solve a wide range of general problems and provide users with one-stop services by integrating terminals, browsers, computers, and other tools. These general agents show great potential in handling daily tasks, information queries, content creation, and more.

Professional Agents: In specific fields, a large number of highly specialized agents have emerged, such as coding agents like Claude Code, Gemini CLI, and Qwen Code, and AI coding IDEs like Cursor, Trae, and Kiro. They can assist with, or even autonomously complete, tasks such as code writing, debugging, and testing, greatly improving development efficiency.

The core difference between these AI Agents and other AI applications is that they have learned to use tools and can interact with the environment (terminals, browsers, computers). Behind this is autonomous learning driven by Reinforcement Fine-Tuning (RFT), which teaches the model how to use these tools effectively to solve problems.

It is worth noting that these agents still keep a "Human in the Loop" during execution. For example, ChatGPT Agent requests user confirmation before performing operations with potentially significant impact (such as placing a purchase order), and Claude Code stops for user review before executing risky terminal commands, ensuring safety and controllability.
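This confirmation pattern can be sketched in a few lines. The gate below is illustrative only, not the actual mechanism inside ChatGPT Agent or Claude Code; the tool names and the risk list are assumptions for the example.

```python
# Minimal human-in-the-loop gate: risky tool calls pause for explicit
# user approval before executing. Tool names and the risk set are
# illustrative assumptions, not any real agent's policy.

RISKY_TOOLS = {"shell", "purchase", "delete_file"}

def run_tool(name, args, execute, confirm=input):
    """Execute a tool call, pausing for user approval if it is risky."""
    if name in RISKY_TOOLS:
        answer = confirm(f"Agent wants to run {name}({args!r}). Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            return {"status": "rejected", "tool": name}
    return {"status": "ok", "tool": name, "result": execute(name, args)}
```

Safe tools pass straight through; only calls in the risk set block on the reviewer, which keeps the interaction overhead proportional to actual risk.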

2.3 Coexistence and Complementation of General Agents and Vertical Agents

As large-model capabilities grow, a question arises: will a few general agents eventually handle all tasks, or will different industries still need their own vertical agents?

There is no definite conclusion in the industry yet, but many practitioners lean toward the latter: vertical-field intelligent agents retain irreplaceable value. The reason is that business scenarios often require deep integration of domain knowledge, proprietary data, and specific tools; these are external to the model and must be integrated and optimized at the Agent level. Take an enterprise's intelligent customer-service agent as an example. It needs:

Deep business knowledge (External Knowledge): To accurately understand the company's product manuals, service terms, and business processes.

Personalized user memory (Memory): To know the user's historical orders, service preferences, and communication habits.

Proprietary business tools (Tool): To be able to call internal APIs such as order query, refund processing, and logistics tracking.

This context information, deeply bound to the business scenario, is difficult for general agents to match. At the same time, base-model training cycles are long and costly, so the model itself cannot keep up with rapid business change. Therefore, building a layer of vertical agents on top of a powerful base model, deeply integrating business knowledge, data, and tools, will be an inevitable choice for enterprise AI adoption. It is foreseeable that general agents and vertical agents will coexist and complement each other for a long time: the former solve common problems, the latter address the long-tail needs of industries.

In the more distant future, agents with embodied intelligence may appear: AI endowed with more sensory and action capabilities in the physical world. Beyond text, voice, and images, researchers are trying to connect intelligent agents to sensors for smell, taste, and touch, and to influence the real environment through robotic arms, robots, and other actuators.

The evolution of AI applications is essentially a process of interaction between the model and the environment (browsers, code, APIs, the physical world). This process depends on improving model capabilities, but it also faces a series of engineering challenges: Agent R&D, multi-agent collaboration, RAG effectiveness, model hallucination, and tool use. The key to solving these challenges is AI middleware.

Opportunities and Challenges of AI Middleware

In the era of distributed systems and cloud-native computing, middleware greatly improved software R&D efficiency by shielding underlying complexity and providing standardized interfaces. Similarly, in the AI era, the emerging AI middleware plays an analogous role: as the "middle layer" connecting base large models and specific applications, it provides developers with the basic capabilities and frameworks required to build intelligent applications. In this part, we discuss the opportunities AI middleware offers and the challenges it faces during implementation.

3.1 Opportunities for AI Middleware

3.1.1 Efficiency Improvement in Agent R&D

Developing a fully functional AI Agent involves model calling, vector retrieval, prompt design, tool integration, dialogue management, and more. AI middleware can provide a one-stop Agent R&D framework that modularizes and standardizes these commonly used functions, significantly lowering the development threshold. For example:

Abstract and encapsulate the underlying LLM to facilitate the switching of different models.

Provide ReAct templates that support interleaved reasoning and action (chain-of-thought plus tool calls).

Seamlessly integrate RAG (Retrieval-Augmented Generation), short-/long-term memory stores, and various external tool plugins.
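To make the first two points concrete, here is a minimal sketch of the kind of loop such a framework encapsulates: the LLM is abstracted behind a plain callable (so models can be swapped), and reasoning alternates with tool actions in ReAct style. All names are illustrative and not taken from any specific middleware.

```python
# Minimal ReAct-style agent loop. `llm` is any callable that maps the
# scratchpad text to either a final answer or a tool request; `tools`
# maps tool names to callables. The dict shapes are assumptions.

def react_loop(llm, tools, question, max_steps=5):
    """Alternate reasoning (llm call) and action (tool call) until done."""
    scratchpad = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(scratchpad))           # reasoning step
        if step["type"] == "final":
            return step["answer"]
        observation = tools[step["tool"]](step["input"])  # action step
        scratchpad.append(f"Action: {step['tool']}({step['input']})")
        scratchpad.append(f"Observation: {observation}")
    return None  # step budget exhausted without a final answer
```

Because the model sits behind a single callable, switching the underlying LLM leaves the loop, tools, and scratchpad format untouched.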

In addition, since Agent workloads are usually event-driven, highly concurrent, and unpredictable in per-invocation duration, a serverless architecture (Serverless/FaaS) is a natural fit for the Agent runtime. In this mode, compute instances are scheduled automatically when task requests arrive and released when idle, giving elastic scaling and lower operation and maintenance costs.
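A sketch of what "Agent as a serverless function" might look like: a stateless handler invoked once per task event, with conversation state externalized so instances can scale to zero between requests. The event shape and store interface below are assumptions for illustration, not a real FaaS contract.

```python
# Stateless FaaS-style agent handler with externalized session state.
# InMemoryStore stands in for an external store (e.g. Redis, a database);
# all shapes here are illustrative assumptions.

class InMemoryStore:
    """Stand-in for an external state store shared across invocations."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key, [])
    def put(self, key, value):
        self._data[key] = value

def handle_agent_task(event, store):
    """Entry point invoked per task event: load state, run one step, persist."""
    history = store.get(event["session_id"]) + [event["input"]]
    result = f"processed {len(history)} message(s)"  # real agent logic here
    store.put(event["session_id"], history)          # persist before exit
    return {"session_id": event["session_id"], "result": result}
```

Because the handler holds no state between calls, the platform is free to run any invocation on any instance, which is what enables elastic scale-out and scale-to-zero.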

Moreover, as Agents grow more complex, evaluation and testing become important. Middleware has the opportunity to provide an Agent evaluation framework analogous to unit testing (UT) or integration testing (IT), simulating environmental feedback to verify an Agent's decision-making and output quality and closing the R&D loop.
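A minimal sketch of such a harness, in the spirit of unit testing: each case pairs an input with a predicate on the Agent's output, and the harness reports a pass rate plus the failing cases. The case format and scoring are illustrative assumptions.

```python
# UT-style agent evaluation harness (illustrative). `agent` is any
# callable; each case supplies an input and a `check` predicate.

def evaluate_agent(agent, cases):
    """Run the agent over all cases; return pass rate and failures."""
    failures = []
    for case in cases:
        output = agent(case["input"])
        if not case["check"](output):
            failures.append({"input": case["input"], "got": output})
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}
```

Run regularly against a fixed case suite, a report like this catches behavioral regressions when prompts, tools, or the underlying model change.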

Overall, supporting the entire Agent lifecycle (development, deployment, monitoring, and evaluation) will be a stage for AI middleware to showcase its capabilities.

3.1.2 Context Engineering

Building an AI Agent is largely an exercise in engineering the management of context. An Agent's context is usually composed of several elements [3]:

Instructions (indicating roles and responsibilities)

Examples (few - shot examples for in - context learning)

External Knowledge (business knowledge or facts injected through retrieval)

Memory (conversation history, user preferences, etc.)

Messages/Tool Results (user input, tool call results, etc.)

Tool Descriptions (specifications of the available tools)

How to efficiently assemble this rich information into prompts is a new engineering discipline.
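As a minimal sketch of this assembly step, the elements listed above can be concatenated in a fixed priority order under a budget. The section names and the word-based budget are simplifying assumptions; real systems count tokens and summarize or retrieve selectively rather than truncate.

```python
# Illustrative context assembler: earlier sections have higher priority,
# and lower-priority sections are trimmed once the budget is exhausted.

SECTION_ORDER = [
    "instructions", "examples", "knowledge",
    "memory", "tool_descriptions", "messages",
]

def assemble_context(parts, budget_words=1000):
    """Concatenate sections in priority order within a word budget."""
    out, used = [], 0
    for name in SECTION_ORDER:
        words = parts.get(name, "").split()
        if used + len(words) > budget_words:
            words = words[: budget_words - used]  # trim the overflow
        if words:
            out.append(f"## {name}\n" + " ".join(words))
            used += len(words)
    return "\n\n".join(out)
```

The fixed ordering also serves the caching point discussed next: stable sections such as instructions sit at the front, so the prompt prefix changes as little as possible between rounds.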

AI middleware can play a significant role here. On one hand, it can provide context templates and orchestration tools that automatically assemble the optimal prompt combination for different scenarios. On the other hand, it can cache and optimize context according to the characteristics of the model's attention mechanism. For example, the Manus project shared the lesson of keeping the prompt prefix as stable as possible so the KV-cache can accelerate inference, appending new content only incrementally in each interaction [4]. As shown in Figure 2, in multi-round conversations, if the instructions and prior conversation at the beginning of the context remain unchanged (the cache-hit part), the model only needs to compute attention for the newly added segments, significantly reducing per-step inference overhead.

Image source: https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus

In addition, the model's maximum context length must be considered. Although new techniques (such as the NSA sparse-attention mechanism proposed by DeepSeek) are pushing context windows toward the million-token level, the computational cost of attention means the context cannot grow without bound. We therefore also need context-compression strategies: summarizing long-term conversation history, indexing and referencing unchanged knowledge instead of embedding it in full, or introducing hierarchical memory so the Agent queries long-term memory on demand rather than loading it into the context every time.
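One such compression strategy can be sketched as follows: keep the most recent turns verbatim and collapse older history into a summary. The placeholder summarizer here stands in for what would be an LLM call in practice; the function and its defaults are illustrative assumptions.

```python
# Illustrative context-compression policy: recent turns stay verbatim,
# older turns are collapsed into a summary produced by `summarize`
# (a placeholder for an LLM-backed summarizer).

def compress_history(turns, keep_recent=4, summarize=None):
    """Return (summary, recent_turns); older turns are summarized away."""
    if len(turns) <= keep_recent:
        return "", list(turns)
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summarize = summarize or (lambda ts: f"[summary of {len(ts)} earlier turns]")
    return summarize(old), list(recent)
```

Note the interaction with the KV-cache lesson above: because only the oldest turns are rewritten, the compression point can be chosen so the stable prefix is invalidated as rarely as possible.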

3.1.3 Memory Management

Memory is a key aspect of human intelligence, and AI Agents need a similar memory module. AI middleware can provide convenient short-term and long-term memory functions:

Short-term memory refers to information the Agent retains within a single conversation or task, such as multi-round conversation content, the current set of objects in focus, and the results of tools already used.

Long-term memory is persistent memory across conversations and tasks, such as user preferences, business knowledge bases, and historical decision-making experience. For example, Anthropic's Claude Code uses the CLAUDE.md file as project memory, automatically loading it at the start of each conversation; it records the project's code structure, naming style, commonly used commands, and similar information. In this way, Claude Code always remembers the project's background knowledge and conventions when writing code, greatly improving how well it fits the project. Similarly, an Agent for customer service may need long-term memory of the user's identity, purchase history, and preferences, while
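The two memory tiers above can be sketched as a small module. The interfaces are illustrative; a production system would back long-term memory with a database or vector index rather than an in-process dict.

```python
# Illustrative two-tier agent memory: volatile short-term memory scoped
# to one conversation/task, plus a persistent long-term key-value store.

class AgentMemory:
    def __init__(self, long_term=None):
        self.short_term = []              # cleared per conversation/task
        self.long_term = long_term or {}  # persists across sessions

    def remember_turn(self, turn):
        """Record a conversation turn in short-term memory."""
        self.short_term.append(turn)

    def remember_fact(self, key, value):
        """Persist a durable fact, e.g. a user preference."""
        self.long_term[key] = value

    def reset_task(self):
        """End the task: short-term memory is dropped, long-term survives."""
        self.short_term = []
```

The key design point is the differing lifetimes: `reset_task` models the end of a conversation, after which only long-term facts remain available to future sessions.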