High-performance data foundation for AI Agents: Architecture and engineering practices
With the rapid development of large models, a new generation of AI-native applications driven by AI Agents is emerging and gaining real traction. AI-native applications are built on large models and interact with application data through various Agents to complete tasks intelligently. These applications are developed and iterated quickly, and they maintain data in multiple modalities at the same time. The access patterns and traffic of different data modalities vary greatly, which poses new challenges to the underlying data platform. What kind of data foundation will future AI Agent-driven native applications need?
At the QCon Global Software Development Conference (Beijing Station) hosted by InfoQ, Chen Liang, the founder and chief architect of Chenzhang Data, gave a speech titled "High-Performance Data Foundation for AI Agents: Architecture and Engineering Practices". He shared his thoughts on the data foundation architecture in the AI era, how to solve the data challenges of AI-native applications through this architecture, and the engineering practices of implementing a high-performance data foundation in the context of cloud computing and new hardware environments.
The following is the transcript of the speech (edited by InfoQ without changing the original meaning).
AI Agent-driven AI-native Applications
Today, AI Agents are leading a transformation of the entire software paradigm. Before the AI era, we talked about SaaS. At that time, software as a tool essentially constructed a workflow and helped people complete certain tasks within that workflow. When SaaS becomes AI-driven, a paradigm shift occurs. The software becomes more intelligent and turns into an intelligent agent that can perform very complex tasks and even has a certain ability to self-evolve and improve. From this perspective, it is no longer just a software tool that helps people, but an intelligent agent that can directly provide a service.
In the SaaS era, SaaS software has a workflow. Users provide an input, and the workflow helps them complete certain tasks. During the workflow, a lot of data and states are collected, and these states are recorded in a database, in many cases as structured data. A significant feature here is that, from day one, the data is generated by the software itself; in other words, the data is an output of the software. In such an architecture, people therefore have relatively simple expectations for the data.
From day one, the format of the data is defined by the software developers. Since it is software I wrote, I can define what attributes the data has, what format it is in, and whether it is a table or a graph. At the same time, the data is collected continuously as the software runs, which means the amount of data grows slowly with the scale of the software and user interaction. Generally speaking, it is controllable.
Of course, as the software becomes more complex and more data is collected, the format of the data may eventually become more complex, and more intelligent analysis is required. However, this process is relatively slow and develops as the software becomes more popular and the number of users increases. Maybe many software products are not blockbusters, and their data requirements are not that high.
What changes occur in the Agent era? First of all, in the Agent scenario, development no longer centers on a single workflow; it focuses more on the orchestration of Agents, and there may be many Agents. Of course, the core is the large model: how do we use the large model to drive the different Agents? Here is a very different point: when we start developing an application today, we need data. The data may come from a knowledge base or from external structured data. This data is the fuel for the Agent, just as a car needs fuel for a cold start, while the large model is more like the driving engine. Since the large model can only provide general knowledge, it is difficult for it to handle very domain-specific tasks on its own. Therefore, we need a lot of data, especially industry data. However, this data comes from the outside, so its format and scale may not be fully within our control.
The continuous interaction between the AI and users also generates more data, which in turn becomes part of the underlying data. We have been involved in many Agent development projects and found that they consider data feedback from the very beginning. In other words, we not only collect data but also use it to enrich our knowledge base. Ultimately, what we provide is a service. Users see an overall service; it is no longer a tool that helps people, and at this point it may become an intelligent agent in itself.
Let's take a specific example, a financial scenario.
There are four Agents here. One Agent is mainly responsible for market analysis, and another Agent focuses on risk control, etc. From a data perspective, in this app, we may need many different types of databases. We may need user information, which is generally stored in a table. We may have financial reports, which are often semi-structured data, and some structured information can be extracted from them. If the external knowledge base is large and includes a lot of logs, it may also be stored in MongoDB.
People are quite familiar with Pinecone and Elastic, because in the era of large models, text is very important. When we talk about text search, vector search and full-text search are in fact often done together, and a Ranker may also be needed. Of course, there may also be a knowledge base, and some knowledge bases are represented as graphs. When users provide continuous feedback, the Agent often needs conversation information, and the latency requirement there is very strict. Therefore, we need some in-memory databases.
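To make the "vector plus full-text plus Ranker" pattern concrete, here is a minimal, self-contained sketch of hybrid retrieval. A real system would call Pinecone/Elastic and a trained reranker; here the two retrievers and the fusion step are toy in-memory stand-ins, and the documents and embeddings are made up purely for illustration.

```python
# Toy hybrid retrieval: a vector search and a full-text search run separately,
# and a simple fusion step plays the role of the Ranker that merges both lists.
import math

DOCS = {
    "d1": "quarterly financial report with revenue and risk disclosures",
    "d2": "market analysis of interest rate movements",
    "d3": "user conversation log about portfolio rebalancing",
}
EMB = {"d1": [0.9, 0.1], "d2": [0.2, 0.8], "d3": [0.5, 0.5]}  # made-up embeddings

def vector_search(query_vec, k=3):
    def cos(a, b):
        return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
    return sorted(EMB, key=lambda d: cos(query_vec, EMB[d]), reverse=True)[:k]

def fulltext_search(query, k=3):
    terms = set(query.lower().split())
    return sorted(DOCS, key=lambda d: len(terms & set(DOCS[d].lower().split())), reverse=True)[:k]

def fuse(rankings, k=3):
    # Reciprocal rank fusion: the higher a document ranks in either list, the more it scores.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(fuse([vector_search([0.8, 0.2]), fulltext_search("financial risk report")]))
```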
So on the first day of building an Agent application, many types of databases may be involved. And you cannot control the scale of the external data. What if the scale is large? In this case, you need to choose a scalable option. At the same time, from a performance perspective, since some business operations need to interact with users, the latency requirement is naturally very strict. Therefore, there must be something purely in-memory.
The above figure shows a common workflow of an Agent today. It is generally a web service, and users log in. After logging in, you have user information, and at this time, you need to access some relational databases.
There is an important concept in an Agent, which we call the Agent Loop, or a cycle. Since the interaction is often not a one-shot process and requires continuous iteration, the Agent may call the large model to obtain information and also make many external calls. The external calls may come from web searches or from external computing services. Of course, there is also a very important method called RAG, which may involve retrieval over a full-text knowledge base.
There are also short-term and long-term memories with different latency requirements. Short-term memory may be stored in something in-memory, while large-scale long-term memory may be stored in a relatively persistent database. Therefore, different types of data require their corresponding databases, and new data will be continuously generated during the application interaction process.
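To make the loop concrete, here is a simplified sketch of such an Agent Loop. The large-model call, the external tools, the knowledge-base retrieval, and the two memory tiers (an in-memory short-term store and a persistent long-term store) are all hypothetical stand-ins; none of these names come from a specific framework.

```python
from dataclasses import dataclass, field

# Placeholders for the large-model call, external tools, and knowledge-base retrieval.
def retrieve_knowledge(query):            # RAG over the knowledge base
    return [{"role": "system", "content": f"background facts about: {query}"}]

def call_llm(context):                    # the large model decides the next step
    return {"type": "final_answer", "content": f"answer based on {len(context)} messages"}

def call_external_tool(action):           # web search, external computation, ...
    return f"tool result for {action}"

@dataclass
class Memory:
    short_term: list = field(default_factory=list)  # e.g. an in-memory store in production
    long_term: list = field(default_factory=list)   # e.g. a persistent document store

    def remember(self, item):
        self.short_term.append(item)
        if len(self.short_term) > 8:                # spill older context to long-term memory
            self.long_term.append(self.short_term.pop(0))

def agent_loop(user_input: str, memory: Memory, max_steps: int = 5) -> str:
    memory.remember({"role": "user", "content": user_input})
    for _ in range(max_steps):
        context = retrieve_knowledge(user_input) + memory.short_term  # RAG + recent turns
        action = call_llm(context)
        if action["type"] == "final_answer":
            memory.remember({"role": "assistant", "content": action["content"]})
            return action["content"]
        memory.remember({"role": "tool", "content": call_external_tool(action)})
    return "Stopped after max_steps without a final answer."

print(agent_loop("analyze portfolio risk", Memory()))
```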
To summarize briefly, the first difference between the Internet era and the Agent era from a data perspective is that in the former, the data is generated by the application, which means the data is controllable. However, in the AI era, it is different because a lot of external data may come in, which is not entirely within your control, and the scale may also be very large. In addition, there will be a large amount of unstructured data in the AI era. Therefore, almost all databases today need to develop vector capabilities because search is becoming increasingly important. The last point is that Agents interact with each other and with the outside world. When interacting, they need to record the content, so the amount of data will accumulate quickly.
Data Challenges Faced by AI-native Applications
From a system perspective, these characteristics of the AI era bring many challenges to database management. The first challenge is that we hope the database can support multiple modalities. The second is that when there are many databases, data synchronization and data consistency always need to be considered. For example, in a chat there may be short-term memory, which eventually needs to be converted into long-term memory; at the same time, the content output by the application also needs to be fed back into the original data models, resulting in a data cycle. Thirdly, different databases in the application have different requirements for performance, scale, and other attributes. Finally, there is the operation and management of multiple systems. In today's AI era, we can develop an application quickly: a team of three people may be able to build an app in six months. However, operating the app becomes a significant cost, because the data needs to be continuously accumulated, and that data is your core value.
In summary, applications driven by AI Agents face data challenges in the early stage that were only encountered by traditional large enterprises. At the same time, the data flywheel iterates more rapidly in the AI Agent era, increasing the pressure on the database system.
Multi-modal Data Foundation
In this context, our thinking is what we should do and whether there can be a unified data architecture to achieve this. The ultimate direction is a multi-modal data foundation.
We have three design goals. The first is to support multiple data modalities, since in the AI era an application may face multi-modal requirements from the start. We would like to emphasize two aspects in particular. First, we hope that the APIs are natively compatible: for example, a JSON API should at least be compatible with MongoDB, and a SQL API should be compatible with MySQL. This is very important because developers want their systems to be scalable and migratable. Sometimes they may want to deploy on the cloud, and sometimes they may want to deploy privately; private deployment may even come with many restrictions. Therefore, using standard APIs becomes crucial. Of course, I can define my own API, but if I want others to build apps on my platform, that carries great risk in the future. The second aspect I want to emphasize is performance. Performance and cost are always long-term considerations. I think this may be the most critical point and the most important thing for system developers. When users make a choice, if your system is slower than others, the argument that your value lies in multi-modality becomes very weak.
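As an illustration of what native API compatibility means for the application developer, the sketch below uses the standard MongoDB and MySQL client libraries unchanged and only changes the connection endpoint. The host name, credentials, and schema are hypothetical placeholders, not a real product endpoint.

```python
from pymongo import MongoClient
import pymysql

# JSON/document workload through the standard MongoDB wire protocol.
reports = MongoClient("mongodb://multi-modal-foundation.example.com:27017")["app"]["reports"]
reports.insert_one({"ticker": "ACME", "quarter": "2024Q4", "risk_flags": ["liquidity"]})

# Relational workload through the standard MySQL protocol, against the same foundation.
conn = pymysql.connect(host="multi-modal-foundation.example.com", port=3306,
                       user="app", password="...", database="app")
with conn.cursor() as cur:
    cur.execute("SELECT user_id, name FROM users WHERE user_id = %s", (42,))
    print(cur.fetchone())
```

The point is that moving between a cloud deployment and a private one is then only a change of connection string, not a rewrite against a proprietary API.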
The second design goal is dynamic scaling and automatic management, which is in line with the current trend of cloud-native technology.
The third goal is cross-modal access and consistency: data synchronization between modalities and consistent access. I don't want, for example, eight databases operating independently. Eliminating the barriers between databases is a very important part of multi-modality. I don't want a middleware or a proxy that merely connects all the databases and then provides a unified interface; that doesn't make much sense, because it doesn't reduce the management cost. At the same time, in some applications there is often cross-modal access, from long-term memory to short-term memory, and from the results back to the knowledge base.
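The following sketch illustrates the kind of cross-modal flow described here: promoting short-term conversation memory into long-term storage and feeding it back into the knowledge base so that future retrieval can see it. The three stores are toy dict-backed stand-ins for an in-memory store, a document store, and a vector index, and `embed` is a placeholder for a real embedding model.

```python
# Dict-backed stand-ins for an in-memory store, a document store, and a vector index.
short_term = {"sess-1": [{"role": "user", "content": "rebalance toward bonds"},
                         {"role": "assistant", "content": "suggested a 60/40 split"}]}
long_term = []        # stands in for a persistent document store
vector_index = {}     # stands in for the vector index used by future retrieval

def embed(text):
    # Placeholder for a real embedding model.
    return [len(text) % 7, len(text.split())]

def promote_memory(session_id):
    turns = short_term.pop(session_id, [])                     # read and clear short-term memory
    if not turns:
        return
    summary = " | ".join(t["content"] for t in turns)          # naive "summarization"
    long_term.append({"session": session_id, "summary": summary})
    vector_index[session_id] = embed(summary)                  # feed back into the knowledge base
    # Across independent systems these writes are not atomic; a unified foundation
    # can expose them as one consistent cross-modal operation.

promote_memory("sess-1")
print(long_term, vector_index)
```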
Before talking about the multi-modal data architecture, let me briefly review the evolution history of the database architecture.
Databases originally ran on a single machine. Later, database evolution branched into OLAP and OLTP. In the cloud era, the shared-nothing architecture of OLAP databases proved not very ideal, so a new approach called separation of storage and computing emerged. OLTP also started with a shared-nothing architecture, but its evolution was not so simple, because there is a significant difference between TP and AP. In AP, memory caching is not very important: the system must continuously scan large amounts of data, memory certainly cannot hold it all, and eventually it always has to scan the disk. Online TP is different. It requires millisecond-level latency, so the memory cache is very important and is the main means of ensuring low latency. If storage and computing are simply separated, computing and caching end up in different places, every access needs to go through the network, and low latency is difficult to guarantee. So the industry proposed the Aurora architecture, which moves the cache up so that computing and caching sit together, with storage underneath; we call this the shared-storage architecture. In this way, low latency can be guaranteed.
Our idea is that there should be a data foundation layer between computing and storage. In this foundation layer, what I emphasize most is caching. Caching is the most important means of ensuring the low latency of online databases. Moreover, for all modalities of online data, whether it is vector search, full-text search, or graph processing, the most important thing is caching. It is difficult to guarantee latency for anything that is not in the cache.
In our data foundation, there is computing on top and storage at the bottom. The computing engine can be replaced, and any type of engine can be used; we can be natively compatible with many open-source components. In the middle sits the foundation layer, which abstracts some of the most core functions of online databases, the most important of which is caching. At the same time, we hope to use a unified cache format to bridge, or eliminate, the barriers between different data modalities, so that data can be accessed efficiently in a unified or cross-modal manner.
In addition, in our design, caching and computing are logically decoupled but physically coupled. Physical coupling means that the cache is placed in the local memory of the machine as much as possible to reduce cross-network reads. Logical decoupling is used to eliminate the differences between different modalities and achieve the same performance as the native system through caching.
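A minimal sketch of this "physically coupled, logically decoupled" idea, under the assumption of a page-granular cache: the cache lives in the compute node's local memory and falls through to shared storage only on a miss. `SharedStorage` here is a stand-in for a network read against the shared storage layer, not an actual interface of the system.

```python
from collections import OrderedDict

class SharedStorage:
    """Stand-in for the shared storage layer; read_page would cross the network."""
    def __init__(self, pages):
        self._pages = pages
    def read_page(self, page_id):
        return self._pages[page_id]

class LocalPageCache:
    """LRU cache kept in the compute node's local memory (the physical coupling)."""
    def __init__(self, storage, capacity=1024):
        self._storage, self._capacity = storage, capacity
        self._lru = OrderedDict()                 # page_id -> page bytes

    def get(self, page_id):
        if page_id in self._lru:                  # hit: served from local memory
            self._lru.move_to_end(page_id)
            return self._lru[page_id]
        page = self._storage.read_page(page_id)   # miss: one round trip to shared storage
        self._lru[page_id] = page
        if len(self._lru) > self._capacity:
            self._lru.popitem(last=False)         # evict the least recently used page
        return page

cache = LocalPageCache(SharedStorage({"users/p0": b"row data ..."}), capacity=2)
print(cache.get("users/p0"))   # first access goes to shared storage
print(cache.get("users/p0"))   # second access is a local cache hit
```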
When it comes to caching, people may first think of a traditional hash table. However, since our foundation is designed to support different modalities, we need to support more complex data structures that change with the modality. Therefore, the cache design is not a simple key-value mapping; we want a design that can support more complex data structures.
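To illustrate why a plain hash table is not enough, the sketch below caches modality-specific structures behind one interface: a B-tree page for relational access, a vector segment for ANN search, and an adjacency list for graph traversal. The entry types are illustrative assumptions, not the actual design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

@dataclass
class BTreePage:            # relational / ordered access
    keys: List[int]
    children: List[str]

@dataclass
class VectorSegment:        # a shard of vectors for approximate nearest-neighbor search
    ids: List[str]
    vectors: List[List[float]]

@dataclass
class AdjacencyList:        # graph traversal
    vertex: str
    neighbors: List[str]

CacheEntry = Union[BTreePage, VectorSegment, AdjacencyList]

class MultiModalCache:
    """One cache interface whose frames hold modality-specific structures."""
    def __init__(self):
        self._entries: Dict[str, CacheEntry] = {}

    def put(self, key: str, entry: CacheEntry) -> None:
        self._entries[key] = entry

    def get(self, key: str) -> Optional[CacheEntry]:
        return self._entries.get(key)

cache = MultiModalCache()
cache.put("orders/btree/0", BTreePage(keys=[10, 20], children=["p1", "p2"]))
cache.put("docs/vec/0", VectorSegment(ids=["d1"], vectors=[[0.1, 0.9]]))
cache.put("kg/adj/alice", AdjacencyList(vertex="alice", neighbors=["bob"]))
print(cache.get("kg/adj/alice"))
```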