How to prepare your data for artificial intelligence
Agentic AI is disrupting the big data paradigm. It requires us to bring data to specialized intelligent compute, rather than the other way around. This shift fundamentally changes how we think about data modeling and storage, because large language models (LLMs) can learn in context from datasets far smaller than those used in traditional machine learning. The ever-expanding context windows and tool-calling capabilities of modern AI are rapidly rendering many traditional ETL/ELT processes obsolete, forcing data engineers to rethink their entire approach.
What is driving this disruption?
One of the reasons for this shift is the changing way people use data.
Enterprise applications and dashboards are built by software engineers and data scientists to meet the needs of non-technical users. Business analysts and end users, in turn, passively consume that content. Applications may offer some built-in interactivity, but those interactions follow rigid, pre-set workflows. As data engineers, our job has been to supply data in formats those applications can use.
The shift is from a builder-centered model, in which technical users create applications, to an interactor-centered model, in which non-technical users work with data directly through AI agents.
More and more non-technical users are interacting with data directly, and they can build applications and tools to suit their own needs. Existing SaaS applications are no longer limited to bolting on a side-panel chat interface; frameworks like CopilotKit let them embed natural-language interaction natively. Forward-looking developers are not simply reproducing rigid workflows but embedding AI agents into applications, giving those agents access to backend APIs through tool calls.
Second, the center of gravity is shifting. In the past, data volumes were so large that compute had to be deployed where the data lived, to avoid large-scale data movement. Today, frontier AI models (LLMs) are the center of gravity, and AI applications are built around them.
With the center of gravity moved, the architecture has flipped as well. Rather than provisioning custom compute to process data where it sits, agentic AI applications use large language models (LLMs) as reasoning engines that understand user intent, plan tasks, and invoke tools to act. This new wave of applications aims to translate user intent directly into action.
How do these two dynamics affect the work of data engineers? Keep the following five principles in mind when preparing data for AI.
1. Rethink ETL/ELT: From normalization to context
Today, data engineers invest a lot of effort in data normalization, creating clear data schemas, and building transformation pipelines. The goal is to enable downstream analysts and applications to understand the data.
This doesn't mean that ETL/ELT becomes irrelevant. Providing data is still crucial. But you can rely on agents to interpret schemas, understand relationships, and process data in various formats without extensive preprocessing.
However, simply bolting a data catalog and an MCP server onto your existing tables squanders the capabilities of agentic technology, and it piles extra work onto the AI agents. Why?
AI agents can understand data in context. They don't need everything pre-normalized into rigid schemas. In fact, as the number of tables grows, today's agents struggle to interpret the data correctly and write the SQL needed to join it all. And as data is sliced into more pieces, the probability of conflicts and ambiguity rises. For example, two tables may both have a "loan amount" column: in one it may mean the amount the borrower applied for, in the other the principal the lender actually disbursed. The more processed, normalized, and scattered the data structures are, the harder it is to convey context.
Maintain your data-availability workflow, but question whether each normalization step is still necessary. Can the agent understand the data in context without the transformation? Can the principal be traced back to the original term sheet or financing memorandum, which explains what payments the holder receives in each installment, rather than being reduced to a single number?
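As a concrete illustration of carrying context alongside structure, a data engineer might publish column-level semantics with each table instead of relying on normalized column names alone. This is a minimal sketch under stated assumptions: the table names, column descriptions, and the `describe_table` helper are all hypothetical.

```python
# Sketch: publish column-level semantics alongside tables so an agent can
# disambiguate columns like "loan_amount" without guessing.
# Table and column names here are hypothetical examples.

TABLE_CONTEXT = {
    "loan_applications": {
        "loan_amount": "Amount the borrower applied for (requested principal).",
        "applicant_id": "Foreign key to the borrowers table.",
    },
    "disbursements": {
        "loan_amount": "Principal actually disbursed by the lender.",
        "disbursed_at": "Timestamp of the wire transfer.",
    },
}

def describe_table(table: str) -> str:
    """Render one table's column semantics as plain text for an agent's context."""
    cols = TABLE_CONTEXT[table]
    lines = [f"Table `{table}`:"]
    lines += [f"  - {col}: {meaning}" for col, meaning in cols.items()]
    return "\n".join(lines)
```

Handing `describe_table("disbursements")` to an agent alongside the raw rows resolves exactly the "loan amount" ambiguity described above, without any schema change.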
Avoid the temptation to expose only unstructured data to AI agents. PDFs and emails are easy to process, but the truly actionable data in an organization is usually still structured.
2. Prioritize data curation over data collection
In-context learning makes curation more important than collection.
In the era of big data, the goal was to collect as much data as possible because you wanted to train machine learning models on extremely large datasets. More data meant better machine learning models.
AI agents, however, are built on in-context learning, which supplies one or a few examples in the prompt. Large language models (LLMs) can effectively imitate those examples, whether that means following a process (chain of thought) or a format and style (few-shot learning). With in-context learning, the quality of examples matters more than their quantity.
The example data you show an agent shapes its understanding of all similar data. You may build a library of examples and select which ones to use for each type of user query. As data curation grows in importance, building the following tools becomes a crucial part of the data engineer's job:
• Find the highest-quality data, such as complete, accurate, and representative data samples.
• Regularly update these examples as the standards evolve.
• Verify that the carefully curated data can indeed serve as effective examples for agent learning.
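A sketch of what such a curation tool might look like: a small example library keyed by query type, from which the best few-shot examples are pulled into the prompt. The library contents, category labels, and helper names are invented for illustration, not taken from any real system.

```python
# Hedged sketch: a curated few-shot example library, keyed by query type.
# Example texts and category names are illustrative assumptions.

EXAMPLE_LIBRARY = {
    "loan_summary": [
        {"query": "Summarize loan #1042",
         "answer": "Principal $250k, 5.1% fixed, 30-year term."},
    ],
    "payment_schedule": [
        {"query": "When is the next payment on loan #1042 due?",
         "answer": "The next installment of $1,358 is due 2025-07-01."},
    ],
}

def build_prompt(query_type: str, user_query: str, k: int = 2) -> str:
    """Assemble a few-shot prompt from up to k curated examples."""
    examples = EXAMPLE_LIBRARY.get(query_type, [])[:k]
    parts = [f"Q: {ex['query']}\nA: {ex['answer']}" for ex in examples]
    parts.append(f"Q: {user_query}\nA:")  # the live query comes last
    return "\n\n".join(parts)
```

The value here is in the library itself: keeping it small, accurate, and refreshed as standards evolve is exactly the curation work the bullets above describe.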
As a data engineer, one of the key roles you will need to support is the data steward. The types of storage you support will also change, expanding to include graph databases and vector databases.
3. Build agent-oriented infrastructure: Perception and action
Artificial intelligence agents need infrastructure that supports two core capabilities: perceiving data and taking action based on the data.
Not all data formats are equally accessible to language-model-based agents. Consider how easily an agent can parse, understand, and extract meaning from each format. Formats that preserve semantic meaning and require minimal preprocessing reduce friction.
AI agents perform operations by invoking tools (including functions, APIs, and services), which enable them to process data. Your infrastructure needs to ensure that agents can discover and use these tools. This means clear interfaces, comprehensive documentation, and reliable execution.
Review your data access patterns and tools from an AI agent's perspective. What would an autonomous system need to know to use them effectively? Where are the friction points that make them hard to operate?
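To make "clear interfaces, comprehensive documentation" concrete, here is a sketch of a tool definition in the JSON-Schema style that common function-calling APIs use. The specific tool (`get_loan_balance`), its parameters, and the stubbed return values are hypothetical assumptions.

```python
# Sketch of an agent-discoverable tool definition. The schema shape follows
# the JSON-Schema convention used by common function-calling APIs; the tool
# itself (get_loan_balance) and its fields are hypothetical.

GET_LOAN_BALANCE_TOOL = {
    "name": "get_loan_balance",
    "description": "Return the outstanding principal balance for a loan. "
                   "Use when the user asks how much is still owed.",
    "parameters": {
        "type": "object",
        "properties": {
            "loan_id": {"type": "string",
                        "description": "Internal loan identifier."},
            "as_of": {"type": "string",
                      "description": "ISO-8601 date; defaults to today."},
        },
        "required": ["loan_id"],
    },
}

def get_loan_balance(loan_id: str, as_of: str = "today") -> dict:
    """Reliable execution: validate inputs and return a structured result."""
    if not loan_id:
        raise ValueError("loan_id is required")
    # Stubbed lookup; a real implementation would query the loan ledger.
    return {"loan_id": loan_id, "outstanding_principal": 187_500.00, "as_of": as_of}
```

The description fields are not decoration: they are the documentation the agent actually reads when deciding whether and how to call the tool.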
4. Manage agent artifacts as first-class data
AI agents don't just consume data; they generate it. In fact, you may find that AI-generated content far exceeds the amount of "original" data in your systems.
When agents generate outputs, make decisions, write code, or record their reasoning processes, these also become data.
Whether the content is created manually, collected from software systems, or generated by artificial intelligence agents, it must comply with the general specifications and regulations of your industry. In addition to compliance, the data generated by these agents is also valuable for debugging, auditing, training future agents, and understanding system behavior.
The data generated by agents should be treated as rigorously as any other data:
• Store agent outputs systematically
• Retain decision logs and reasoning traces
• Manage the code generated by agents as versioned artifacts
• Ensure that this data is available for analysis and future training
These artifacts will become part of your data ecosystem. Design storage and access patterns accordingly.
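A minimal sketch of treating agent outputs as first-class records: each action is appended as a structured, timestamped, schema-versioned log entry. The field names are an assumption for illustration, not a standard schema.

```python
import json
import time
import uuid

# Sketch: persist agent decisions and reasoning traces as structured,
# versioned records. Field names here are illustrative assumptions.

def record_artifact(agent_id: str, kind: str, content: str, reasoning: str) -> dict:
    """Build one append-only artifact record for an agent action."""
    return {
        "artifact_id": str(uuid.uuid4()),
        "agent_id": agent_id,
        "kind": kind,                 # e.g. "decision", "generated_code"
        "content": content,
        "reasoning_trace": reasoning,
        "created_at": time.time(),
        "schema_version": 1,          # version the record shape itself
    }

def append_to_log(path: str, record: dict) -> None:
    """Append the record as one JSON line, so it stays queryable later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Append-only JSON lines is a deliberately boring choice: it keeps the artifacts available for debugging, auditing, and future training without committing to a schema prematurely.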
5. Connect observation and training
The fastest way to improve agent performance is to implement a closed loop between observability and training. The artificial intelligence agent infrastructure needs a two-way pipeline to connect model performance and observability with continuous training.
First, you need an observability platform that tracks data quality metrics. In particular, it should detect data drift (changes in the distribution of input features) and concept drift (changes in the relationship between inputs and outputs). It must also monitor key model performance metrics such as accuracy, latency, and hallucination rate. Establish automatic triggers tied to predefined thresholds.
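As a sketch of a threshold-based drift trigger, here is a toy detector that compares a feature's recent mean against its training baseline. A production system would use a proper statistical test (a KS test or PSI, for example), and the threshold value here is an arbitrary assumption.

```python
import statistics

# Toy drift check: flag when a feature's recent mean drifts more than
# `threshold` standard deviations from the training baseline. A real
# system would use a proper test (KS test, PSI); 3.0 is an assumption.

def drift_detected(baseline: list, recent: list, threshold: float = 3.0) -> bool:
    """Return True when recent data has drifted beyond the threshold."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return statistics.mean(recent) != mu
    z = abs(statistics.mean(recent) - mu) / sigma
    return z > threshold

def maybe_trigger_retraining(baseline: list, recent: list) -> str:
    """The automatic trigger tied to the predefined threshold."""
    return "retrain" if drift_detected(baseline, recent) else "ok"
```

The same trigger pattern generalizes to concept drift and model metrics: compute a statistic per monitoring window, compare against a threshold, and emit an event the retraining pipeline can subscribe to.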
Your observability platform also needs to capture human feedback: every edit a user makes to generated content should be recorded and used to improve the model.
Second, you need a retraining process that activates automatically when the monitoring system fires an event. It should be fully automated: pulling the latest version of the training data, launching retraining or fine-tuning jobs, and running comprehensive evaluations and regression tests on the new model. In the agent era, building this closed loop from performance monitoring to automated deployment is crucial for ML and data engineers, and the boundary between the two roles will increasingly blur.
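The retraining steps just described can be sketched as a simple pipeline of stages gated by regression tests. Every function body below is a placeholder standing in for a real job in your orchestrator; the stage names, model id, and metrics are hypothetical.

```python
# Hedged sketch of the automated retraining loop: pull fresh data, retrain,
# evaluate, and gate deployment on regression tests. Every function body is
# a placeholder for a real job; names and values are hypothetical.

def extract_training_data(version: str) -> list:
    return [("example_input", "example_label")]   # placeholder dataset

def retrain_model(data: list) -> str:
    return "model-v2"                              # placeholder model id

def evaluate(model_id: str) -> dict:
    return {"accuracy": 0.93, "regression_passed": True}  # placeholder metrics

def retraining_pipeline(data_version: str) -> str:
    """Run the loop end to end; deploy only if regression tests pass."""
    data = extract_training_data(data_version)
    model_id = retrain_model(data)
    metrics = evaluate(model_id)
    return f"deployed {model_id}" if metrics["regression_passed"] else "rolled back"
```

The important design choice is the gate: nothing reaches deployment unless the evaluation stage passes, which is what makes full automation safe to run unattended.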
How does the role of data engineers change?
These five shifts share a common theme: moving from rigid, pre-set workflows to flexible, context-aware architectures. The tool-calling and reflection capabilities of modern agents make rigid ETL/ELT pipelines less important, and in-context learning makes example curation more valuable than exhaustive collection.
The importance of data engineering hasn't decreased; it has changed. The skills honed building data infrastructure over the past decade are still valuable, but they must be applied to different goals. Rather than designing every workflow in advance, we need to create environments in which agents can design workflows themselves.
This article is from the WeChat official account "Data-Driven Intelligence" (ID: Data_0101), author: Xiaoxiao. Republished by 36Kr with permission.