
The data catalog finally comes in handy because the AI Agent will read it.

王建峰 · 2026-04-30 17:44

The Most Expensive Shelf in the Building

Every large enterprise has a data catalog. Most were built between 2018 and 2024, during the wave of data democratization, self-service analytics, and the then-revolutionary idea that people should be able to find and understand the data their organizations generate.

The vision is extremely attractive: a searchable catalog of every dataset, table, and column in the organization, accompanied by rich descriptions, ownership information, data lineage, quality scores, and usage statistics. Any employee can discover existing data, understand what it means, and evaluate its credibility through a single interface. It would end the era of "tribal knowledge" and usher in an age of data literacy.

The reality is quite different. Organizations spent millions of dollars on catalog platforms, hired data governance teams to populate them, and ran internal promotion campaigns to drive adoption, only to watch usage stagnate. Catalog updates are sporadic, searches are rare, and the shrinking band of maintainers feel like curators of a museum nobody visits.

Survey after survey confirms the pattern. Data catalog adoption hovers between 10% and 25% of the target user base. Most searches are performed by the same small group of power users, usually data engineers who need lineage information, rather than the business analysts the catalog was designed to serve. In most organizations, the dream of self-service data discovery quietly faded, killed by human nature: people prefer asking colleagues to searching the catalog, and familiar datasets to documented but unfamiliar ones.

The catalogs themselves are not bad. They solve the wrong problem, or more precisely, the right problem for the wrong consumer. Humans, it turns out, are mediocre consumers of structured metadata. They skim descriptions, ignore data dictionaries, and rely on pattern matching and institutional memory rather than systematic discovery.

Artificial intelligence agents are the exact opposite. And this changes everything.

The Perfect Consumer Appears

Large language models, and the agent systems built on them, have a set of characteristics that make them ideal catalog consumers, characteristics that are almost the exact opposite of human behavior.

AI agents read exhaustively. When a human analyst searches the catalog for relevant datasets, they typically read the top three results, skim the descriptions, and pick the one with a suitable-sounding name. An AI agent, by contrast, reads every description of every candidate dataset, every column annotation, and every piece of attached metadata, and chooses based on complete information. Writing detailed column descriptions used to feel like a waste of time, because the human audience would never read them; when the audience is a large language model (LLM) that reads everything, the task becomes crucial.

AI agents have no experiential knowledge. An analyst with three years at the company knows that revenue_metrics_v2 is the authoritative revenue table and that revenue_final_BACKUP should be ignored even though its modification date is more recent. AI agents have none of this. They rely entirely on catalog metadata to distinguish authoritative sources from deprecated copies. The catalog is not merely a convenience for the agent; it is the entirety of the agent's understanding of the data landscape.

AI agents query programmatically. When an AI agent needs to answer a question about quarterly revenue by product line, it does not launch a BI tool and browse dashboards; it builds a query. To build the correct query, it needs to know which table contains revenue data, which column represents net recognized revenue versus gross transaction volume, which dimension table holds the product hierarchy, and how to join them. All of this information, if it exists anywhere, lives in the catalog. The agent's ability to generate a correct query is directly proportional to the quality of the catalog's metadata.
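
As a rough illustration of this dependency, here is a minimal sketch of an agent assembling a SQL query directly from catalog metadata. All table and column names (revenue_fact, product_dim, and so on) are invented for the example, and the catalog shape is a deliberate simplification:

```python
# Hypothetical catalog fragment: per-table column descriptions plus join rules.
CATALOG = {
    "revenue_fact": {
        "columns": {
            "rev_net_recognized": "Net recognized revenue in USD",
            "product_id": "Foreign key to product_dim.product_id",
            "quarter": "Fiscal quarter, e.g. 2025-Q3",
        },
        "joins": {"product_dim": "revenue_fact.product_id = product_dim.product_id"},
    },
    "product_dim": {
        "columns": {"product_id": "Primary key",
                    "product_line": "Top level of the product hierarchy"},
        "joins": {},
    },
}

def build_quarterly_revenue_query(catalog: dict) -> str:
    """Assemble a join query using only the catalog's relationship metadata."""
    join_cond = catalog["revenue_fact"]["joins"]["product_dim"]
    return (
        "SELECT product_dim.product_line, revenue_fact.quarter, "
        "SUM(revenue_fact.rev_net_recognized) AS revenue\n"
        f"FROM revenue_fact JOIN product_dim ON {join_cond}\n"
        "GROUP BY product_dim.product_line, revenue_fact.quarter"
    )

query = build_quarterly_revenue_query(CATALOG)
print(query)
```

If the catalog is missing the join rule or the column description, the function above has nothing to work with; that is the sense in which query quality is proportional to metadata quality.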

AI agents operate at scale. A human analyst might search the catalog a few times a week. A fleet of AI agents serving an organization's data questions might query it thousands of times a day, each time gathering context about tables, columns, relationships, and quality scores to shape its approach. The catalog shifts from an occasionally consulted reference document to a critical runtime dependency under continuous query load.

This changes the economics of data cataloging. For a decade, the value of a catalog was capped by human adoption rates, which stayed stubbornly low. With AI agents as the primary users, the value is limited only by the quality of the content: suddenly, every description, every annotation, every piece of ownership and lineage metadata has a measurable impact on the accuracy of AI-generated answers.

What AI Agents Really Need from the Catalog

Not all catalog metadata is equally valuable to AI agents. The features catalog vendors promote to human users, such as visual lineage diagrams, social features like "likes" and "follows", and fancy data quality trend dashboards, are mostly irrelevant to large language models. What agents actually need is specific, structured, and surprisingly mundane information.

Precise column-level descriptions. Not "Customer transaction amount", but "Total transaction amount in US dollars, including tax and before refunds, recorded at purchase authorization, not at settlement". The more precise the description, the less likely the agent is to misuse the column. This is exactly where a decade of underinvestment in catalog quality shows: most catalogs have table-level descriptions at best, and column-level descriptions are either missing or too generic to be useful.

Canonical identification. For any business concept, such as revenue, customer count, or churn rate, the catalog must clearly identify which table and column are the authoritative source. Otherwise the agent faces the same problem as a new hire: a warehouse with twelve tables that might contain "revenue" data and no way to know which one the Chief Financial Officer considers correct. Canonical flags or labels in the catalog eliminate this ambiguity.

Relationship and join metadata. Schema foreign keys capture structural relationships. Catalog metadata should capture semantic ones: these two tables join on customer_id, but table B must first be filtered to status = 'active' to avoid double-counting. These join conditions are exactly the experiential knowledge humans keep in their heads and never write down. For AI agents, writing them down is the only option.

Freshness and quality signals. An agent building queries needs to know not only what data exists but whether it is current and reliable. A catalog that exposes freshness metadata (last refresh time, expected update cadence, current quality score) lets the agent make informed decisions about which sources to trust and which to treat with caution.

Use cases and approved usage. The most mature catalog implementations include metadata about a dataset's intended use. "This table is the authoritative source for financial reporting" is very different from "This table is an experimental feature store used by the machine-learning team". An agent that understands these use cases can pick appropriate sources for the context, rather than simply choosing whichever table's column names look most relevant.
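
Pulling the five requirements above together, one possible (and entirely hypothetical) shape for a machine-readable catalog entry looks like this. The field names and example values are invented for illustration:

```python
# Sketch of a catalog entry carrying everything an agent needs:
# column descriptions, canonical flags, join rules, freshness, and approved uses.
from dataclasses import dataclass, field

@dataclass
class ColumnEntry:
    name: str
    description: str           # precise, column-level wording
    is_canonical: bool = False # marks the authoritative measure

@dataclass
class TableEntry:
    name: str
    description: str
    owner: str
    columns: list
    join_conditions: dict = field(default_factory=dict)  # table -> semantic join rule
    last_refreshed: str = ""                             # freshness signal
    quality_score: float = 0.0
    approved_uses: list = field(default_factory=list)

entry = TableEntry(
    name="revenue_metrics_v2",
    description="Authoritative revenue table for financial reporting",
    owner="finance-data@example.com",
    columns=[ColumnEntry(
        "txn_amount_usd",
        "Total transaction amount in USD, including tax, before refunds, "
        "recorded at authorization time, not settlement",
        is_canonical=True,
    )],
    join_conditions={
        "customers": "revenue_metrics_v2.customer_id = customers.customer_id "
                     "AND customers.status = 'active'"
    },
    last_refreshed="2026-04-30T02:00:00Z",
    quality_score=0.97,
    approved_uses=["financial reporting"],
)
```

Note how the semantic join qualifier (the status = 'active' filter) lives next to the structural join key, so an agent never has to guess it.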

The Feedback Loop That Changes Everything

This is where the story gets really interesting. AI agents do not just consume catalog metadata; they generate signals that improve it.

Every time an artificial intelligence agent queries the catalog, selects a data set, builds a query, and generates results, it produces rich feedback signals. Which data sets were considered, and which were selected? Which descriptions were sufficient for the agent to make a reliable choice, and which needed additional contextual information? In which cases did the agent's query produce incorrect results due to ambiguous or incomplete metadata?

This feedback loop is transformative because it solves the maintenance problem that doomed catalogs in the first place. Catalogs go stale because the cost of maintaining them falls on humans who receive almost no direct benefit. A data engineer writing column descriptions is doing unpaid work for hypothetical future readers who may never arrive.

With AI agents as consumers, the feedback loop is immediate and measurable. If a column description is missing, the agent misinterprets the column, produces a wrong answer, and generates an error signal traceable back to the missing metadata. The cost of poor-quality metadata is no longer hypothetical; it shows up as a quantifiable drop in AI accuracy.

This creates a virtuous cycle: AI usage uncovers metadata gaps; filling the gaps improves AI results; better results drive more usage, which uncovers more gaps. Catalog improvement comes not from an ambitious governance program but from the natural pressure of a tireless, demanding user that raises an alarm whenever the metadata is wrong.
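
The gap-discovery half of that loop needs only a little bookkeeping. Here is a toy sketch, with invented table and column names, of recording which lookups hit columns that lacked descriptions:

```python
# Count, per (table, column), how often an agent had to proceed without a
# description. The hottest gaps are the highest-value documentation work.
from collections import Counter

metadata_gaps = Counter()

def record_lookup(table: str, columns_used: dict) -> None:
    """columns_used maps column name -> its catalog description (None if missing)."""
    for col, description in columns_used.items():
        if not description:
            metadata_gaps[(table, col)] += 1  # traceable signal for catalog owners

record_lookup("orders", {"order_total": "Gross order value in USD", "st_cd": None})
record_lookup("orders", {"st_cd": None})

print(metadata_gaps.most_common(1))
```

A real system would emit these counts to the catalog platform itself, but the principle is the same: undocumented columns generate a measurable, attributable signal instead of silent wrong answers.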

Some organizations go a step further and use LLMs to help populate catalog metadata. Agents can read table schemas, inspect sample data, cross-reference existing documentation, and generate draft descriptions for human review and approval. The same technology that makes the catalog more valuable to its consumers also makes it cheaper to maintain for its contributors.

Rethinking the Catalog Architecture for AI Consumption

Most existing data catalogs were designed with a web user interface as the primary surface and a REST API as an afterthought. For AI-native consumption, that hierarchy must be inverted.

The catalog's API becomes the primary interface. It must support efficient lookups by business concept ("find the authoritative source of quarterly revenue"), by technical reference ("describe all columns in the order table"), and by relationship ("which tables can be joined with customer_profiles, and under what conditions"). These queries need to return structured, LLM-friendly responses, not HTML pages designed for human reading.

The response format matters. A catalog API that returns a prose description of a table is far less useful to an agent than one that returns structured metadata: column names, types, descriptions, update timestamps, quality scores, canonical flags, and join conditions, all in a parsable format. The agent needs to operate on the metadata, not just read it.
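
To make the contrast concrete, here is a hypothetical structured response (the field names and values are invented, not any vendor's actual API) and the kind of direct, parse-free use an agent can make of it:

```python
# A structured catalog response lets the agent act without natural-language parsing.
import json

api_response = json.loads("""
{
  "table": "orders",
  "columns": [
    {"name": "order_total", "type": "DECIMAL(12,2)",
     "description": "Gross order value in USD including tax",
     "is_canonical": true},
    {"name": "ord_amt_legacy", "type": "DECIMAL(12,2)",
     "description": "Deprecated pre-2023 order amount",
     "is_canonical": false}
  ],
  "last_refreshed": "2026-04-30T02:00:00Z",
  "quality_score": 0.95
}
""")

# Selecting the authoritative column is a filter, not a reading-comprehension task.
canonical = [c["name"] for c in api_response["columns"] if c["is_canonical"]]
print(canonical)
```

Had the API returned the same facts as a paragraph of prose, the agent would have to re-extract them with another LLM call, adding latency and a fresh opportunity for error.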

Latency matters. When a catalog lookup sits on the critical path of an agent building a query, which is increasingly the common case, a response time of a few hundred milliseconds is acceptable; one measured in seconds is not. This has infrastructure implications that many vendors have not fully absorbed.

Version control matters. When a catalog entry changes, say a column description is updated or the canonical source is reassigned, downstream AI agents need to know. Catalog changes should be versioned and published as events so that agents can invalidate cached metadata rather than operate on stale state. It is the familiar cache-invalidation problem from software systems, applied to metadata.
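
A version-aware cache on the agent side can be sketched in a few lines. The event shape here is hypothetical, assuming the catalog publishes a table name and a monotonically increasing version with each change:

```python
# Agent-side metadata cache that drops entries when the catalog publishes
# a newer version, forcing a fresh lookup instead of using stale state.
cache = {}

def cache_entry(table: str, version: int, metadata: dict) -> None:
    cache[table] = {"version": version, "metadata": metadata}

def on_catalog_event(event: dict) -> None:
    """Invalidate the cached entry if the catalog published a newer version."""
    cached = cache.get(event["table"])
    if cached and event["version"] > cached["version"]:
        del cache[event["table"]]  # next lookup goes back to the catalog API

cache_entry("orders", 3, {"description": "Order fact table"})
on_catalog_event({"table": "orders", "version": 4})
print("orders" in cache)
```

The comparison against the cached version makes invalidation idempotent: replayed or out-of-order events older than what the agent already holds are safely ignored.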

The Unsettling Audit

If your organization has a data catalog, it is time to audit it from a new perspective: that of an AI agent that takes every description literally and has access to no tribal knowledge whatsoever.

The audit should answer the following five questions:

What percentage of the tables used by AI systems have column-level descriptions? In most organizations, the number is surprisingly low. For any column without a description, an AI agent interprets it from its name alone, which is how it ends up summing cust_txn_amt_usd when the question called for rev_net_recognized.

Are authoritative sources clearly marked? For each key business concept, such as revenue, customer count, churn rate, Annual Recurring Revenue (ARR), or Net Promoter Score (NPS), can the catalog point unambiguously to the authoritative source? If the answer requires a human to say "it depends", the catalog is incomplete.

Are join conditions recorded? Not only foreign-key relationships but the semantic qualifiers: which filters, which conditions, which edge cases. This is usually the biggest gap, and the one with the greatest impact on query correctness.

Is freshness metadata available and accurate? Can the agent determine whether a dataset is current before using it? Stale freshness metadata is worse than none, because it creates false confidence.

Is the catalog API fast and well structured? Can AI agents query the catalog programmatically, receive structured responses, and fold them into their reasoning in real time? If the catalog is reachable only through a web user interface, it is inaccessible to its most important user.
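
The first audit question is easy to answer mechanically. A minimal sketch, assuming the catalog can be dumped as a mapping of tables to column descriptions (the shape and example data here are invented):

```python
# Compute column-description coverage across the catalog dump: the fraction
# of columns whose description field is non-empty.
catalog = {
    "orders":    {"order_total": "Gross order value in USD", "st_cd": ""},
    "customers": {"customer_id": "Primary key", "seg": ""},
}

described = sum(1 for cols in catalog.values() for d in cols.values() if d)
total = sum(len(cols) for cols in catalog.values())
coverage = described / total
print(f"Column description coverage: {coverage:.0%}")
```

Restricting the dump to the tables AI systems actually touch, rather than the whole warehouse, turns this single number into a direct proxy for AI answer quality.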

The Return on a Decade-Long Investment

There is a deep irony here. For a decade, data governance teams curated catalogs nobody visited, wrote descriptions nobody read, and maintained lineage nobody followed. They were told repeatedly that catalogs were critical for governance and compliance, while actual adoption said otherwise.

Now every word they wrote matters. Every column description is read, not by human analysts but by AI agents that interpret it literally and use it to generate the queries behind business decisions. Every canonical label, every relationship annotation, and every quality score is relied upon by systems that cannot function without them.

As AI adoption takes hold, organizations that invested steadily in catalog quality find they hold a strategic advantage. Richer metadata means their AI agents produce more accurate results. A catalog that is actively used rather than passively maintained means stronger data governance. And because agents can discover and understand data without human mediation, new AI use cases reach value faster.

Organizations that let their catalogs atrophy are now scrambling to fill metadata gaps before connecting AI agents to their data, having learned the hard way that an AI agent pointed at a poorly documented warehouse is not a productivity tool but a liability.

Investing in a data catalog was never the wrong investment. It was an investment ahead of its time, waiting for the consumer that would finally make it indispensable.

This article is from the WeChat official account "Data-Driven Intelligence" (ID: Data_0101). Author: Xiaoxiao. Republished by 36Kr with permission.