
The AI Stack That No One Talks About: Data Collection as Infrastructure

王建峰 · 2025-08-07 15:21

The AI community's obsession with increasingly large models, billion-token context windows, and fine-tuning runs on GPUs is tiring. Meanwhile, the most overlooked force multiplier in the AI stack lies quietly one layer below all this: data.

Let's be clear: while scaling model size still matters, for most real-world AI products, performance gains increasingly come from data quality and freshness, not just parameter count. Doubling model size to squeeze out marginal gains is not only costly but environmentally unsustainable, because the staggering electricity and water bills simply don't scale.

The bottleneck has moved down the stack.

Founders and CTOs building AI-native products are starting to realize that their agents don't miss emerging market signals or produce hollow insights because the model is "not smart enough"; they fail because they blindly process outdated, irrelevant, or incomplete context. That's why Salesforce spent $8 billion in May 2025 to acquire Informatica to strengthen its AI-driven Agentforce platform: with access to high-quality, real-time data, Agentforce can deliver more accurate, more scalable outcomes.

Performance now hinges on what you can retrieve, not just how you prompt. Unless you're running an H100 cluster or a cutting-edge model with an unlimited API budget, your best chance to outperform the giants is to feed your model smarter data within your budget: domain-specific, structured, deduplicated, and fresh.

But before context can be built, it has to exist. That means reliable, real-time access to the open web: not one-off scrapes or static datasets, but a robust pipeline that reflects the world as it is right now.

Folks, this is infrastructure. If compute made NVIDIA indispensable, then I believe the next major breakthrough isn't more layers but more signal and less noise. And it starts with treating data collection as production infrastructure.

What does "good data" look like?

If you're building an AI-native product, the intelligence of the system will no longer depend on how clever your prompts are or how many tokens you can cram into the context window. Instead, it depends on how well you can provide it with the context that matters right now.

But "good data" is a vague phrase, so let's pin it down. For AI, it means:

Domain-specific: AI-assisted optimization of retail pricing requires competitor data, customer reviews, or regional trends, not irrelevant noise. You must be precise.

Continuously updated: The web changes in the blink of an eye. A sentiment model that misses today's trending topic on X, or a supply-chain model running on last week's prices, is already out of date.

Structured and deduplicated: Duplicates, inconsistencies, and noise waste computation and dilute the signal. Structure beats scale. Cleanliness beats bigness.

Real-time actionable: Outdated data is dead data. Real-time data (price changes, news, inventory shifts) can support immediate decision-making, but only if the collection behind it is ethical, reliable, and scalable. A minimal filtering sketch follows this list.
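To make the last two points concrete, here's a minimal Python sketch. It assumes a hypothetical record shape (`url`, `text`, `fetched_at`) and an arbitrary 24-hour freshness window; it deduplicates collected records by content hash and drops anything stale before it ever reaches a model.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def clean_batch(records, max_age_hours=24):
    """Drop stale records and deduplicate by content hash.

    Assumes each record is a dict with 'url', 'text', and a timezone-aware
    'fetched_at' datetime; the schema and cutoff are illustrative only.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    seen = set()
    cleaned = []
    for rec in records:
        if rec["fetched_at"] < cutoff:
            continue  # outdated data is dead data
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicates dilute the signal and waste computation
        seen.add(digest)
        cleaned.append(rec)
    return cleaned
```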

That's why Salesforce acquired Informatica — not for a new model but to provide Agentforce with structured real-time data to improve downstream decision-making.

That's also why IBM spent $2.3 billion in July 2024 to acquire StreamSets for Watsonx. StreamSets specializes in extracting data from hybrid sources, monitoring data streams, and handling schema drift, which lets IBM feed Watsonx fresh, consistent signals across enterprise systems. For AI that needs to reason over current state (not just historical patterns), this kind of infrastructure can bring a 10-fold efficiency boost.

This is also why Dataweps switched to Bright Data to collect real-time competitor pricing and market trends for e-commerce customers like Philips and Asus. Their AI-driven pricing and bidding systems rely on fast and accurate data, and Bright Data's API-driven ecosystem (including proxies, archives/datasets, browser automation tools for AI agents, etc.) allows them to collect this data reliably and at scale. Bright Data is not just about data scraping; it provides the resilience, capacity, and compliance required by real-world AI systems. Frankly, it's an AI infrastructure provider.

The key point: retrieval quality now trumps prompt engineering. Even the best prompt can't fix a model that pulls outdated or irrelevant data at reasoning time.

And right now the environment favors it: in the post-DeepSeek era, this is what decides whether an AI product survives.

The first step is always the hardest

At first glance, data infrastructure sounds like plumbing. Collection, transformation, storage? It sounds incredibly boring. But in the era of RAG and agentic AI, this plumbing has become crucial. Why? Because your system is no longer just running inference; it reasons over external, constantly changing, multimodal, real-time information. That changes everything.

This is how I see it: the modern AI data stack has evolved into a full-fledged value chain, from acquisition and extraction, to transformation and enrichment, to organization and ranking, and finally to storage and delivery to the right consumer, whether that's a model, an agent, or a human. Each layer brings real challenges and real consequences. Unlike a traditional ETL pipeline, it's not just about landing data in a data lake and leaving it there.

Most teams mess up at the first step: collection. Poor data extraction ruins context. If your collection layer misses key updates, fails silently in edge cases, or captures information in the wrong structure or language, then your entire stack will inherit this blindness.
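One practical way to keep the collection layer from failing silently is to validate every extracted record against the schema you expect and log loudly when something drifts. A minimal sketch, assuming a hypothetical set of required fields; the field names are illustrative, not any particular vendor's format.

```python
import logging

logger = logging.getLogger("ingestion")

# Assumed schema for illustration; adapt to whatever your extraction layer emits.
REQUIRED_FIELDS = {"url", "title", "body", "fetched_at", "language"}

def validate_record(record: dict) -> bool:
    """Return True if the record matches the expected shape; log loudly otherwise."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        logger.error("Schema drift: %s is missing fields %s",
                     record.get("url", "<unknown>"), sorted(missing))
        return False
    if not record["body"].strip():
        logger.error("Empty extraction for %s (blocked page or layout change?)",
                     record["url"])
        return False
    return True
```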

In other words: you can't engineer context you never ingested. There's an interesting survey, "Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models" by Zhang et al., which argues that in production-grade systems, unresolved ingestion issues are among the most common causes of "model hallucinations" and other abnormal agent behavior.

So in the era of RAG and agentic AI, there's no doubt that ingestion has to be strategic:

It must be AI-agent friendly, meaning it can serve structured data on demand.

It must handle dynamic UIs, CAPTCHAs, changing schemas, and hybrid extraction (API + scraping).

Multi-step AI agents need both real-time signals and historical memory: what's happening now, what happened before, in what order, and why. The infrastructure must therefore support scheduled extraction, incremental updates, and TTL-aware routing (see the sketch after this list), all with resilience, compliance, and readiness for change.

It must be reliable at scale, continuously delivering the latest information from millions of sources.

And it must comply with website terms and legal regulations.
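Here's what TTL-aware scheduling might look like in its simplest form: a hypothetical source registry where each source declares how long its data stays fresh, and only expired sources get queued for re-extraction. The source names and TTL values are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical registry: each source declares how long its data stays fresh.
SOURCES = {
    "competitor_prices": {"ttl": timedelta(minutes=15), "last_fetched": None},
    "product_reviews":   {"ttl": timedelta(hours=6),    "last_fetched": None},
    "regulatory_pages":  {"ttl": timedelta(days=7),     "last_fetched": None},
}

def due_for_refresh(now=None):
    """Return the sources whose TTL has expired and should be re-extracted now."""
    now = now or datetime.now(timezone.utc)
    return [
        name for name, meta in SOURCES.items()
        if meta["last_fetched"] is None or now - meta["last_fetched"] >= meta["ttl"]
    ]

def mark_fetched(name, when=None):
    """Record a successful (incremental) extraction for a source."""
    SOURCES[name]["last_fetched"] = when or datetime.now(timezone.utc)
```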

That's why fragile scraping tools, static datasets, and one-time connectors are no longer good enough, and why platforms like Bright Data, which focus on automation-friendly, agent-first data infrastructure, are becoming as fundamental as the models themselves.

I've seen open-weight models like Gemma 3 outperform GPT-4 in narrow domains simply because they were wired into better retrieval systems fed with fresh, curated, domain-specific data.

Let's do the math. Suppose we define the total utility of the retrieved context fragments as:

$$U = \sum_{i=1}^{k} R_i \cdot F_i$$

Where:

$R_i \in [0, 1]$ is the relevance score of the $i$-th retrieved fragment with respect to the query.

$F_i \in [0, 1]$ is the freshness score, modeled as a time-decaying function (e.g., exponential or linear).

$k$ is the number of retrieved context blocks, constrained by the model's context window.

Even assuming perfect semantic search (i.e., $R_i$ is already optimized), maximizing $U$ may mean discarding highly relevant but outdated data in favor of slightly less relevant (but up-to-date!) signals. If your extraction layer can't keep up, you lose visibility and utility drops. The two effects compound: you don't just miss fresh content; the stale content you do retrieve actively degrades performance. The quality of the retrieved context declines on both fronts.
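Here's a small worked example of that utility, assuming an exponential freshness decay $F_i = e^{-\Delta t_i / \tau}$ with an arbitrary time constant of 24 hours; the relevance scores and ages are made up purely to show the trade-off.

```python
import math

def freshness(age_hours, tau_hours=24.0):
    """Exponential time decay: F = exp(-age / tau), always in (0, 1]."""
    return math.exp(-age_hours / tau_hours)

def utility(chunks):
    """U = sum over retrieved chunks of R_i * F_i."""
    return sum(c["relevance"] * freshness(c["age_hours"]) for c in chunks)

# Made-up numbers: a highly relevant but week-old chunk versus a slightly
# less relevant chunk fetched an hour ago.
stale_but_relevant = {"relevance": 0.95, "age_hours": 168}
fresh_but_weaker = {"relevance": 0.80, "age_hours": 1}

print(utility([stale_but_relevant]))  # ≈ 0.0009
print(utility([fresh_but_weaker]))    # ≈ 0.767
```

With these numbers, the week-old chunk contributes almost nothing to $U$ despite its higher relevance, which is exactly the argument for letting freshness outweigh marginal relevance gains.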

That's why data collection (scheduled updates, TTL-aware crawling, SERP extraction, feed parsing, and more) is no longer just plumbing; it's infrastructure.

What exactly does data collection infrastructure look like?

So, what does it really mean to treat data collection as first-class infrastructure?

It means:

Build circular pipelines, not one-off loads. Data shouldn't be scraped once and archived. It should be streamed, refreshed, and updated on a schedule, with built-in automation, version control, retry logic, and traceability. One-time dumps can't provide lasting intelligence.

Incorporate freshness into retrieval logic. Data ages. Your ranking and retrieval systems should treat time drift as the primary signal — prioritizing context that reflects the current state of the world.

Use infrastructure-grade sources. Scraping raw HTML with homemade scripts doesn't scale. You need access layers that provide SLAs, resilience to CAPTCHAs, schema-drift handling, retries, proxy orchestration, and compliance support.

Multi-modal collection. Valuable signals exist in PDFs, dashboards, videos, tables, screenshots, and embedded components. If your system can only extract data from pure HTML or Markdown, you're missing half the information.

Build an event-native data collection architecture. Kafka, Redpanda, Materialize, and time-series databases — these aren't just for backend infrastructure teams. In AI-native systems, they'll become the nervous system for collecting and replaying time-sensitive signals.

In short, stop treating data as a static resource. Treat it as a computational resource, one that needs to be orchestrated, abstracted, scaled, and protected. That's what "data collection as infrastructure" really means; a minimal sketch of the idea follows.
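Pulling these points together, here is a hedged sketch of a "circular" pipeline: a refresh loop with retries, exponential backoff, and versioned, timestamped records so every update stays traceable. The `fetch` callable, source list, and interval are placeholders, not any specific vendor's API.

```python
import time

def refresh_loop(sources, fetch, store, interval_seconds=900, max_retries=3):
    """Continuously re-fetch sources instead of treating collection as a one-time dump.

    `fetch(source)` is a placeholder for your own extraction call; `store` maps
    each source to a list of versioned, timestamped snapshots for traceability.
    """
    while True:
        for source in sources:
            for attempt in range(1, max_retries + 1):
                try:
                    payload = fetch(source)
                    versions = store.setdefault(source, [])
                    versions.append({
                        "version": len(versions) + 1,
                        "fetched_at": time.time(),
                        "payload": payload,
                    })
                    break
                except Exception as exc:
                    if attempt == max_retries:
                        print(f"giving up on {source} this cycle: {exc}")
                    else:
                        time.sleep(2 ** attempt)  # simple exponential backoff
        time.sleep(interval_seconds)
```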

The future is information > scale

Most RAG discussions stay at the model level. But in the emerging AI stack, models are interchangeable, and data infrastructure is the long-term moat.

Moore's Law may be dead, but raw performance is still improving steadily. Even so, I'm not convinced that the performance of near-future AI systems will come down to fine-tuning or prompt magic. I think the ultimate win will depend on what your systems know and how quickly they come to know it. The smartest AI systems aren't the ones with the largest context windows but the ones that manage context best, thanks to real-time data, dynamic memory, and intelligent extraction.

So as engineers, we should stop treating every new data source, feedback loop, or real-time stream as mere "content" and start treating it as a capability. A new data stream isn't necessarily noise; it may well be signal.

Maybe you've already built this kind of critical AI infrastructure; you just haven't called it that yet.

Maybe you've started thinking about feeding data (e.g., APIs) into your own internal intelligence layer and realized: you don't need the biggest model. You just need the right pipeline.

Teams with this mindset, treating web-scale data collection as infrastructure rather than an afterthought, will move faster, learn more, and win at lower cost.

This article is from the WeChat official account "Data-Driven Intelligence" (ID: Data_0101), author: Xiaoxiao, published by 36Kr with authorization.