
5 Ways to Uncover the Value of "Unstructured" Data

Wang Jianfeng · 2025-12-09 12:04
The rules of the game have changed.

We have long treated the value of text and images as an afterthought. By 2026, if your data platform can't connect SQL tables with PDF files, it will be useless.

There's a hidden secret in the modern data stack. In the past decade, we've been obsessed with the 10% of "clean" data in enterprise data—the rows and columns neatly organized in Snowflake, BigQuery, or Databricks. We've built complex governance, lineage, and observability systems around integer fields and standardized timestamps.

Meanwhile, the remaining 90% of the data—the enterprise's "dark matter"—is rotting in S3 or GCS storage buckets and cloud drive folders: PDFs, emails, call records, and images.

By 2025, the rules of the game have changed. Your CEO no longer cares about the star schema you've painstakingly built. They just want to ask: "Which supplier contracts (PDFs) have termination penalties higher than the revenue (SQL) we earned from these contracts last quarter?"

If your answer is: "I need three weeks to build a custom crawler and a separate vector pipeline," then you're outdated.

Here are the less appealing truths about the state of unstructured data in 2025 and why your SQL-only skills will become a liability.

The Non-existent "Connection"

The fundamental disconnect in 2025 is that we still lack a native, high-performance way to join semantic similarity with relational keys: in effect, a LEFT JOIN between embeddings and tables.

We have vector databases for similarity searches and relational databases for exact logic. Connecting them is like taping a jet engine to a horse—it's an engineering challenge.

The reality is: You can use vector search to find "similar" contracts and SQL to find "revenue." However, precisely mapping specific paragraphs in scanned PDFs to specific transaction IDs in a Postgres table is a nightmare of fuzzy matching, hallucinations, and broken lineage.

The less appealing solution: We're seeing the rise of "AI functions" inside data warehouses. This trend isn't about moving data to vector databases but rather bringing large language models (LLMs) to the data itself.

Expert advice: Stop building standalone "unstructured data platforms." Aim for an architecture that can run SELECT extract_contract_value(pdf_blob) FROM documents directly in the main data warehouse. If your platform can't run inference from inside SQL, migrate.
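The warehouse-native pattern can be sketched in Python. Here, extract_contract_value is a hypothetical UDF body; a regex stands in for the LLM call so the sketch runs, and the SELECT-over-documents is modeled as a map over rows.

```python
import re

def extract_contract_value(text: str):
    """Stand-in for a warehouse AI function: pull the total contract
    value out of raw document text. A real deployment would call an
    LLM via a warehouse UDF; a regex substitutes here."""
    m = re.search(r"total contract value[:\s]*\$?([\d,]+(?:\.\d+)?)", text, re.I)
    return float(m.group(1).replace(",", "")) if m else None

# The SELECT extract_contract_value(pdf_blob) FROM documents pattern,
# expressed as a map over already-parsed document rows.
documents = [
    {"doc_id": 1, "text": "... Total Contract Value: $1,250,000 ..."},
    {"doc_id": 2, "text": "no figure stated"},
]
values = {d["doc_id"]: extract_contract_value(d["text"]) for d in documents}
print(values)  # {1: 1250000.0, 2: None}
```

The point of the pattern is locality: the extraction runs where the documents and the revenue tables already live, so the result can be joined against SQL facts without a separate pipeline.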

"Token Tax" Is the New Cloud Bill Shock

In 2020, we panicked about Snowflake credits; in 2025, we're panicking about Token consumption.

Treating unstructured data as a first-class citizen means digitizing it. However, extracting structure from millions of documents with multimodal large language models (such as GPT-4o or Gemini 3 Pro) is not only slow but ruinously expensive if done blindly.

By some estimates, processing 1 PB of unstructured text for RAG (Retrieval-Augmented Generation) without optimization could cost up to $150,000 in API fees.

The less appealing solution: Small language models (SLMs). You don't need a frontier reasoning model to extract dates from invoices.

Expert advice: Build a model-routing layer. Use inexpensive small BERT models or dedicated SLMs to handle 90% of the extraction work (OCR, classification, entity extraction). Only use expensive "smart" models for complex inference tasks. Your CFO will thank you.
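A minimal router might look like the sketch below. The task kinds and model names (slm-extractor, frontier-llm) are illustrative, not a real API; the only idea being shown is that routing is a cheap, deterministic decision made before any token is spent.

```python
def route_extraction(task: dict) -> str:
    """Route a document task to the cheapest model that can handle it.
    Model names are hypothetical placeholders."""
    simple = {"ocr", "classification", "entity_extraction"}
    if task["kind"] in simple:
        return "slm-extractor"   # small, cheap, fine-tuned model
    if task.get("needs_reasoning"):
        return "frontier-llm"    # expensive, reserved for hard cases
    return "slm-extractor"       # default to the cheap tier

tasks = [
    {"kind": "ocr"},
    {"kind": "entity_extraction"},
    {"kind": "contract_review", "needs_reasoning": True},
]
print([route_extraction(t) for t in tasks])
# ['slm-extractor', 'slm-extractor', 'frontier-llm']
```

In production the routing signal usually comes from a cheap classifier rather than a hand-set flag, but the economics are the same: the expensive model only ever sees the minority of tasks that justify it.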

OCR Is Still the Worst Part of Your Job

We have near-AGI reasoning capabilities, yet we still struggle to read a table that spans two pages of a PDF.

The "unstructured" problem is often just a disguised "parsing" problem. Most RAG pipelines fail not because of flaws in the LLM itself but because PDF parsers scramble text, merge two columns, or ignore crucial footnotes.

The reality is: "Garbage in, hallucination out." If your parsing tool feeds a jumbled mess of headers and footers to the model, no amount of prompt engineering will help.

The less appealing solution: Multimodal parsers. By the end of 2025, the trend is shifting from heuristic parsers (like PyPDF2) to vision-LLM-based parsers that "look" at a screenshot of the document to understand its layout before reading the text.

Expert advice: Invest heavily in the data ingestion layer. The return on investment from better parsers is 10 times higher than that from better LLMs.
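The layout-aware ingestion pipeline described above can be sketched as follows. Both render_page and vision_parse are stubs so the sketch runs; a real pipeline would rasterize pages with a PDF library and call a vision-capable model, and the returned block kinds are an assumed, not standard, taxonomy.

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str  # "heading" | "paragraph" | "table" | "footnote"
    text: str

def render_page(pdf_bytes: bytes, page: int) -> bytes:
    """Stub: rasterize one PDF page to an image. Real code would use
    a rasterizer such as poppler or pdfium."""
    return b"fake-png"

def vision_parse(image: bytes) -> list:
    """Stub: a vision LLM that 'looks' at the page and returns
    layout-aware blocks instead of a scrambled text stream."""
    return [
        Block("heading", "Termination"),
        Block("table", "Penalty | $500,000"),
        Block("footnote", "Applies only after year 2."),
    ]

def parse_document(pdf_bytes: bytes, pages: int) -> list:
    blocks = []
    for p in range(pages):
        blocks.extend(vision_parse(render_page(pdf_bytes, p)))
    return blocks

blocks = parse_document(b"%PDF-...", pages=1)
print([b.kind for b in blocks])  # ['heading', 'table', 'footnote']
```

The payoff is downstream: because tables and footnotes arrive as typed blocks rather than interleaved text, the retrieval layer can keep a penalty table attached to the footnote that qualifies it.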

Metadata Is the New Gold Again

Vector search is probabilistic; it's a guess. In highly regulated industries, saying "I think this is the right document" could land you in court.

To make unstructured data usable, you need deterministic anchors, and those anchors are metadata. By 2025, the most successful data teams don't just embed text; they use agents to attach structured attributes (such as customer ID, date, region) to the text before it enters the vector store.

The less appealing solution: Hybrid search.

Expert advice: Never rely solely on semantic search. Your retrieval strategy should always be (Vector Similarity) AND (SQL Filter). Ensure that each piece of unstructured data you ingest carries at least 3–5 structured metadata fields.

The Rise of Document "Data Products"

We used to treat files as "data chunks." Now, they're products.

In 2025, a PDF contract is no longer just a file but a container for data products: it holds lists of obligations, payment plans, and risk profiles. The data engineer's job is to break that container down into usable, queryable data assets.
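One way to picture the "container" being broken down is a typed record per contract; the schema below is purely illustrative, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class ContractProduct:
    """A PDF contract decomposed into queryable assets.
    Field names are an illustrative, hypothetical schema."""
    contract_id: str
    obligations: list = field(default_factory=list)     # extracted duty clauses
    payment_plan: list = field(default_factory=list)    # (due_date, amount) pairs
    risk_flags: list = field(default_factory=list)      # machine-checkable risks

product = ContractProduct(
    contract_id="SUP-2025-017",
    obligations=["Deliver quarterly usage report"],
    payment_plan=[("2026-01-15", 125_000.0)],
    risk_flags=["termination_penalty_exceeds_quarterly_revenue"],
)
print(product.contract_id, len(product.payment_plan))
```

Once a contract is landed in this shape (say, as rows in an Iceberg table), the CEO's question from the introduction becomes an ordinary join between risk_flags and the revenue fact table rather than a three-week crawler project.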

The future: We're moving towards a "universal data lake" (thanks to open formats like Apache Iceberg), where images, videos, and text coexist with tables, all managed by a single catalog.

Expert advice: Audit your data catalog. If a search for "Q3 financial data" returns only tables and no PDF reports, your catalog has a problem.

The future isn't about the SQL vs. NoSQL debate but about the battle between structured and unstructured data and how quickly we can bridge the gap between them.

This article is from the WeChat official account "Data-Driven Intelligence" (ID: Data_0101), author: Xiaoxiao. It is published by 36Kr with authorization.