
How did I transform scattered data into a platform for artificial intelligence?

王建峰2026-04-14 20:11

A hierarchical architecture that integrates dozens of heterogeneous data sources, applies artificial intelligence, and feeds the results back into the whole operation, with zero operational risk and without rewriting a single line of code in the original systems.

All companies that have been in operation for more than five years face the same problem: no one planned at the beginning how the various systems would interconnect. The CRM runs on MySQL, the financial system on PostgreSQL, the operations platform on a completely different MySQL instance, marketing data flows through spreadsheets and SaaS tools each with its own API, and the rest is scattered across queues, caches, storage buckets, and NoSQL databases that were set up for some pilot project long ago and have been used ever since. Each system was the best choice at the time it was adopted. The real problem has always been the absence of an integration strategy.

When a supervisor asks, "I want a report on customer engagement and revenue," the honest answer from many companies is, "We need two weeks and three Excel spreadsheets put together."

This article describes the architecture I designed and implemented to solve this problem, focusing on the rationale behind each decision.

Scenario: plenty of data, none of it usable

Over the years, customers have accumulated dozens of systems: different versions of MySQL databases, PostgreSQL instances, third-party APIs, shared spreadsheets serving as the "database" for an entire department, and data in SaaS tools that can only be accessed through manual exports. The amount of data is quite substantial, but there are three key problems:

Fragmentation. The same entity (customer, contract, transaction) exists in two or three systems, with different schemas, different IDs, and different degrees of freshness. There is no single source of truth.

Staleness. Abandoned fields, duplicate records, and categories that no longer reflect how the business actually operates. The data ages in place and is not maintained.

Unreadiness for AI, which is the decisive factor. The company wants to use artificial intelligence for recommendations, automation, and predictive analysis, but AI depends on consistent, structured data. If a language model receives fragmented context, it produces fragmented responses. If a scoring model is fed duplicate data, no one trusts the rankings it generates.

The challenge is both technical and strategic: make the company's data assets available to artificial intelligence without interrupting the operations that rely on the existing systems.

Principle: Zero impact on resources

The first, and perhaps the most important, architectural design decision is to keep the original systems intact.

Projects such as rewriting a legacy customer relationship management (CRM) system or migrating a financial enterprise resource planning (ERP) system to a unified database take months, carry operational risk, and often fail. I have seen this happen firsthand: after a "brand-new, upgraded" system went live, it retained only 70% of the old system's functionality, and the operations team spent six months manually compensating for what was missing.

The approach I adopted is based on a simple premise: keep all the original applications running as they are and continue to provide data to their respective databases, APIs, and storage devices as usual. Each data source, regardless of the technology used (MySQL, PostgreSQL, REST API, storage bucket files, SaaS exports), is mirrored incrementally and non-intrusively to a centralized data lake.

In fact, there is no change in the daily work of the operations team. The customer relationship management system (CRM) continues to run, the enterprise resource planning system (ERP) continues to issue invoices, the web platform continues to serve users, and the marketing department continues to fill out spreadsheets. In the background, every change in all these data sources will be captured and replicated to a separate storage layer, regardless of whether these changes come from a relational database, an API, or a CSV file in a storage bucket.

Hierarchical architecture

The final design is divided into four different layers, each with clear responsibilities.

Layer 1: Original applications

Different versions of MySQL, PostgreSQL, SaaS APIs, spreadsheets, message queues, and files in storage. Each system continues to run as usual. The only interaction points are the capture mechanisms suitable for each technology: CDC for relational databases, API connectors for SaaS, file imports for exports and spreadsheets, and consumers for queues. The difficulty lies in how to handle this diversity without imposing a single standard on the source systems. This layer is only observed and never modified.
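The diversity of capture mechanisms can be sketched as a simple dispatch table. This is a minimal illustration, not the author's actual implementation; all names here are hypothetical:

```python
# Hypothetical sketch: route each source type to a capture mechanism suited
# to its technology, without ever modifying the source system itself.
CAPTURE_STRATEGIES = {
    "mysql": "cdc",           # change data capture from the binlog
    "postgresql": "cdc",      # change data capture via logical replication
    "saas_api": "connector",  # scheduled incremental API pulls
    "spreadsheet": "file",    # event-driven ingestion of exported files
    "queue": "consumer",      # subscribe as one more consumer
}

def capture_strategy(source_type: str) -> str:
    """Return the non-intrusive capture mechanism for a given source type."""
    if source_type not in CAPTURE_STRATEGIES:
        raise ValueError(f"no capture strategy defined for {source_type!r}")
    return CAPTURE_STRATEGIES[source_type]
```

The point of keeping this mapping explicit is that adding a new source means adding a row, not imposing a single ingestion standard on every system.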

Layer 2: Data lake

The data here is organized into three areas:

The raw data area directly receives the data output from the source systems without any transformation or cleaning. It is a true historical record of the data generated by each system and is crucial for auditing and re - processing when business rules change.

The staging area performs preliminary transformations: data cleaning, deduplication, type correction, and quality verification. For example, the "phone" field, which was originally a free varchar type in MySQL, is standardized here. Duplicate records will be identified and marked.

In the "curated area", the data is given a business structure. Entities are integrated: the "customers" that were originally scattered across three systems now become a single entity with unified attributes. The relational model begins to reflect the actual situation of the business and is consistent with operational and corporate terminology.
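The staging-area transformations described above can be sketched with the article's own example, the free-form "phone" varchar. This is an illustrative toy, assuming a normalized phone number serves as the deduplication key:

```python
import re

def normalize_phone(raw: str) -> str:
    """Staging-area cleanup of a free-form varchar 'phone' column:
    keep only digits and a leading '+'."""
    return re.sub(r"[^\d+]", "", raw or "")

def deduplicate(records: list[dict]) -> list[dict]:
    """Mark duplicates by a business key (here: the normalized phone).
    Records are flagged, not dropped, so the raw history stays intact."""
    seen: set[str] = set()
    out = []
    for rec in records:
        key = normalize_phone(rec["phone"])
        out.append({**rec, "phone": key, "is_duplicate": key in seen})
        seen.add(key)
    return out
```

In a real pipeline the dedup key would combine several attributes (name, email, document number), but the shape of the step is the same: normalize first, then flag.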

Layer 3: Enrichment and intelligence

At this layer, the data starts working for artificial intelligence; this is where the architecture delivers its differentiated value.

Building on the integration done in the curated area, this layer runs processes that add information. Scoring models calculate propensity, relevance, and risk. Matching algorithms cross-compare profiles to generate recommendations. External enrichment adds third-party data (geocoding, market data, supplementary records). Business metrics are pre-calculated and exposed for fast access.

The key point is that some end applications already run at this layer. The matching system, for example, runs as a service that consumes and produces data in near real time. This layer is an indispensable part of the entire operation.
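A propensity score of the kind this layer computes can be illustrated with a toy function. The signals and weights below are invented for the example and are not calibrated on any real data:

```python
def propensity_score(customer: dict) -> float:
    """Toy enrichment-layer score combining pre-computed behavioral signals.
    Each signal is capped at 1.0; weights are illustrative only."""
    score = (
        0.5 * min(customer.get("orders_90d", 0) / 10, 1.0)
        + 0.3 * min(customer.get("sessions_30d", 0) / 20, 1.0)
        + 0.2 * (1.0 if customer.get("opened_last_campaign") else 0.0)
    )
    return round(score, 3)
```

The important architectural property is not the formula but its inputs: the score is computed over the unified customer entity from the curated area, not over three conflicting records.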

Layer 4: End applications (consumption and feedback)

Dashboards, intelligent APIs, automated processes, and AI agents. All these applications use the integrated, cleaned, and enriched data. When an AI agent answers questions about customers, it queries a single data layer that has integrated and enriched information from all sources.

A basic feature of this layer is that it feeds back into all the previous ones. Scores calculated by machine-learning models can be written back to the source databases, so traditional applications also benefit from the generated intelligence. New data produced by automated processes flows into the data lake. Results from AI agents trigger recalculation of metrics in the enrichment layer. The whole system forms a cycle: each layer benefits from the contributions of the others.
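The write-back of scores to a source database can be sketched as an idempotent upsert. Here an in-memory SQLite database stands in for the real CRM database, and the table name is hypothetical:

```python
import sqlite3

def write_back_scores(conn: sqlite3.Connection, scores: dict[str, float]) -> None:
    """Write enrichment-layer scores back into a (stand-in) source database
    so legacy applications can read them. Upsert keeps the feedback loop
    idempotent: re-running a cycle overwrites, never duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_scores "
        "(customer_id TEXT PRIMARY KEY, score REAL)"
    )
    conn.executemany(
        "INSERT INTO customer_scores (customer_id, score) VALUES (?, ?) "
        "ON CONFLICT(customer_id) DO UPDATE SET score = excluded.score",
        scores.items(),
    )
    conn.commit()
```

In production this write would go through the same change-capture machinery as any other update, so the new scores themselves flow back into the data lake on the next cycle.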

Important decisions

Some technical decisions are worthy of attention because they are crucial to the success of the project.

Use change data capture (CDC) for databases and connectors for everything else. CDC replicates only what changed in a relational database, shrinking the latency window and the processing cost. But CDC is just one strategy among several: SaaS APIs get connectors with scheduled synchronization, while spreadsheets and exported files get file-event-based ingestion. Choosing the right mechanism for each source is part of the architectural design, and a wrong choice produces a fragile pipeline.

Schema-on-read in the raw area. Allowing data to arrive without a fixed schema absorbs changes in the source systems without breaking the pipeline. When the product team adds a field to the CRM, the data lake absorbs it automatically. Transformation happens in later layers, where it can be controlled.
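Schema-on-read can be illustrated in a few lines: raw rows are stored as-is, and the schema is applied only when reading. The sample rows and field names below are invented for the example:

```python
import json

# Raw area: records stored exactly as the source emitted them.
# The second row carries a field the pipeline never declared.
RAW = [
    '{"id": 1, "name": "Ada"}',
    '{"id": 2, "name": "Grace", "segment": "enterprise"}',
]

def read_with_schema(raw_rows: list[str], fields: list[str]) -> list[dict]:
    """Apply the schema at read time: unknown fields are ignored and missing
    fields become None, so a new source column never breaks ingestion."""
    return [{f: json.loads(row).get(f) for f in fields} for row in raw_rows]
```

A schema-on-write pipeline would have rejected the second row (or silently dropped the new field forever); here the field is preserved in the raw area and becomes queryable the moment someone asks for it.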

Materialized views of metrics. Pre-calculated key performance indicators (KPIs) are updated incrementally, with no need for complex real-time queries. The dashboard loads in under 2 seconds because the heavy computation has already been done.
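The incremental-update idea can be shown with a minimal in-memory stand-in for a materialized view. Real systems would persist this and update it from the change stream; the class and event shape here are hypothetical:

```python
class RevenueByDay:
    """Minimal incremental 'materialized view': the KPI is updated once per
    incoming event, so a dashboard query is a dictionary lookup rather than
    a scan over raw transactions."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = {}

    def apply(self, event: dict) -> None:
        """Fold one change event into the pre-computed aggregate."""
        day, amount = event["day"], event["amount"]
        self.totals[day] = self.totals.get(day, 0) + amount

    def query(self, day: str) -> float:
        """Dashboard-side read: O(1), no heavy query at load time."""
        return self.totals.get(day, 0)
```

The trade-off is the classic one: a little extra work on every write buys a cheap read path, which is exactly what a sub-2-second dashboard needs.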

Separate data enrichment from data consumption. Separating the intelligence layer from the consumption layer allows different applications to use different levels of processing. A simple dashboard can directly obtain data from the curated area. An AI agent, on the other hand, requires data that has been enriched. This flexibility avoids over - designing for simple use cases.

Feedback loops between layers. Allowing end - applications to write data back to previous layers makes the architecture a dynamic system. The data is continuously improved in each cycle: the scores generated in Layer 3 can optimize the processing in Layer 1, resulting in higher - quality data in the next data ingestion cycle.

Result: Data as a strategic asset

The company has obtained a unified data platform that lays the foundation for every artificial intelligence and automation initiative in the company's operations.

Features that once took weeks of manual work can now be shipped in days. New AI agents connect to the platform without custom integration. New applications consume integrated, enriched data from day one.

None of the original systems was stopped or rewritten. Operational risk was zero: the data infrastructure was built in parallel while the teams' work continued as normal.

For those facing the same challenges

If your company's data is scattered across multiple systems and artificial intelligence projects are struggling to get off the ground, the most productive question to ask is: "Is our data ready to feed artificial intelligence reliably?"

In most cases, the answer is no. The future direction is to build an intelligence layer on top of existing systems.

This is the type of project I design and deliver: data maturity assessment, integration architecture design, and platform implementation that turns fragmented data into a strategic asset ready for artificial intelligence.

This article is from the WeChat official account "Data - Driven Intelligence" (ID: Data_0101), author: Xiaoxiao, published by 36Kr with authorization.