
From business systems to data intelligence: The complete evolution of data analysis systems

王建峰 · 2025-12-16 16:03

Imagine this: You're standing in a bustling retail store in 1985. Every time a customer makes a purchase, the cash register immediately records the transaction details - the item, the price, and the timestamp. This is the store's pulse, the rhythm of its operations. Now, imagine trying to answer questions like: "Which products were the top sellers last quarter?" or "What are the revenue trends across all our stores?" Suddenly, the system that can process thousands of transactions per second struggles to provide answers. It's like asking a sprinter to run a marathon - the same athlete, but a completely different race.

The fundamental tension between recording what's happening in the moment and understanding its meaning has driven innovation in data systems for the past five decades. Today, we'll trace this journey, from the earliest transactional databases to today's AI-driven analytics platforms that can answer questions in natural language. Understanding this evolution isn't just about learning history; it's a roadmap for making better architecture decisions today.

The Two Worlds of Data: OLTP and OLAP

Before diving into the timeline, let's establish the core difference that shapes everything that follows.

Online Transaction Processing (OLTP) systems are designed to handle a company's day-to-day operations. You can think of them as cash registers - they need to be fast, accurate, and always available to record every transaction. When you order a product online, update your profile, or transfer money between accounts, you're interacting with an OLTP system.

On the other hand, Online Analytical Processing (OLAP) systems are designed for analysis and reporting. They're like the comprehensive reports accountants generate at the end of the month, aggregating thousands or even millions of transactions to reveal patterns, trends, and insights. OLTP focuses on "What just happened?", while OLAP focuses on "What does it all mean?"

The fundamental difference lies in their access patterns: OLTP systems are optimized for writing large numbers of small transactions quickly and reading individual records instantly, while OLAP systems are optimized for reading vast amounts of data, aggregating it, and performing complex calculations across multiple dimensions. No single system can do both jobs efficiently - and that realization has driven decades of architectural innovation.
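To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module; the orders table and its columns are hypothetical, and real OLTP and OLAP workloads would of course run on dedicated, very different engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        product    TEXT,
        amount     REAL,
        created_at TEXT
    )
""")

# OLTP-style work: record one small transaction and look up one specific row.
conn.execute(
    "INSERT INTO orders (product, amount, created_at) VALUES (?, ?, ?)",
    ("running shoes", 89.99, "2023-07-14T10:32:00"),
)
conn.commit()
row = conn.execute("SELECT * FROM orders WHERE order_id = ?", (1,)).fetchone()

# OLAP-style work: scan and aggregate many rows across a dimension.
report = conn.execute("""
    SELECT product, SUM(amount) AS revenue, COUNT(*) AS num_orders
    FROM orders
    GROUP BY product
    ORDER BY revenue DESC
""").fetchall()
```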

Key takeaway: OLTP handles real-time operational transactions; OLAP handles analytical queries based on historical data. These opposing requirements have led to completely different system designs.

The Rise of OLAP and Data Cubes (1990s)

By the 1990s, businesses were desperate for faster ways to analyze data. The solution came in the form of dedicated OLAP systems, which introduced a revolutionary concept: data cubes. Instead of storing raw transaction data, these systems pre-aggregated data across multiple dimensions.

You can think of a data cube as a multi-dimensional spreadsheet. For example, you could use sales as the measure (the value you're counting) and set three dimensions: time (year, quarter, month), product (category, brand, SKU), and geography (country, region, city). Each cell in the cube contains a pre-calculated value, like "Total sales of Nike running shoes in California in Q3 2023". This means queries that used to take hours can now return results in seconds.
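As a rough illustration of what a cube pre-computes, here is a small pandas sketch; the sales records and dimension names are invented for the example, and a real MOLAP engine would store far more dimensions and aggregation levels.

```python
import pandas as pd

# Hypothetical raw sales records: each row is one transaction.
sales = pd.DataFrame({
    "quarter": ["2023-Q3", "2023-Q3", "2023-Q4", "2023-Q4"],
    "brand":   ["Nike", "Adidas", "Nike", "Nike"],
    "region":  ["California", "California", "Texas", "California"],
    "amount":  [120.0, 80.0, 95.0, 200.0],
})

# Pre-aggregate the measure (amount) across time, product, and geography
# dimensions - conceptually what an OLAP cube stores ahead of query time.
cube = sales.pivot_table(
    index=["quarter", "brand"],
    columns="region",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)
print(cube)
```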

Three architectural design approaches emerged:

MOLAP (Multidimensional OLAP) systems like Hyperion Essbase and IBM Cognos store data in optimized multi-dimensional arrays, delivering exceptional query speeds but requiring extensive preprocessing and struggling with very large datasets.

ROLAP (Relational OLAP) systems like MicroStrategy build OLAP capabilities on top of existing relational databases, offering more flexibility and scalability at the cost of slower query performance.

HOLAP (Hybrid OLAP) systems like Microsoft Analysis Services try to combine the best of both worlds, storing detailed data in relational tables while keeping aggregated summaries in multi-dimensional cubes.

The business drivers were clear:

Executives and analysts needed dashboards and reports to make data-driven decisions. Tools like Business Objects, Cognos, and Crystal Reports became the front-end interfaces, while OLAP engines handled the back-end calculations.

Key takeaway: OLAP cubes solved the analytics performance problem through pre-aggregation across dimensions, but required a lot of ETL processing and were difficult to scale to modern data volumes.

The Data Warehouse Boom (Late 1990s - Early 2000s)

As the value of OLAP became more apparent, organizations realized they needed a centralized repository optimized specifically for analysis. Thus the data warehouse was born - a subject-oriented, integrated, time-variant, and non-volatile collection of data designed to support business intelligence.

The canonical architecture emerged: Extract, Transform, Load (ETL) pipelines extract data from multiple source systems (CRM, ERP, transactional databases), clean and transform it according to business rules, and load it into the data warehouse according to a carefully designed schema.
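A toy ETL run might look like the sketch below, using pandas and SQLite as stand-ins for real source and warehouse systems; the source table, the cleansing rule, and the fact_sales target are all hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical source system and warehouse, both in-memory for the sketch.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE sales (customer_id INTEGER, amount REAL, sold_at TEXT);
    INSERT INTO sales VALUES (1, 120.0, '2023-07-14 10:32:00'),
                             (2, -30.0, '2023-07-14 11:05:00');
""")

# Extract: pull raw rows from the operational system.
raw = pd.read_sql_query("SELECT customer_id, amount, sold_at FROM sales", source)

# Transform: apply business rules (drop refunds, normalize timestamps to dates).
clean = raw[raw["amount"] > 0].copy()
clean["sold_at"] = pd.to_datetime(clean["sold_at"]).dt.date

# Load: write the cleaned rows into the warehouse's fact table.
clean.to_sql("fact_sales", warehouse, if_exists="append", index=False)
```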

Two schemas dominated:

The Star Schema organizes data around a central fact table (like a sales table), surrounded by dimension tables (like customer, product, time, and location tables). This denormalized structure optimizes read performance at the expense of storage space.

The Snowflake Schema further normalizes the dimension tables to reduce redundancy, creating a more complex structure that looks like a snowflake when represented graphically.
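For illustration, a star schema for retail sales might be declared roughly as follows; the table and column names are invented, and a real warehouse would carry many more attributes per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the who, what, and when of each sale.
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, category TEXT, brand TEXT, sku TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, segment TEXT);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, year INTEGER, quarter TEXT, month TEXT);

    -- The central fact table holds the measures plus foreign keys to each dimension.
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        product_id  INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        quantity    INTEGER,
        amount      REAL
    );

    -- A snowflake schema would further normalize dim_product into
    -- separate brand and category tables, reducing redundancy.
""")
```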

Enterprise data warehouses from Teradata, Netezza, and Vertica introduced key innovations. They used columnar storage instead of row-based storage, which revolutionized analytical workloads: when calculating the average price across a billion sales records, columnar storage lets you read only the price column from disk and skip everything else. It also improves compression dramatically (similar values stored together compress well) and speeds up queries.
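A small way to feel this difference today is to read a single column from a Parquet file, a columnar format; this sketch uses pandas and assumes a Parquet engine such as pyarrow is installed, with file and column names chosen only for the example.

```python
import pandas as pd

# Write a synthetic table to Parquet, a columnar on-disk format.
sales = pd.DataFrame({
    "order_id": range(1, 1_000_001),
    "price": [9.99] * 1_000_000,
    "region": ["CA"] * 1_000_000,
})
sales.to_parquet("sales.parquet")

# Because the file is columnar, only the one column the query needs is read.
prices = pd.read_parquet("sales.parquet", columns=["price"])
print(prices["price"].mean())
```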

They also implemented Massively Parallel Processing (MPP) architectures, distributing data and queries across multiple nodes that work together. Now, you can scale horizontally by adding more machines to the cluster.

But this approach had limitations. The schema had to be defined upfront (schema-on-write), making it costly to add new data sources or change business logic. Hardware scalability was still limited, and these systems could cost hundreds of thousands or even millions of dollars.

Key takeaway: Data warehouses centralized analysis through columnar storage and MPP processing, but were still expensive, inflexible, and limited by hardware.

The Big Data and Hadoop Era (Late 2000s - 2010s)

Then, the internet changed everything. Companies like Google, Yahoo, and Facebook were flooded with massive amounts of new types of data: web logs, clickstreams, sensor data, social media posts, images, and videos. This data wasn't the structured transactional data that could fit neatly into database tables; it was messy, semi-structured, and in unprecedented quantities.

Traditional data warehouses couldn't handle this. They were designed to store structured data with a known schema, not raw log files or JSON documents. It wasn't economically feasible either - you couldn't afford to buy enough Teradata licenses to store petabytes of web crawler data.

Google published papers describing its internal systems: GFS (Google File System) for distributed storage and MapReduce for distributed computing. These systems inspired the birth of the open-source Hadoop ecosystem, which in turn became the foundation of the "big data" movement.

Hadoop introduced two revolutionary ideas:

HDFS (Hadoop Distributed File System) distributes data across commodity hardware and automatically replicates it for fault tolerance. You can use hundreds of cheap machines instead of expensive, specialized hardware to store petabytes of data at low cost.

MapReduce allows you to write parallel computations that run automatically across all machines. The framework handles all the complexity of task distribution, fault management, and result aggregation.
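This is not Hadoop code, but a toy word count in plain Python can illustrate the shape of the MapReduce programming model: a map phase that emits key-value pairs, a shuffle that groups them by key, and a reduce phase that combines each group.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data big ideas", "data lakes and data warehouses"]

# Map: emit (key, value) pairs from each input record independently.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group all emitted values by key (the framework does this
# between the map and reduce phases, across machines).
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

# Reduce: combine the values for each key into a final result.
word_counts = {key: reduce(lambda a, b: a + b, values)
               for key, values in groups.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 3, ...}
```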

Projects like Apache Hive added SQL-like query capabilities on top of MapReduce, allowing analysts to query the data lake using familiar syntax. Impala and Presto (now Trino) provided faster, more interactive SQL engines. Spark, with its in-memory computing model, became a more flexible and high-performance alternative to MapReduce.

This era introduced the concept of a data lake - a centralized repository that stores all data in its raw form, using schema-on-read instead of schema-on-write. You can ingest data first and then figure out how to use it.
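A hedged PySpark sketch of schema-on-read might look like this; it assumes a running Spark environment with access to the storage layer, and the clickstream path and page field are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema-on-read: the raw JSON click logs were dumped into the lake as-is;
# Spark infers a structure only at query time (the path is hypothetical).
clicks = spark.read.json("s3a://my-data-lake/raw/clickstream/")
clicks.createOrReplaceTempView("clicks")

# Familiar SQL over raw files, in the spirit of Hive and Spark SQL on a data lake.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clicks
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```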

But Hadoop had serious limitations for OLAP workloads. Queries often took minutes or even hours. It didn't support transactions or updates - you could only append data. Operational complexity was extremely high, requiring a large team to keep the cluster running.

Key takeaway: Hadoop made big data storage and processing accessible using commodity hardware, but its high latency and operational complexity made it unsuitable for interactive analysis.

Cloud Warehouses and the Separation of Compute and Storage (2010s)

Just as Hadoop's popularity reached its peak, a new generation of cloud-native data warehouses emerged with a completely different architecture. Snowflake (released in 2014), Google BigQuery, and Amazon Redshift redefined what a data warehouse looks like in the cloud era.

The breakthrough was the complete separation of compute and storage. Traditional data warehouses tightly coupled the two - data was stored on the same machines that processed queries. This created a dilemma: Do you buy enough capacity to handle peak query loads (wasting resources most of the time) or enough for the average load (sacrificing performance during busy periods)?

Cloud data warehouses solved this problem by storing all data in cheap, durable object storage (like S3, Google Cloud Storage, and Azure Data Lake Storage) and spinning up compute clusters on demand. Need to run a large report? Just scale up to 100 nodes, run the query in minutes, and then scale back down. You only pay for compute when the query is running, and the cost of static storage is very low.

Snowflake pioneered the "multi-cluster shared data" architecture with true elasticity. BigQuery took it a step further with a serverless model, where users don't even need to configure a cluster - just write SQL, and Google automatically allocates resources.
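For example, a serverless query against BigQuery can be a few lines of client code; this sketch assumes the google-cloud-bigquery package is installed and GCP credentials are configured, and the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

# No cluster to provision: the client submits SQL and Google allocates
# the compute behind the scenes.
client = bigquery.Client()

query = """
    SELECT region, SUM(amount) AS revenue
    FROM `my_project.analytics.sales`   -- hypothetical table
    GROUP BY region
    ORDER BY revenue DESC
"""
for row in client.query(query).result():
    print(row["region"], row["revenue"])
```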

These systems also brought other advantages:

Cloud economics: The pay-as-you-go pricing model means startups can access enterprise-grade analytics without a huge upfront investment.

Instant elasticity: Scale compute up or down in seconds based on the workload.

Zero management: No need to manage any hardware, with automatic backups, updates, and optimizations.

Data sharing: Easily and securely share datasets between organizations.

Thanks to innovations in columnar formats, advanced compression, and intelligent query optimization, performance was also excellent. Cloud data warehouses can scan terabytes of data in seconds.

Did you know?

Snowflake was founded by data warehouse veterans from Oracle who asked, "What if we designed a data warehouse entirely for the cloud era, free from the limitations of traditional systems?" The bet paid off: when Snowflake went public in 2020, it was the largest software IPO to date, and its market value topped $70 billion on its first day of trading.

Key takeaway: Cloud data warehouses revolutionized analysis by decoupling storage and compute, enabling instant elasticity, eliminating operational overhead, and making enterprise-grade analytics accessible to organizations of all sizes.

Open Table Formats and Cloud Query Engines (Late 2010s - 2020s)

While cloud data warehouses were powerful, they introduced a new problem: data lock-in. Once data is in Snowflake or BigQuery, it's in a proprietary format. Want to use other tools? You have to copy or export the data, doubling the cost and complexity.

Meanwhile, data lakes stored data in open formats but lacked the key features that made data warehouses work: ACID transactions (atomicity, consistency, isolation, durability), schema evolution, time travel, and efficient updates and deletes.

The solution came from open table formats, which brought database-like capabilities to data lakes:

Apache Iceberg (created by Netflix) provides full ACID transactions, schema evolution, hidden partitioning, and time travel for data in object storage. It maintains metadata separately from the data, enabling efficient query planning and partition pruning.

Delta Lake (open-sourced by Databricks) offers similar capabilities and is tightly integrated with the Spark ecosystem. It uses a transaction log to maintain ACID properties and supports streaming writes and batch reads.

Apache Hudi (created by Uber) is specialized for incremental data processing and efficient upsert operations, making it ideal for change data capture scenarios.

These formats are open-source and interoperable, storing data in standard columnar formats like Parquet while retaining rich metadata about transactions, schemas, and partitions.
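As a rough sketch of what these formats enable, here is a small Delta Lake example using the deltalake (delta-rs) Python package; the local path and tiny DataFrames are purely illustrative, and the same table could live in object storage.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Two writes to a Delta table on local disk; each write is recorded
# as a new version in the table's transaction log.
write_deltalake("./sales_delta", pd.DataFrame({"sku": ["A"], "amount": [10.0]}))
write_deltalake("./sales_delta", pd.DataFrame({"sku": ["B"], "amount": [25.0]}),
                mode="append")

# Time travel: read the table as it looked at its first version.
latest = DeltaTable("./sales_delta").to_pandas()
original = DeltaTable("./sales_delta", version=0).to_pandas()
```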

Alongside the table formats, a new generation of query engines emerged:

Trino (evolved from Presto) provides blazingly fast SQL across various data sources - you can query data in S3, PostgreSQL, and MongoDB with a single query.

Dremio introduced data reflections (automatically maintained aggregations) and a semantic layer to speed up queries.

DuckDB brings excellent analytical performance to embedded scenarios, running entirely in-process without any configuration.

AWS Athena and Starburst offer managed query services on top of these open formats.
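To get a feel for how lightweight this generation of engines can be, here is a minimal DuckDB sketch; the sales.parquet file name is illustrative, and reading directly from S3 would additionally use DuckDB's httpfs extension.

```python
import duckdb

# DuckDB runs fully in-process: no server, no cluster, no configuration.
# It can query a Parquet file directly by path using plain SQL.
result = duckdb.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM 'sales.parquet'
    GROUP BY region
    ORDER BY revenue DESC
""").df()
print(result)
```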

The final piece is the open metadata catalog:

AWS Glue, Apache Polaris, Unity Catalog, and Gravitino provide centralized metadata management, access control, and data governance for the entire ecosystem. Now, you can have a single source of truth about your data, no matter which engine queries it.

This convergence gave rise to the Lakehouse architecture - it combines the flexibility and openness of a data lake with the performance and features of a data warehouse. You can get OLAP performance without worrying about vendor lock-in and support both BI workloads and machine learning on the same dataset.

Key takeaway: Open table formats brought ACID transactions, schema evolution, and time travel to data in open storage, and together with interoperable query engines and catalogs they enabled the lakehouse - warehouse-grade analytics without vendor lock-in.