HomeArticle

Understand the Types of Data Modeling in One Article

王建峰2025-08-25 21:12
Understand the types of data modeling in one article.

If you're still directly extracting data, such as using "Select *" from logs, and expecting your analysis to keep up, it's time to reconsider your approach. Scalable and reliable data engineering, the kind practiced by top tech companies like Airbnb, Netflix, and Uber, relies on more than just basic schemas. It starts with data modeling techniques that can optimize pipelines, speed up queries, and make your engineering work smoother and more maintainable.

Data modeling is the blueprint for transforming raw, messy data into a reliable and scalable system. It allows you to design data pipelines that are easy to maintain, have fast query speeds, and are trustworthy!

In this article, you'll learn about powerful data modeling strategies used by experienced data engineers to streamline chaotic processes. These patterns are crucial for building robust data systems, whether from scratch or improving existing workflows.

1. Primary Key

Suppose you run a small ice cream shop and keep a list of customers. You might have multiple customers named Alex, but you don't want to confuse them when recording sales. So, you assign each customer a unique ID number:

Here, Customer_Key is the primary key – it's the column that uniquely identifies each row in this table. Even if two people have the same name, their Customer_Key will always be different.

Therefore, in a database,

A primary key is a column or a combination of columns in a table used to uniquely identify each row in that table.

Key Rules:

Uniqueness – No two rows can have the same primary key value.

Non-null – The primary key must always have a value (it cannot be null).

Stability – It should not change frequently.

2. Foreign Key

In our ice cream shop, we've already given each customer a unique ID in the customer list – their Customer_Key.

Now, in our sales records where we record each purchase, we don't have to write "Alice Johnson" or "Bob Smith" every time. Instead, we just write their Customer_Key.

Sales Table

And we store the customer details in another table:

Customer Table

Here, Customer_Key is the foreign key that links the sales table to the customer table.

A foreign key is a column in one table that matches the primary key in another table, thus establishing a link between the two.

Key Rules

Match Primary Key – Every value in the foreign key column must exist in the primary key column of the referenced table (unless NULL is allowed).

Can be Repeated – Unlike the primary key, the same foreign key value can appear in multiple rows (e.g., the same customer makes multiple purchases).

Can be NULL – If the relationship is optional, the foreign key can be null.

Now that we understand primary keys and foreign keys, let's take a deeper look at how they work together.

3. Fact Table

In our ice cream shop, every time someone buys ice cream, you write it down in a notebook:

Date: 2025–08–13

Flavor: Chocolate

Quantity: 2 scoops

Price: $6

This notebook is basically your fact table – it's where you record all measurable events (in this case, sales).

In the database world, a fact table stores quantitative, measurable data – data that can be counted, summed, or averaged.

So, if we create a fact table based on the sales data from the ice cream shop notebook, it might look like this:

A fact table representing sales.

Here, Date_Key, Product_Key, and Customer_Key are foreign keys that link to the date, product, and customer tables respectively. Quantity and Sales_Amount are the facts – the actual numbers we want to measure and analyze.

So far, let's draw a diagram. Later, when we introduce dimension details, SCDs, and schema types, we can continue to expand this diagram:

Date_Key ------ Product_Key ------ Customer_Key .........(Foreign Keys)\ |/\ | /Sales_Fact Table

4. Dimension Table

We have a sales fact table where each row records a sale using foreign keys:

| Date_Key | Product_Key | Customer_Key | Quantity | Sales_Amount || --------- | ------------ | ------------- | -------- | ------------- || 20250813 | 12 | 301 | 2 | 6.00 |

But these key values are just numbers. To really understand the sales, we need to know:

What date does 20250813 represent?

What product does 12 represent?

Who is customer 301?

This is where dimension tables come in – they hold all the descriptive details.

A dimension table is a table in a data warehouse that stores descriptive attributes (text or categorical information) about business entities.

For example: Customer Dimension Table:

| Customer_Key(PK) | Name | City | Loyalty Status || 301 | Alice Johnson | New York | Gold |

| 302 | Bob Smith | Boston | Silver |

Customer_Key is the primary key here.

This table describes the customers but does not contain sales figures.

So, let's draw another diagram that includes the fact table, dimension tables, primary keys, and foreign keys.

The ice cream shop data warehouse – fact table and linked dimensions.

Now that we've covered the basics, let's imagine a scenario where we want to track customers and their favorite flavors. As time passes, some customers change their favorite flavors, and we need to decide how to handle these changes in the dimension table. This is where SCDs come in.

5. Slowly Changing Dimensions (SCD)

In a data warehouse, dimension tables often contain attributes that change over time. Properly managing these changes is crucial for accurate reporting and analysis. Based on how dimension attribute changes are handled, SCDs can be divided into different types. The most common types are Type 1, Type 2, and Type 3.

Suppose Bob changes his favorite flavor from strawberry to mint. Let's see how each SCD type handles this:

| Customer ID | Name | Favorite Flavor |

| 101 | Alice | Vanilla || 102 | Bob | Strawberry || 103 | Charlie | Chocolate |

a. SCD Type 1 — Overwrite.

Type 1 simply overwrites the old value with the new one.

No historical records are kept, so we can only see the latest information.

Use it when historical data is not important.

So, the table after Bob's change (Type 1):

SCD Type 1: Bob's previous favorite flavor "strawberry" disappears.

b. SCD Type 2 — Add a New Row.

Type 2 adds a new row for each change.

This allows for complete historical tracking.

It usually includes StartDate, EndDate, or CurrentFlag to identify active records.

So, the table after Bob's change (Type 2):

SCD Type 2: Now we know Bob's previous flavor and the period during which he liked it.

c. SCD Type 3 — Add a New Column.

Type 3 retains limited historical records by adding a new column for the previous value.

It can only track one previous change.

It's simpler than Type 2 but does not retain a complete history of changes.

So, the table after Bob's change (Type 3):

SCD Type 3: We can see Bob's current and previous flavors, but the old history beyond that cannot be tracked.

6. Data Warehouse Schemas

We've covered how to manage changing data. Suppose our ice cream shop wants to analyze sales, track customer preferences, and efficiently manage inventory. To do this, we need a structured approach to organize data to answer questions like:

Which flavor is the most popular?

Which customers buy the most chocolate ice cream?

Do we have enough strawberry ice cream in stock this week?

This is where schemas and data modeling come in.

In the context of a data warehouse, a schema is like a blueprint that defines how data is organized. It shows the relationship between fact tables (where transactions or events are recorded) and dimension tables (which describe entities such as customers, products, or locations).

Think of it as a map: the schema tells you where everything is and how they are connected, so you can quickly find insights.

Why Schemas?

Now that we have the basics, we can explore different schema types – star, snowflake, and galaxy – and see how our ice cream shop can use each type to answer business questions.

a. Star Schema.

Imagine our ice cream shop wants to understand its sales. Questions like "Which flavor of ice cream sells the best?" or "How many ice creams did Alice buy this week?" should be answered quickly. The star schema is perfect for these types of questions because it organizes data in a simple and query - efficient way.

In the star schema, the sales data is at the center in what we call the fact table. This table records all transactions – who bought what, how much they paid, and when. Surrounding this fact table are dimension tables that describe the detailed information of each entity, such as customers or flavors. We can think of it as a star:

The fact table is the center, and the dimension tables are the points radiating outwards.

In our ice cream shop example, the "Sales" fact table records transactions, while the "Customer" table and "Flavor" table provide descriptive context. For example, if the ice cream shop wants to know how many chocolate ice creams Charlie bought, it just