Unveiling Large Models: Analyzing Fei-Fei Li's Three Categories of World Model Products to Grasp the Next Trillion-Dollar Track in AI

Everyone is building their own "small world model".

If the most important keyword in the AI industry over the past three years was "Large Language Model (LLM)", then since the start of 2026 another term has been appearing frequently in the headlines of investment institutions, robotics companies, and technology media: World Model.

Recently, Fei-Fei Li, a professor at Stanford University, a well-known scientist in the field of AI, and the founder of World Labs, published a long article defining and systematically classifying the World Model.

The World Model has evolved from a technical concept to a financing concept. Although this concept may seem new, in fact, many people are already engaged in research and development related to the World Model, and some have launched mature products.

Why did the concept of the World Model suddenly become a huge hit in 2026?

The "World Model" has existed for many years in the fields of reinforcement learning and robotics research.

In the past year, it has become a hot topic pursued by the capital market and industry. Some investors even believe that the World Model may eventually be no less important than today's Large Language Model. Compared with the World Model, which can understand and predict the laws governing the physical world, the Large Language Model seems "ungrounded."

In fact, the sudden popularity of the World Model stems precisely from the success of the Large Language Model.

Around 2023, many people believed that as long as data volume, computing power, and parameter scale continued to increase, models like ChatGPT would eventually lead to Artificial General Intelligence (AGI). However, by 2025, people began to notice an increasingly obvious problem: although Large Language Models are becoming smarter, they still face a ceiling.

Where is the ceiling? We use artificial intelligence to write articles and make summaries, and complete complex reasoning with a vast amount of knowledge. However, when the problems involve the spatial, physical, and motion laws of the real world, artificial intelligence seems a bit unreliable.

The problem probably lies in the fact that "artificial intelligence" lacks the ability to understand the real world.

What we need at this time is so-called "Spatial Intelligence", and the World Model is regarded as one of the important paths to achieve spatial intelligence.

The World Model can be understood as a "real-world simulator" in the mind of AI. When a child sees a cup placed on the edge of a table, even if the cup hasn't fallen yet, they can roughly predict that the cup is likely to fall, break when it hits the ground, and spill the water inside.

This ability to predict future results based on the current state and understand the operation laws of the real world is a manifestation of intelligence. What the World Model does is to learn the laws about space, time, motion, and causality.

In their long article defining and explaining the World Model, Fei-Fei Li and the World Labs team propose three product forms of the World Model:

The first type is the "Renderer", which is best at answering the question: What does the world look like?

Today's well-known AI video generation models basically belong to this category. When users input a piece of text, the system can generate movie-level video footage. In terms of visual effects, they are already very impressive, even to the extent of being able to deceive the eye.

However, the problem is that these models understand "what it looks like" rather than "what it actually is." An AI-generated aerial view of a city may look extremely real, but if a car were actually driven through it, the building structures might immediately expose problems, because the model focuses on visual plausibility rather than physical plausibility.

The second type is the "Simulator", which focuses on the underlying structure of the world.

The output of the simulator is not just pictures, but state information at the geometric, physical, and dynamic levels. For architects, designers, and game developers, this means they can perform real-world calculations; for robots and autonomous driving systems, it means they can be trained and tested in a virtual environment.

For example, problems such as whether a bridge will deform, whether a robot will hit an obstacle, and how a car will drive in different weather conditions can all be solved by the simulator.

The third type is the "Planner", which no longer focuses on what the world looks like or how it operates, but rather: What should we do next?

For a robot, it needs to decide whether to move forward, turn left, or reach out to grab something; for an autonomous driving system, it needs to determine when to brake, change lanes, or overtake.

The output of the planner is the action itself, so it is also an important link connecting perception and action.
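The division of labor among the three product forms can be sketched in a few lines of code. The example below is only an illustrative toy, not anything taken from the World Labs article: the one-dimensional track, the `State` class, and the function names `render`, `simulate`, and `plan` are all hypothetical. The renderer returns a picture, the simulator returns the next state, and the planner returns an action, which it picks by trying each option in the simulator.

```python
from dataclasses import dataclass

WIDTH = 8  # length of the hypothetical 1-D track


@dataclass(frozen=True)
class State:
    position: int  # cell currently occupied by the robot
    goal: int      # cell the robot wants to reach


def render(state: State) -> str:
    """Renderer: answers 'what does the world look like?'"""
    cells = ["."] * WIDTH
    cells[state.goal] = "G"
    cells[state.position] = "R"  # the robot is drawn on top of the goal
    return "".join(cells)


def simulate(state: State, action: int) -> State:
    """Simulator: answers 'what state results from this action?'"""
    new_position = min(max(state.position + action, 0), WIDTH - 1)
    return State(position=new_position, goal=state.goal)


def plan(state: State) -> int:
    """Planner: answers 'what should we do next?'

    Each candidate action is tried in the simulator, and the one whose
    predicted state lies closest to the goal is returned.
    """
    actions = (-1, 0, 1)  # step left, stay, step right
    return min(actions, key=lambda a: abs(simulate(state, a).position - state.goal))


state = State(position=2, goal=5)
frames = [render(state)]
while state.position != state.goal:
    state = simulate(state, plan(state))
    frames.append(render(state))

print("\n".join(frames))
```

Note that only the planner's output is an action; the renderer's frames are useful to a human viewer but play no role in the decision.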

In contrast, the "Large Language Model" cannot solve problems related to spatial intelligence. Take the scenario of a child seeing a cup on the edge of a table: we can ask the Large Language Model to make a prediction, and its conclusion may well be correct, for example that the cup will fall and the water will spill. However, this answer comes only from its training data; it has been taught this result through repeated training.

What the Large Language Model learns is only the statistical laws between texts.

The World Model, by contrast, asks what will actually happen in the physical world, and attempts to learn the regularities that hold across space and time.

When a glass falls from the edge of a table, the Large Language Model may answer "the glass will break" based on a large amount of text experience it has seen in the past; while the World Model will internally simulate the material, weight, speed, force, and collision process of the glass, and then deduce the final result.

The former is closer to statistical inference, while the latter is closer to physical simulation.
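To make the "physical simulation" side of this contrast concrete, here is a minimal sketch of how the falling glass could be deduced rather than recalled. It is a simplification under stated assumptions: air resistance is ignored, the table height of 0.75 m is an arbitrary choice, and the breaking threshold `break_speed` is an invented placeholder, not a measured property of glass.

```python
import math

G = 9.81  # gravitational acceleration near Earth's surface, in m/s^2


def simulate_fall(height_m: float, dt: float = 0.001) -> tuple[float, float]:
    """Step a dropped object forward in time until it reaches the floor.

    Returns (fall_time_s, impact_speed_m_per_s). Air resistance is ignored.
    """
    y, v, t = height_m, 0.0, 0.0
    while y > 0.0:
        v += G * dt  # gravity increases the downward speed
        y -= v * dt  # the object moves down by speed * time step
        t += dt
    return t, v


def will_break(impact_speed: float, break_speed: float = 3.0) -> bool:
    """Assumed rule: the glass breaks if it lands faster than break_speed.

    The default threshold is a made-up illustrative value.
    """
    return impact_speed >= break_speed


table_height = 0.75  # metres, an assumed table height
fall_time, impact_speed = simulate_fall(table_height)
analytic_speed = math.sqrt(2 * G * table_height)  # closed-form check: v = sqrt(2gh)

print(f"fall time: {fall_time:.3f} s")
print(f"simulated impact speed: {impact_speed:.2f} m/s")
print(f"analytic impact speed:  {analytic_speed:.2f} m/s")
print(f"breaks: {will_break(impact_speed)}")
```

The conclusion here follows from the state of the glass (height, speed, acceleration), so changing the table height changes the answer, which is not guaranteed for an answer recalled from text.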

Many researchers have begun to refer to them as "Language Intelligence" and "Physical Intelligence" respectively. Their relationship is not one of competition and they cannot replace each other; instead, they are more like parallel concepts.

Fei-Fei Li said at the end of the article: "Language enables machines to talk about the world, and the World Model will enable machines to finally understand, imagine, reason, and interact with the world."

The Large Language Model has helped AI enter the digital world, while the World Model attempts to open the door to the real world.

The World Model has long existed, but why has it only been defined now?

Promoting spatial intelligence and the World Model is not that simple.

In the case of the "spilled water from a cup", if we need to build a World Model, we must simulate the material, weight, speed, force, and collision process of the cup. However, real-world physical data has not been fully recorded in the computer world.

Without enough data, we cannot simulate how the water in the cup flows. So, this path is a long and arduous one.

However, humans are not just starting out.

Before Fei-Fei Li redefined the "World Model", humans had already been doing similar things. In other words, the reason why the concept of the World Model has sparked a huge discussion today is not because it was suddenly invented, but because many capabilities that were originally scattered in different industries and disciplines are being discussed in the same framework for the first time.

The greatest contribution of Fei-Fei Li's article is to provide a unified classification method for these previously unrelated technical routes. According to Fei-Fei Li's definition, the process of establishing the world state, predicting future states, and inferring the consequences of actions essentially belongs to the category of the World Model.

Many practitioners like to joke: "We've been simulating the world for decades, and now the AI circle has finally recognized our value."

For example, engineering simulation software helps enterprises conduct fluid simulation, aircraft simulation, engine simulation, and structural analysis. Products cannot be developed without first simulating on a computer whether an aircraft's wings will deform during take-off, whether the engine will fail under extreme temperatures, and whether a bridge can remain stable in strong winds.

The same is true in the field of digital twins, where practitioners build virtual factories, virtual cities, and virtual production lines.

The difference is that in the past, simulators highly relied on manual modeling. Engineers needed to manually input building dimensions, mechanical structures, material parameters, and various physical rules, and then gradually build a digital world.

What AI is trying to do today is different. Perhaps just by shooting a video, the model can automatically generate a three-dimensional space with geometric structures and physical properties. Therefore, many investment institutions have begun to pay attention to the World Model. The real new thing is not the simulation itself, but that AI is starting to take over the simulation process.

The "Renderer" in Fei-Fei Li's definition can even be considered a mature industry. Well-known products such as Midjourney, Sora, and Veo essentially belong to the Renderer category. They are good at answering the question "what does the world look like". In terms of effect, they are so mature that ordinary users can hardly distinguish between real and AI-generated images.

However, according to Fei-Fei Li's classification, the Renderer has inherent limitations: it solves the problem of "looking like" rather than necessarily understanding "what it actually is". In a beautiful AI-generated city, the spatial relationship between buildings may be unreasonable, and physical rules may not hold.

Fei-Fei Li repeatedly emphasizes the importance of the Simulator in her article. For robots, autonomous driving, and industrial systems, simply having visual realism is far from enough. They need a world that can be calculated, inferred, and verified.

As for the Planner, it is not something new either. The robotics industry has been researching planning problems for many years. For example, when Tesla's humanoid robot Optimus sees a cup through its camera, it needs to decide whether to take a step forward, extend its right hand, or adjust its body center of gravity.

This is even more so in the field of autonomous driving. Every car manufacturer invests heavily in research and development to predict whether the vehicle in front will brake, whether pedestrians will suddenly cross the road, and whether bicycles will change lanes, and then to decide what to do next.

Over the past few decades, fields such as gaming, aviation, autonomous driving, robotics, construction, the military, and weather forecasting have each had their own simulators.

Everyone is building their own "small-scale World Model".

Now, spatial intelligence may enable more data to be stored and linked. Each field needs to understand spatial and physical laws, and there may be a common underlying ability behind them.

Perhaps there can really be an idealized model that can predict physical changes and infer actions, providing a training environment for robots, a testing platform for autonomous driving, and serving game, construction, and industrial design.

Let's talk about the research related to the World Model from an academic perspective.

As mentioned earlier, weather forecasting has its own simulator. In fact, fluid mechanics is a typical example of studying the laws of the world.

Given certain pressure, temperature, and boundary conditions, fluid mechanics scientists can predict how air will flow; given the flow velocity of a river and terrain conditions, they can predict how a flood will spread. Without their own "World Model", weather forecasting would be impossible.

It's similar in celestial mechanics. When the positions and velocities of the Sun, Earth, and Moon are known, scientists can calculate their orbits for the next few decades or even centuries.

However, the biggest difference between these disciplines and today's World Model lies in the method. The former relies on equations written by humans, which are the results of hundreds of years of research by scientists. In contrast, today's World Model allows machines to learn the laws from a vast amount of data on their own, rather than having humans tell the machines how the world works.
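The difference in method can be illustrated with free fall. In the sketch below, `law_from_equation` is the human-written route (the textbook formula d = ½·g·t²), while `fit_law_from_data` is never told the value of g and recovers it from noisy observations by least squares. This is a deliberately tiny stand-in under stated assumptions: the observations are synthetic, the noise level is invented, and the functional form d = k·t² is still supplied by hand. A real learned World Model would have to discover far richer structure, typically with neural networks trained on video or sensor data.

```python
import random


def law_from_equation(t: float, g: float = 9.81) -> float:
    """Human-written physics: distance fallen after t seconds, d = 0.5 * g * t^2."""
    return 0.5 * g * t * t


def fit_law_from_data(times: list[float], distances: list[float]) -> float:
    """Learn the coefficient k in d = k * t^2 by least squares.

    No value of g is provided; k is estimated purely from the observations.
    """
    numerator = sum(d * t * t for t, d in zip(times, distances))
    denominator = sum(t ** 4 for t in times)
    return numerator / denominator


random.seed(0)  # fixed seed so the synthetic observations are reproducible

# Synthetic "measurements": true distances plus assumed sensor noise (2 cm std).
times = [0.1 * i for i in range(1, 21)]
observations = [law_from_equation(t) + random.gauss(0.0, 0.02) for t in times]

k = fit_law_from_data(times, observations)
learned_g = 2 * k  # since d = 0.5 * g * t^2, the learned coefficient k equals g / 2

print(f"g recovered from data: {learned_g:.3f} m/s^2")
```

The equation-based route needs a scientist to supply both the formula and the constant; the data-driven route trades that prior knowledge for a need for many observations.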

Who will benefit from the

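The views expressed in this article are solely those of the author; the 36Kr platform provides only information storage services.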
