Comprehensive Analysis of the "World Model": Definition, Path, Practice, and a Step Closer to AGI
Today's AI seems "omnipotent": it can write insightful papers and complex code, and produce top-tier images and videos. Yet it still lacks the ability to understand the world, predict it, and reason and act within it.
To solve this problem, major companies like OpenAI, Google, and Microsoft, as well as top scholars like Yann LeCun and Fei-Fei Li, have begun racing to research the same thing: world models.
Many AI scientists believe that as multimodal AI becomes widespread and mature, this technical line, if fully realized, will completely reshape the AI landscape. But we also notice that the world-model boom has brought new problems: it seems everything in AI has turned into a "world model" overnight. Video generation, robotics, autonomous driving, game development: almost anything related to the "world" now gets the label.
What exactly is a world model, and how does it differ from large language models (LLMs)? Are these seemingly distinct approaches working toward the same goal? What changes will world models bring to various industries and to society as a whole? And could they be the ultimate key to AGI?
For this video, the Silicon Valley 101 team spent months on in-depth research, interviews, and post-production effects to explain what this field—considered by many industry leaders as "the most important AI research direction for the next decade"—is really about. We hope this helps you understand the cutting-edge discussions and developments in AI. The content is a bit technical and long, so enjoy!
01 What Is a World Model?
There is still no clear, universally accepted definition of a world model. But we can start by discussing the origin of the concept and what problem it aims to solve.
Let's begin with a simple question: How do you know a glass of water on the edge of a table might fall off?
Scientists believe that humans can predict a glass will fall, which way a door opens, or a ball will roll down a slope because from an early age, we build a model in our minds of "how the world works." We can anticipate what will happen next, imagine "what if I do this," and rehearse possibilities in our heads. In cognitive science, this is called a Mental Model.
As early as the last century, scientists began studying human mental models. In 1943, Kenneth Craik proposed in his book The Nature of Explanation that before reacting to reality, humans first build a "small-scale model of the world" in their brains to simulate possible processes and then choose actions accordingly. In other words, each of us has an invisible "small world" in our minds.
Since human intelligence relies on such an internal world, many AI researchers have asked: Does a machine need its own world to have true intelligence?
Thus, this idea reappeared under different names in early AI and reinforcement learning research. For example, in 1991, Richard Sutton proposed what became known as the Dyna architecture in his paper Dyna, an Integrated Architecture for Learning, Planning, and Reacting.
The core of Dyna is that while learning an action policy, an agent should also learn a model of the world, that is, of how the world changes after it takes a given action. This was the first time the "world model" was explicitly established as a fundamental internal capability of an agent.
Since then, world models have not developed along a single path but have been continuously disassembled, strengthened, and rewritten across research fields. In reinforcement learning and robotics, the idea appears as the forward model; in automatic control and industrial systems, it evolved into Model Predictive Control (MPC).
Although these theories have different names, they share the same core assumption: An agent makes better decisions not because it reacts faster, but because it can "see the future" in its internal world before acting.
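This shared idea, simulate before acting, can be sketched as a minimal random-shooting MPC loop. Everything here is an illustrative assumption, not any specific system: the toy one-dimensional dynamics, the goal position of 1.0, and the horizon and candidate counts are all made up for the sketch.

```python
import random

random.seed(0)

def forward_model(state, action):
    """Toy internal model: given (position, velocity) and an action
    (acceleration in [-1, 1]), predict the next state. Illustrative only."""
    pos, vel = state
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    return (pos, vel)

def cost(state):
    """Distance from the goal position 1.0, plus a small velocity penalty."""
    pos, vel = state
    return abs(pos - 1.0) + 0.1 * abs(vel)

def plan(state, horizon=5, candidates=300):
    """Random-shooting MPC: imagine many action sequences inside the
    internal model, then return only the first action of the best one."""
    best_first, best_cost = 0.0, float("inf")
    for _ in range(candidates):
        seq = [random.uniform(-1, 1) for _ in range(horizon)]
        s, c = state, 0.0
        for a in seq:                 # "see the future" before acting
            s = forward_model(s, a)
            c += cost(s)
        if c < best_cost:
            best_cost, best_first = c, seq[0]
    return best_first

# Plan internally, execute only the chosen action, then replan.
# (In this toy, the real environment happens to equal the internal model;
# in practice the model is learned and imperfect.)
state = (0.0, 0.0)
for _ in range(50):
    state = forward_model(state, plan(state))
print(round(state[0], 2))
```

The agent never searches by acting in the "real" environment; all trial and error happens inside `forward_model`, which is exactly the advantage the paragraph above describes.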
For a long time afterward, world models remained mostly theoretical and algorithmic until deep learning and representation learning matured. In 2018, David Ha from Google Brain and deep learning pioneer Jürgen Schmidhuber co-published the paper World Models. This paper formally proposed the refined name "World Models" and provided a concise framework for understanding them:
World Model = Observe the World (V) + Predict the World (M) + Learn to Act in the Internal World (C), corresponding to three core modules: Vision, Memory, and Controller.
Let's use a simple example to explain. Imagine you're a beginner who has never played table tennis. When you stand at the table, your eyes receive a lot of complex visual information. The Vision module (V) doesn't remember every pixel but automatically extracts what truly matters for decision-making, compressing millions of pixels into a compact code of just a few dozen numbers.
Upon receiving this code, the Memory module (M) immediately starts internal simulation. After multiple practices, your brain has built an understanding of the laws of table tennis movement. The Memory module is like an internal "physics engine" that can predict "what will happen if I do this."
So when the ball comes, the Vision module extracts features, the Memory module simulates outcomes, and the Controller module (C) is trained mainly in the "internal world" created by the Memory module (M). You don't need to swing the racket a hundred times to learn by trial and error; instead, you search for the best strategy in the Memory module's "dream" and then execute only that plan in reality. This "imagine, plan, act" cycle is a core feature of human intelligence.
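The wiring of the three modules can be caricatured in a few lines. Note the heavy assumptions: the random linear "networks", the 8-dimensional latent, and the score function are placeholders for the trained VAE encoder, MDN-RNN, and evolved controller in the original paper; this sketch only shows how V, M, and C connect.

```python
import numpy as np

rng = np.random.default_rng(0)

# V (Vision): compress a high-dimensional observation into a small latent code z.
W_enc = rng.normal(size=(8, 64))             # stand-in for a trained encoder
def vision(obs):
    return np.tanh(W_enc @ obs)              # 64 "pixels" -> 8 numbers

# M (Memory): predict the next latent code given the current one and an action.
W_z = rng.normal(size=(8, 8)) * 0.1
W_a = rng.normal(size=(8, 1)) * 0.1
def memory(z, action):
    return np.tanh(W_z @ z + W_a @ np.array([action]))

# C (Controller): pick the action whose imagined outcome scores best.
def controller(z, n_candidates=32, horizon=3):
    def score(z_sim):                        # hypothetical reward: drive z toward 0
        return -np.linalg.norm(z_sim)
    best_a, best_s = 0.0, -np.inf
    for _ in range(n_candidates):
        a = rng.uniform(-1, 1)
        z_sim = z
        for _ in range(horizon):             # rollout inside the "dream"
            z_sim = memory(z_sim, a)
        if score(z_sim) > best_s:
            best_s, best_a = score(z_sim), a
    return best_a

obs = rng.normal(size=64)                    # one raw observation
z = vision(obs)                              # V: observe and compress
a = controller(z)                            # M + C: imagine, then choose
print(z.shape, -1.0 <= a <= 1.0)
```

Only the final chosen action `a` would ever be executed in the real environment; everything else happens in the internal world.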
In this paper, they also created an interesting demo: the model learned to play a racing game in a completely virtual small world, proving that AI can learn through imagination in its internal world, just like humans.
To summarize, researchers generally agree that world models should have three key traits:
1. Represent the World (Representation). The model can understand what's in the environment, where objects are, and the relationships between them.
2. Predict the Future (Prediction). It can simulate and generate events—what changes will occur if I push a glass, open a door, or take two steps forward.
3. Plan and Act in the World (Planning & Control). Once it can predict what will happen next, it knows how to act.
Yiqi Zhao
Product Design Lead, Meta
It abstracts the world into a latent, compressed space. In this latent space, you can use learned physical laws to predict the future, forming a simulator of the real world. It's equivalent to a simulation system, somewhat like a miniature parallel universe. It feels like if you have a real AI brain, it has its own AI worldview. Because it can make predictions, it can reason about the future and make decisions.
The essence of world models is to transform AI from a language machine that "only answers questions" into a true agent that can "observe, reason, and act" like humans. But here's the question: As a concept studied since the last century, why has it suddenly become popular recently? What's the difference or connection between it and the LLMs we're familiar with now?
02 Why Study World Models?
Chapter 2.1 Differences Between World Models and LLMs
From the perspective of main tasks and prediction targets:
The goal of LLMs is to generate the most reasonable sequence in the language dimension, predicting the next word or token. For example, if you ask "Will the glass fall off the table?", it answers "Yes" because that's the correct answer in countless texts.
The task of world models is to predict "what the world will look like in the next second", predicting the next frame, next action, or next state change. It needs to understand physical laws, spatial relationships, and dynamic changes.
From the perspective of training data:
LLMs mainly rely on text data, plus some images and video; the data is largely static content.
World models mainly rely on dynamic data such as video, including camera footage, robot sensor feedback, action outcomes, and environmental changes; the data is dynamic and sequential in time.
From the perspective of output results:
LLMs output content like language or images.
World models output predictions of future states, simulations of behaviors, and executable action plans.
From the perspective of learning methods:
LLMs understand the world indirectly through language, more like a "knowledge container."
World models understand the world directly through interaction and reasoning—they can not only "see" but also "predict" and "intervene."
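The contrast in prediction targets above can be made concrete with a deliberately tiny sketch. The weights, the three-word vocabulary, and the free-fall dynamics are all invented for illustration; the point is only that one head picks a discrete symbol while the other regresses a continuous physical state.

```python
import numpy as np

rng = np.random.default_rng(0)

# LLM-style head: score a discrete vocabulary, output the most likely next token.
vocab = ["yes", "no", "maybe"]
W_lm = rng.normal(size=(3, 4))               # toy "language model" weights
def next_token(context):
    logits = W_lm @ context
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]

# World-model-style head: predict the next continuous state of the environment,
# e.g. the position and velocity of the glass sliding off the table.
def next_state(state, dt=0.1, g=9.8):
    pos, vel = state
    return (pos + vel * dt, vel - g * dt)    # toy free-fall dynamics

tok = next_token(rng.normal(size=4))         # output: a word
st = next_state((1.0, 0.0))                  # output: numbers (position, velocity)
print(tok, st)
```

The LLM answers "will the glass fall?" with a token; the world model answers with where the glass will be in the next instant.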
Therefore, LLMs are more suitable for dialogue, writing, translation, and Q&A. World models are more suitable for tasks that must enter the real world, such as robotics, autonomous driving, physical simulation, and decision systems.
Earlier, Fei-Fei Li also succinctly summarized the differences in purpose and training modality between the two in an interview:
Fei-Fei Li
Founder of World Labs, Senior AI Scientist
One is about expression, the other is about observation and action. Therefore, they are fundamentally different modalities. The basic unit of large language models is the vocabulary—whether letters or words—while the basic unit of the world models we use is pixels or voxels.
Chapter 2.2 Has the LLM Line Hit a Bottleneck?
Although LLMs and world models are different technical routes, their ultimate goal is to achieve general artificial intelligence (AGI). So why are we suddenly paying so much attention to world models now? Is it because the LLM route has reached a dead end?
There are still different views in the research community on this question.
Some researchers state plainly that LLMs are a dead end, and one of the most prominent voices among them is Yann LeCun.
Image source: Reuters
After leaving Meta, where he worked for 12 years, the 65-year-old Turing Award winner and deep learning pioneer did not choose to retire but returned to Paris to found a company called Advanced Machine Intelligence. What he wants to build is completely different from Silicon Valley's mainstream large-model route.
In a recent interview, he said that Moravec's paradox has always existed in the AI field. Moravec