
Introduction to the Concept of World Models: A Story That Spread from Psychology to the Main Battlefield of AI

IT桔子 · 2026-06-29 12:57
There is no unified name, but everyone is competing for it: An Introduction to World Models

The world model is currently the hottest concept in AI and, for ordinary readers, also the most confusing. Some say it is the ability that lets AI "dream," others call it a simulator for autonomous driving, and still others see it as the brain of a robot.

Fei-Fei Li, Yann LeCun, OpenAI, Google DeepMind, NVIDIA, and Chinese companies such as Alibaba, Tencent, Huawei, and several automakers each have their own definition.

This article attempts to explain in plain language:

What problems the world model aims to solve; why these scholars and big companies are so fascinated by it; and why the concept has become must-win territory for the industry even before it has a unified name.

1. Understanding in One Sentence: Let AI Pre-enact the World in a "Mental Sandbox"

Imagine you're standing at a crossroads, ready to cross the street.

Your eyes see the green light, the vehicles, and the pedestrians. Within a fraction of a second your brain constructs a mini-scenario: If I start walking now, will that car speed up? Will that cyclist suddenly turn?

You don't actually step out but first go through several possibilities in your mind.

Psychologists call this ability a "mental model," while AI researchers refer to it as a "world model."

In other words, the world model is a "mental sandbox" in the machine.

It does not simply recognize what is in the picture. It predicts what will happen next and can run trial and error repeatedly without taking any real action.

For autonomous driving, it can generate virtual tests for heavy rain, blizzards, and unusual obstacles; for robots, it can allow humanoid robots to fall tens of thousands of times in a simulated world before going out; for game and film companies, it might be a parallel universe that can be infinitely explored.

In 2026, the term "world model" appears in technology reports far more often than it is clearly defined.

Alibaba has developed Qwen-AgentWorld, HappyOyster, and Qwen-RobotWorld, targeting the language world, the virtual world, and the physical world respectively; Tencent's HY-World 2.0 emphasizes a 3D editable world; NIO, XPeng, and Li Auto prefer to talk about the "driving world model" or "world behavior model"; Huawei and Baidu rarely use the term on its own in their public materials.

The naming chaos makes the concept look like a catch-all basket.

But there's a common core behind all these names:

Let the machine build an internal environment in which it can reason forward and review outcomes before taking real action. This environment can be made of pixels, 3D structures, physical parameters, or abstract states. The goal is to reduce the open-ended reliance on real-world data by compressing the real world into a data engine that can generate, fail, and restart without limit.

The lack of a unified name is itself a sign that the world model is at an early stage of moving from academic concept to industrial infrastructure.

2. The Origin of the Idea: A WWII Psychologist and Several AI Pioneers

2.1 Kenneth Craik: The First to Mention the "Mental Mini-Model"

The idea of the world model predates deep learning by more than half a century. In 1943, Scottish psychologist Kenneth Craik proposed in his book The Nature of Explanation that the human brain constructs "small-scale models" of reality to predict and understand external events.

Craik was only 29 years old at the time. He worked at the Cambridge University Psychology Laboratory and carried out applied psychology research in the UK during World War II.

He died in a bicycle accident two years after the book was published, at the age of 31.

But the idea survived: humans do not need to replicate the world in full. They only need an internal model that is good enough to pre-enact an action before taking it.

This view is nearly identical to the core of today's AI world models. Machines, too, do not need to memorize every detail of the world. They need to learn the laws by which it operates and project the future when required.

After Craik, in the 1980s, British psychologist Philip Johnson-Laird systematized this set of ideas, arguing that a large share of human reasoning consists of manipulating "mental models" in the mind. He worked in Cambridge and taught at Princeton for many years and is an important figure in cognitive science.

2.2 Marvin Minsky: The One Who Wanted Machines to Have a Common-Sense Framework

The field of artificial intelligence had an early echo of its own. In the mid-1970s, Marvin Minsky proposed "frame theory" at the Massachusetts Institute of Technology.

He was a co-founder of the MIT AI Laboratory and the winner of the 1969 Turing Award. He is often regarded as one of the founders of artificial intelligence as a discipline.

The frame theory attempts to capture human common sense about the world with a structured knowledge framework:

When entering a door, you need to find the doorknob first; there are usually tables and chairs in a restaurant; objects fall under the influence of gravity.

What Minsky wanted is exactly what world models have not yet accomplished today: giving machines a structured common-sense library about the world that they can reason over.

2.3 David Ha and Jürgen Schmidhuber: Bringing the World Model Back to the Mainstream of Deep Learning

The field of reinforcement learning approached the same goal from another path.

In 2018, David Ha and Jürgen Schmidhuber published a paper titled Recurrent World Models Facilitate Policy Evolution at NeurIPS, bringing the term "world model" back to the mainstream of deep learning.

David Ha was working at Google Brain at the time and later co-founded Sakana AI. His style is engineering-oriented, and he is good at building striking demos from simple architectures.

Jürgen Schmidhuber is the long-time scientific director of the Swiss AI laboratory IDSIA and one of the inventors of the long short-term memory (LSTM) network. He is known in the AI field for his outspokenness and independent views. He is sometimes called the "father of modern AI"; the title is controversial, but his academic influence is beyond doubt.

Their architecture is very simple:

Use a variational autoencoder (VAE) to compress high-dimensional images into low-dimensional latent vectors, use a recurrent neural network (RNN) to learn how those vectors change over time, and then use a simple controller to train a policy in "imagination."

The agent first "dreams" inside the learned world model and then transfers the resulting policy back to the real environment.
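The three-part pipeline above can be sketched in a few lines. This is only a toy under loud assumptions, not the paper's implementation: the environment (`real_step`), the averaging encoder standing in for the VAE, the one-parameter least-squares fit standing in for the RNN, and the small grid search standing in for the paper's evolutionary optimizer are all hypothetical simplifications chosen so the example runs with the standard library alone.

```python
import random

def real_step(x, a):
    """Hypothetical real environment: a point on a line pushed by action a."""
    return x + 0.1 * a

def observe(x):
    """A redundant 8-value observation standing in for an image."""
    return [x + 0.01 * i for i in range(8)]

def encode(obs):
    """V: compress the observation to one latent value (stand-in for a VAE)."""
    return sum(obs) / len(obs)

def fit_dynamics(transitions):
    """M: least-squares fit of z_next = z + k * a (stand-in for the RNN)."""
    num = sum(a * (z_next - z) for z, a, z_next in transitions)
    den = sum(a * a for _, a, _ in transitions)
    return num / den

def act(gain, z, goal_z):
    """C: a one-parameter linear controller with clipped actions."""
    return max(-1.0, min(1.0, gain * (goal_z - z)))

def imagined_cost(gain, k, z0, goal_z, steps=20):
    """Score a controller inside the learned model, never calling real_step."""
    z, cost = z0, 0.0
    for _ in range(steps):
        z = z + k * act(gain, z, goal_z)
        cost += abs(goal_z - z)
    return cost

def train_in_imagination(seed=0):
    rng = random.Random(seed)
    # 1. Collect random experience from the real environment.
    transitions, x = [], 0.0
    for _ in range(50):
        a = rng.uniform(-1.0, 1.0)
        x_next = real_step(x, a)
        transitions.append((encode(observe(x)), a, encode(observe(x_next))))
        x = x_next
    # 2. Fit the world model on that experience.
    k = fit_dynamics(transitions)
    # 3. Choose the controller that performs best in imagined rollouts.
    z0, goal_z = encode(observe(0.0)), encode(observe(1.0))
    gain = min([0.5, 1.0, 2.0, 5.0, 10.0],
               key=lambda g: imagined_cost(g, k, z0, goal_z))
    return k, gain

def run_real(gain, steps=20):
    """Transfer the chosen controller back to the real environment."""
    x, goal_z = 0.0, encode(observe(1.0))
    for _ in range(steps):
        x = real_step(x, act(gain, encode(observe(x)), goal_z))
    return x

k, gain = train_in_imagination()
final_x = run_real(gain)
print(f"learned k={k:.3f}, chosen gain={gain}, final position={final_x:.3f}")
```

The point of the design is that step 3 never touches the real environment: every candidate controller is scored against the learned model, and only the winner is run for real.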

This paper was selected for an oral presentation at NeurIPS, directly inspiring the subsequent Dreamer series and turning the "world model" from a psychological concept into an engineering goal in deep learning.

3. World Models in the Eyes of Scholars

3.1 Yann LeCun: Don't Just Generate Videos, Understand Physics

Yann LeCun is French, a professor at New York University, and long served as the chief AI scientist at Meta.

He is one of the inventors of the convolutional neural network (CNN). In 2018, he shared the Turing Award with Geoffrey Hinton and Yoshua Bengio. The three are known as the "Three Giants of Deep Learning."

LeCun has always been critical of the current path of large language models. He believes that just predicting the next word cannot produce real intelligence.

In 2022, in an article titled A Path Towards Autonomous Machine Intelligence, he proposed that real intelligence requires a configurable predictive world model.

The goal is not to generate text or images but to understand the laws of the physical world and predict the consequences of actions. He even criticized the continuous stacking of large language models as "nonsense" and believed that the core of intelligence lies in learning the physical structure of the real world.

JEPA is the technical carrier of this route. JEPA stands for Joint Embedding Predictive Architecture.

Rather than predicting the next frame in pixel space, JEPA models how the state of the world changes in an abstract representation space.

By analogy: a video generation model is like painting the next picture, while JEPA is like "sensing" in the mind what will happen next.
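The contrast between the two prediction targets can be made concrete with a toy. The sketch below is an illustration under stated assumptions, not JEPA itself: the "frame" layout, the hand-written `encode` function (a real JEPA learns its encoder), and the constant-velocity predictor are all hypothetical. It only shows where each loss is measured.

```python
import random

def make_frame(position, rng):
    """A toy 'frame': one meaningful value plus 15 unpredictable texture pixels."""
    return [position] + [rng.random() for _ in range(15)]

def encode(frame):
    """Hypothetical encoder: keep the object's position, drop the texture."""
    return frame[0]

def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def compare_losses(seed=0, velocity=0.05):
    rng = random.Random(seed)
    frame_t = make_frame(0.30, rng)
    frame_next = make_frame(0.30 + velocity, rng)  # object moves, texture changes

    # Pixel-space prediction: every pixel must be guessed, including the noise.
    pixel_guess = [frame_t[0] + velocity] + frame_t[1:]
    pixel_loss = mse(pixel_guess, frame_next)

    # Representation-space prediction: only the next embedding is predicted.
    latent_guess = encode(frame_t) + velocity
    latent_loss = (latent_guess - encode(frame_next)) ** 2
    return pixel_loss, latent_loss

pixel_loss, latent_loss = compare_losses()
print(f"pixel-space loss:  {pixel_loss:.4f}")
print(f"latent-space loss: {latent_loss:.4f}")
```

Even with a perfect guess about the object's motion, the pixel-space loss stays well above zero because the texture cannot be predicted, while the representation-space loss is driven only by the part of the scene the encoder keeps.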

I-JEPA in 2023, V-JEPA in 2024, LeJEPA in 2025, and LeWorldModel in 2026 form a continuously evolving line of work.

LeCun also draws on the "System 1 / System 2" distinction from cognitive psychology: System 1 is intuitive, rapid response, while System 2 uses the world model for deliberate reasoning and planning.

The latest theoretical work even proves that under certain conditions, the representation learned by JEPA can establish a linear correspondence with real physical variables, that is, the model has learned the physical structure in a mathematical sense, not just a useful encoding.

3.2 Fei-Fei Li: Classifying World Models with the "Action-Observation" Loop

Fei-Fei Li is a professor of computer science at Stanford University and the main creator of the ImageNet dataset. The 2012 ImageNet competition, won by the deep network AlexNet, set off the deep-learning revolution, and she is therefore called the "Godmother of AI."

She served as chief scientist of AI at Google Cloud and founded World Labs in 2023, focusing on spatial intelligence and 3D world models. In 2024, she received multiple honors for promoting AI democratization and applications in fields such as healthcare. She is one of the most influential scientists of Chinese descent in the AI field today.

In June 2026, Fei-Fei Li and the World Labs team published a widely reposted article that attempts to establish a taxonomy for the chaotic concept of world models.

She invoked the POMDP from reinforcement learning, short for "Partially Observable Markov Decision Process."

The concept sounds complex, but it describes a very simple cycle: the agent takes an action, the action changes the state of the world, the agent receives an observation (which reveals the state only in part), and it then chooses the next action based on that observation.
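The cycle can be written out as a short loop. The sketch below is a minimal illustration, not anything from the article being described: the one-dimensional world, the noise range, and the function names are hypothetical. The key detail is that `policy` only ever receives the noisy observation, never the hidden state.

```python
import random

def transition(state, action):
    """World dynamics: the hidden state changes in response to the action."""
    return state + action

def observe(state, rng):
    """Partial observability: the agent gets a noisy reading, not the state."""
    return state + rng.uniform(-0.2, 0.2)

def policy(observation, goal):
    """Agent: choose the next action from the latest observation only."""
    if observation < goal:
        return 1
    if observation > goal:
        return -1
    return 0

def run_loop(start, goal, steps, seed=0):
    rng = random.Random(seed)
    state = start
    obs = observe(state, rng)
    history = []
    for _ in range(steps):
        action = policy(obs, goal)         # agent acts on what it saw
        state = transition(state, action)  # the action changes the world
        obs = observe(state, rng)          # the world returns a new observation
        history.append((action, state, obs))
    return history

for action, state, obs in run_loop(start=0, goal=5, steps=8):
    print(f"action={action:+d}  state={state}  observation={obs:.2f}")
```

Because the reading is noisy, the agent in this toy keeps correcting around the goal instead of settling on it, which is the kind of behavior partial observability produces.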

She pointed out that all systems called world models are essentially projections of this cycle in different directions, and each type only outputs a segment of the cycle.

Based on this, she divided world models into three categories.

The first category is the renderer, which outputs observations, that is, pixels for the human eye to view. Typical representatives are video generation models and Google Genie 3, and the optimization goal is visual fidelity.

The second category is the simulator, which outputs states, that is, a faithful representation of the world at the geometric, physical, and dynamic levels. Typical representatives are NVIDIA Omniverse and World Labs' Marble, and the optimization goal is structural accuracy.

The third category is the planner, which outputs actions; that is, it answers "what to do next" given an observation and a goal. Typical representatives are vision-language-action (VLA) models and World Action Models.
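One way to read the three categories is as three function signatures cut from the same loop. The sketch below is an interpretation under assumptions, not World Labs' formalism: the type aliases, the 8-pixel one-hot "image," and the toy implementations are all hypothetical.

```python
from typing import Callable, List

State = float             # what a simulator tracks
Observation = List[float] # what a renderer produces
Action = float            # what a planner produces

Renderer = Callable[[State], Observation]     # state -> pixels
Simulator = Callable[[State, Action], State]  # (state, action) -> next state
Planner = Callable[[Observation, int], Action]  # (observation, goal) -> action

def render(state: State) -> Observation:
    """Toy renderer: an 8-pixel row with the agent's cell lit."""
    return [1.0 if i == round(state) else 0.0 for i in range(8)]

def simulate(state: State, action: Action) -> State:
    """Toy simulator: move along the row, clamped to its edges."""
    return max(0.0, min(7.0, state + action))

def plan(observation: Observation, goal: int) -> Action:
    """Toy planner: step toward the goal cell using pixels only."""
    position = observation.index(1.0)
    if position < goal:
        return 1.0
    if position > goal:
        return -1.0
    return 0.0

def cycle(state: State, goal: int, renderer: Renderer,
          simulator: Simulator, planner: Planner) -> State:
    """One turn of the loop: render, plan, then simulate."""
    return simulator(state, planner(renderer(state), goal))

state = 2.0
for _ in range(6):
    state = cycle(state, 5, render, simulate, plan)
print(state)
```

Each function covers one segment of the cycle on its own, and chaining all three closes the loop, which is the sense in which the categories share one underlying structure.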

Fei-Fei Li believes that the knowledge underlying these three capabilities is the same, and that the ultimate trend is towards a unified world model.

3.3 Tsinghua FIB-Lab: There Are Only Two Types of World Models, Understanding the World or Predicting the Future

The FIB-Lab at Tsinghua University is a team that has long been researching general artificial intelligence, embodied intelligence, and robot learning. FIB is usually understood as a laboratory related to "future intelligence and the brain" and is affiliated with the Institute for Artificial Intelligence at Tsinghua University.

This team has published a large number of reviews and papers in the fields of world models and robots and is one of the important forces in domestic research in this direction.

In 2026, they published a review titled Understanding World or Predicting Future? A Comprehensive Survey of World Models, which divides the field in another way.

They divided the core functions of world models into two major categories: understanding the world and predicting the future.

Understanding the world emphasizes constructing an implicit representation of the external environment to support decision-making, with the Dreamer series and world knowledge based on large language models as representatives.

Predicting the future emphasizes explicitly generating future states, with typical examples being video or 3D environment generation models such as Sora, Genie 3, and Cosmos.

The advantage of this classification is that it is closer to engineering practice: the former serves reinforcement learning and decision-making, while the latter serves generation and simulation.

3.4 Peking University OpenWorldLib: Creating a Standardized Toolbox for World Models

In April 2026, Peking University, in collaboration with institutions such as Kuaishou, released OpenWorldLib. Peking University is a major base for basic artificial-intelligence research in China, home to institutions such as the Key Laboratory of Machine Perception (Ministry of Education). Kuaishou is a Chinese short-video giant that has invested heavily in large models and multimodal generation in recent years.

The joint release of OpenWorldLib by the two shows that both the academic and industrial circles have begun to realize that world models need unified standards and reusable components.

OpenWorldLib made the first attempt to give a standardized definition of the world model: a model or framework centered on perception, with interaction and long-term memory capabilities, used to understand and predict the complex world.

They criticized equating the world model with "predicting the next frame" as too narrow, and argued that a real world model must reflect a true understanding of physical laws.

OpenWorldLib divides the world model into five core modules: operator, synthesis, reasoning, representation, and memory, which are then coordinated by a pipeline module.

This framework is more like a toolbox, aiming to allow different research teams to combine modules like building with Lego bricks.

4. World Models in the Eyes of Big Companies

4.1 OpenAI: Sora is a "World Simulator"

OpenAI is one of the most influential AI companies in the world. It is well known for the GPT series of large language models and for ChatGPT. With the release of Sora in 2024, it once again drew global attention, this time to video generation and world simulation.

In February 2024, OpenAI released a technical report on Sora titled Video Generation Models as World Simulators, directly positioning the video generation model as a world simulator. Sora does not rely on explicit 3D modeling or a physics engine. Instead, a generative model is trained on large-scale video data, and capabilities such as 3D consistency, long-range coherence, object persistence, and simple interaction with the world emerge from that training.

OpenAI believes that scaling up video generation models is a very promising path towards a general simulator of the physical world.

But Sora's limitations are also obvious: it cannot accurately simulate basic physical processes such as glass breaking, long-duration samples develop inconsistencies, and objects may appear spontaneously. So it is more a statement of direction than a mature definition.

4.2 Google DeepMind: Genie 3 is a Real-Time, Interactive General World Model

Google acquired the British AI company DeepMind in 2014 and in 2023 merged it with the Google Brain team to form Google DeepMind. Demis Hassabis is a DeepMind co-founder and the CEO.

DeepMind has developed milestone systems such as AlphaGo and AlphaFold and is one of the frontiers of global AI research. Demis Hassabis himself is a computer scientist, neuroscientist, and game designer, and has long been concerned with general artificial intelligence.

In August 2025, Google DeepMind released Genie 3, which is officially defined as "the first real-time, interactive, and realistic world model."

It can generate an explorable 3D environment from a simple text description, at a frame rate of 20-24 fps, supporting character control, promptable world events, and up to one minute of interaction memory. Genie 3 uses an autoregressive method to generate frames one by one, anchors to the real world based on Google Maps street-view data, and is positioned as a key milestone towards AGI.
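To make "autoregressive, frame by frame, with about a minute of memory" concrete, here is a toy rollout loop. It is not Genie 3's implementation or API: `next_frame`, the integer "frames," and the reading of memory as a sliding window of recent frames are all hypothetical. The only figures taken from the text are the 24 fps upper bound and the one-minute memory.

```python
from collections import deque

FPS = 24                               # upper end of the quoted frame rate
MEMORY_SECONDS = 60                    # "up to one minute" of memory
CONTEXT_FRAMES = FPS * MEMORY_SECONDS  # 24 * 60 = 1440 frames of context

def next_frame(context, action):
    """Hypothetical generator: a 'frame' is just the agent's x position."""
    return context[-1] + action

def rollout(actions, first_frame=0):
    """Generate one frame per user action, conditioning on a sliding window."""
    context = deque([first_frame], maxlen=CONTEXT_FRAMES)
    frames = []
    for action in actions:
        frame = next_frame(context, action)  # depends on past frames and input
        context.append(frame)                # oldest frame drops out when full
        frames.append(frame)
    return frames, context

frames, context = rollout([1] * 2000)
print(len(frames), len(context))
```

Under these assumptions, anything older than 1,440 frames is no longer visible to the generator, which is one way to picture why consistency over longer horizons is hard for a model of this kind.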

4.3 NVIDIA: Cosmos is the "World Foundation Model" for Physical AI

NVIDIA was founded in 1993 by Jensen Huang, Chris Malachowsky, and Curtis Priem, and Jensen Huang has long served as the CEO. The company initially started with graphics chips (GPUs) and has become the core supplier of global AI infrastructure in the past decade due to the explosive demand for computing power in AI training.

Jensen Huang has frequently made judgments such as "physical AI" and "the next wave of AI is robots" in recent years, and NVIDIA has continuously launched software and hardware platforms for robots, autonomous driving, and simulation.

In January 2025, NVIDIA released Cosmos, which is positioned as a "world foundation model platform." It is not a single model but a family of physics-aware video models that predict and generate the future states of virtual environments, offered in three tiers (Nano, Super, and Ultra) and trained on