Stop Mislabeling the $1 Billion "World Model": Fei-Fei Li Shows You How to Tell Them Apart, Step by Step

"World model" is one of the most important yet most overused terms in the field of AI today.

In the past 18 months, over $10 billion in capital has flowed into world model and robotics AI companies. One notable pattern: companies that use world models have raised even more than companies that build them.

There is no doubt that world models are popular. But opinions differ widely on what the term actually means, which leaves people confused.

This morning, Fei-Fei Li and the World Labs team published a long article titled "Functional Classification of World Models". She said bluntly that "world model" has become one of the most important and overused terms in the current AI field. Last month, Henry Yin and Naomi Xia of MoE Capital also stated in their blog that most things labeled as "world models" are not real world models at all.

Against this backdrop, Fei-Fei Li's article offers a rare, clear framework. Borrowing the classic agent-environment structure from reinforcement learning, it pins down what "world model" means and sorts today's tangle of generative models, physics simulation systems, and embodied intelligence methods into three functional types: "renderers, simulators, and planners".

For an AI industry in the midst of diverging technical routes and competition for capital, this is more than a technical taxonomy; it reads like a roadmap for who will lead next. Under this classification, technical paths that developed independently can be compared within a single frame of reference for the first time. Fei-Fei Li also pointed out that the three are beginning to integrate with each other: "When their boundaries disappear, they will jointly reshape something more grand: the relationship between machine intelligence and the physical world it is in, which is the long-term evolution trajectory of spatial intelligence."

In her view, "the end goal is a unified world model: a foundation model that can render photo-realistic views, generate physically accurate structures, plan action sequences, and switch between different output modes according to downstream needs."

At the end of the article, she pointed out that "language enables machines to talk about the world. And world models will enable machines to finally understand, imagine, reason, and interact with the world." The underlying judgment is also quite clear: what really determines the upper limit of AI in the next stage is not the model that is better at "talking" but the "simulation ability" closer to physical reality.

The following is a compilation of the original text. We have edited it without changing the original meaning.

The World Is Not Composed of Language

In a previous article, we argued that spatial intelligence is the next frontier of artificial intelligence, and world models are the path to this goal. Here, the World Labs team and I hope to delve deeper: among the many things currently being built and called "world models", which functional components truly constitute this ability, and what is each part used for?

Language models endow machines with extraordinary control over concepts, vocabulary, and reasoning. However, the world itself, whether virtual or real, runs on a completely different underlying structure. Language models learn the statistical structure of text, while world models learn the statistical structure of space and time: how light falls on surfaces, what a garden looks like from an angle never captured by a camera, and how objects respond to forces and follow physical laws.

This makes "world model" one of the most important and overused terms in the current AI field. Computer vision, robotics, reinforcement learning, and generative AI all claim to be building world models, but they each refer to completely different things. A video model that can generate gorgeous but physically impossible flames, a language model that can improvise playable games, and a physics engine that faithfully simulates the combustion process are all called by the same name.

The ancient Greeks never reached an agreement on what the world is composed of, whether it is fire, water, or indivisible atoms, because the "world" has never been a single thing. It has always been a placeholder for whatever whole a thinker needs to reason about. AI has inherited the same problem, at precisely the moment when the field most needs precision.

The Cycle Under Classification

To sort out this confusion, we can start with a schema older than any of the above technologies. Reinforcement learning textbooks, including the classic works of Sutton and Barto, have used similar diagrams for decades to describe how agents interact with the world. The formal name of this diagram is the "Partially Observable Markov Decision Process" (POMDP), and the term "world model" originated in this tradition.

An agent can be a human, a robot, or a software system that takes actions. These actions affect the state of the world. The agent can never directly see the state. What it receives are observations: photons falling on the retina, sensor readings, and pixels in video frames. New observations lead to new actions, and the cycle continues.

The term "state" needs further explanation because its meaning varies in different fields. Here, it does not refer to the states in chemistry (solid, liquid, gas) but to the states in physics and robotics: a complete description of everything happening in the world at a certain moment, including every object, every position, every velocity, and every attribute. The state is the underlying reality of the world; in principle, it is complete, but it is not directly observable to any agent in it. Observations are the agent's partial views of this reality. Actions are the agent's responses to those observations.

This cycle from the agent to actions to the state, then to observations, and back to the agent forms the structural basis of the modern term "world model". The phrase itself dates back earlier, to the view put forward by Kenneth Craik in 1943 that the mind reasons by running "small-scale models" of reality; this idea was introduced into the field of neural networks from the late 1980s to the early 1990s. This cycle also explains how people use this term today: the different things called world models today are actually different projections of this cycle, each outputting different parts of it.
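The cycle described above can be sketched in a few lines of Python. This is only an illustrative toy, not anything from the World Labs article: the one-dimensional world, the names `State`, `step`, `observe`, `policy`, and `run_loop`, and the noise level are all assumptions made for the example. What it shows is the structure: the agent never reads the state directly, only a partial and noisy observation of it.

```python
import random
from dataclasses import dataclass


@dataclass
class State:
    """Full description of a toy 1-D world (hypothetical); hidden from the agent."""
    position: float
    velocity: float


def step(state: State, action: float, dt: float = 0.1) -> State:
    """World dynamics: the action acts as a force on a unit mass."""
    velocity = state.velocity + action * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def observe(state: State, rng: random.Random) -> float:
    """Partial, noisy view: the agent sees position only, never velocity."""
    return state.position + rng.gauss(0.0, 0.01)


def policy(observation: float, goal: float) -> float:
    """Agent: choose an action from the observation alone."""
    return 1.0 if observation < goal else -1.0


def run_loop(steps: int = 50, goal: float = 1.0, seed: int = 0) -> list[float]:
    """Run agent -> action -> state -> observation -> agent for a few steps."""
    rng = random.Random(seed)
    state = State(position=0.0, velocity=0.0)
    observations = []
    for _ in range(steps):
        obs = observe(state, rng)       # state -> observation
        action = policy(obs, goal)      # observation -> action
        state = step(state, action)     # action -> next state
        observations.append(obs)
    return observations
```

Note that `policy` receives only the observation. The velocity exists in the state but is invisible to the agent, which is exactly the "partially observable" part of a POMDP.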

Three Types of Functions of World Models

The first type of world model is the "renderer". The renderer outputs observations in the form of pixels for the human eye to view, and its most important indicator is visual fidelity. A video model that converts text prompts into cinematic aerial shots is a renderer. Interactive systems like Google's Genie 3 or World Labs' own RTFM, which generate images in real time based on user input, are renderers too. These models do not have an explicit understanding of three-dimensional structures. They generate "what it looks like" rather than "what it actually is". The buildings in aerial shots may look perfect from above, but once you try to drive through the city, these structures fall apart.

The second type is the "simulator". The simulator outputs the state: a representation that is geometrically, physically, or dynamically faithful to the world, which both humans and computer programs can calculate and interact with. The contract of the renderer is purely visual, while the contract of the simulator is structural. It requires that the geometry holds under inspection, the physics follows Newton's laws, and the dynamic behavior conforms to what the world should be like under physical laws. The simulator serves two kinds of users: one is human professionals, such as architects, designers, filmmakers, and game developers, who need precision beyond visual plausibility; the other is computer programs, such as reinforcement learning agents, robot controllers, and autonomous driving systems, which use the simulator as a training environment to interact with the world on a large scale and test scenarios that are dangerous, expensive, or impossible to execute in reality.

The third type is the "planner". The planner outputs actions. Given observations and goals, the planner answers what the agent should do next. In many ways, it is the reverse process of the renderer: the renderer takes actions as input to generate observations, while the planner takes observations as input to generate actions, thus closing the perception-action cycle. Vision-language-action models, model-based methods, and the new generation of World Action Models are all trying to build planners, systems that can decide what a robot should do in an unstructured world.

These three types cover most of the currently implemented systems, and this distinction is also useful in practice. However, they are not fundamentally independent of each other. The same underlying knowledge about how the world works, including geometry, physics, and dynamics, supports all of them. A model that can render a cup from any angle should, in principle, also be able to simulate what will happen when the cup is pushed and plan for a hand to pick it up. More and more interesting research is deliberately blurring the boundaries between these three.
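To make the three roles concrete, here is a minimal sketch using the cup example. Everything in it is hypothetical and invented for illustration (`CupState`, `render`, `simulate`, `plan`, and the one-line ASCII "image"); real renderers, simulators, and planners are learned models, not three-line functions. The point is only the signatures: each function outputs a different part of the same cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CupState:
    """Hypothetical state: a cup's position along a 1 m table edge, in metres."""
    x: float


def render(state: CupState, width: int = 11) -> str:
    """Renderer: state -> observation (here, a one-line ASCII 'image')."""
    col = max(0, min(width - 1, round(state.x * (width - 1))))
    return "".join("U" if i == col else "." for i in range(width))


def simulate(state: CupState, push: float) -> CupState:
    """Simulator: (state, action) -> next state."""
    return CupState(x=state.x + push)


def plan(observation: str, goal_col: int) -> float:
    """Planner: (observation, goal) -> action, using only what was rendered."""
    col = observation.index("U")
    return (goal_col - col) / (len(observation) - 1)
```

For instance, starting from `CupState(0.2)`, `render` places the cup in column 2, `plan(obs, 7)` proposes a push of 0.5 m, and rendering the result of `simulate` places the cup in column 7. All three functions rely on the same knowledge of where the cup is and how it moves, which is the sense in which the three types are not fundamentally independent.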

Why Is Simulation the Key?

Among these three types, the simulator receives the least public attention but is the most decisive of the three. This article aims to discuss this asymmetry.

The renderer is the most commercially mature. A large number of products that generate videos from images or text are rapidly expanding in the consumer and enterprise markets. Google's Nano Banana model has brought high-quality image generation capabilities to hundreds of millions of potential users. The technology is real, and the market is also real. However, the renderer optimizes for visual plausibility rather than physical accuracy, and that ceiling matters. Its outputs are beautiful but cannot be used for designing buildings or training robots.

The planner is the most exciting of the three and also the earliest in its development. It is closely related to the rapidly developing field of robot learning. In the past two years, this field has shown many impressive robot demonstration videos, but we need to be honest about the actual meaning of these demonstrations. Almost all demonstrations are limited to highly controlled laboratory environments, using a limited set of objects and short task cycles. No system has been verified in terms of the complexity, variability, or duration required for real-world deployment. There is still a huge gap between the amazing demonstrations and robots that can work reliably in kitchens, warehouses, or operating rooms. Nevertheless, the commercial investment is still huge. A group of well-funded new entrants are competing to launch general planning systems, while the largest infrastructure players are deploying planning capabilities on a broader simulation system. A robot that can plan is a robot that can work, and the entire industry is competing for this goal.

Simulation is the bridge connecting the two. If language is an abstraction of the world and pixels are a projection of the world, then geometry, physics, and dynamics are the world itself. The simulator must operate at this level: it is a structural framework from which both the visual appearance (for the renderer) and the action results (for the planner) can be derived.

A model that has mastered simulation can project its understanding into pixels for human use and into action predictions for embodied agents. A model that masters only rendering or only planning cannot do both. The commercial opportunity for simulation is huge. NVIDIA's Omniverse alone targets a potential market that the company estimates at more than one trillion dollars, covering factories, warehouses, supply chains, and digital twins. Fields such as robot training, autonomous driving testing, building visualization, engineering design, and drug discovery all rely on some form of simulation.

The most difficult open problems in this field are also concentrated here. Three-dimensional data with clean geometry, material properties, and physical annotations is much scarcer than the Internet videos that renderers rely on. The "sim-to-real" gap, that is, the difference between behavior in simulation and behavior in reality, still exists. On top of this, generative simulators introduce new risks: AI-generated geometry may look correct but contain self-intersections or scale errors, resulting in meaningless physical behavior. Large-scale multi-physics simulations of the interaction between rigid bodies, deformable objects, fluids, and fabrics are still several orders of magnitude more computationally expensive than single-domain simulations.
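As a small illustration of the "scale error" risk mentioned above, the sketch below checks whether a generated mesh has a plausible real-world size before it is handed to a physics engine. The function names, the size bounds, and the two-vertex "cup" are assumptions made for this example, not part of any real pipeline, and the check says nothing about self-intersections, which need a proper geometry library.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def bounding_box_extent(vertices: List[Vertex]) -> Vertex:
    """Return the size of the axis-aligned bounding box along x, y, and z."""
    xs, ys, zs = zip(*vertices)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def scale_is_plausible(vertices: List[Vertex],
                       min_size_m: float,
                       max_size_m: float) -> bool:
    """True if the largest bounding-box side falls within the expected range (metres)."""
    largest = max(bounding_box_extent(vertices))
    return min_size_m <= largest <= max_size_m


# Hypothetical cup, about 10 cm tall, given by two opposite corner vertices.
cup_in_metres = [(0.0, 0.0, 0.0), (0.08, 0.08, 0.10)]
# The same shape exported in centimetres but read as metres: a 10 m cup.
cup_in_centimetres = [(x * 100, y * 100, z * 100) for x, y, z in cup_in_metres]
```

With bounds of 0.05 m to 0.30 m, the first mesh passes and the second fails, even though both would look identical in a rendered image. That is the gap between visual plausibility and structural correctness in miniature.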

At World Labs, our Marble is the first step into this field. It can accept multi-modal prompts (text, images, videos, or spatial sketches), generate an explorable three-dimensional environment, and simultaneously output Gaussian splats for visual exploration and collision meshes for the physics engine. However, Marble is just a beginning. The field as a whole is on a longer trajectory, and the boundaries between rendering, simulation, and planning are gradually disappearing.

What Will Happen Next as the Boundaries Collapse?

The most important current trend in this field is that these three types are beginning to integrate with each other. The common insight is that the knowledge required to render the world, simulate the world, and act in the world is essentially the same. Continuing with the previous example, a model that truly understands how a cup is placed on a table (including its geometry, material properties, and response to forces) should be able to render the cup from any angle, simulate what will happen when it is pushed, and plan for a hand to pick it up. These three types are actually three projections of the same underlying understanding.

For example, some recent work from multiple robot laboratories has shown that, at least in concept, a pre-trained video renderer can serve as the basis for joint world and action prediction, thus building a bridge between the renderer and the planner, allowing the same model to both imagine what will happen and decide what to do. World Labs' Marble can already output both Gaussian splats and collision meshes from a single model, thus breaking the boundary between the renderer and the simulator. Each layer is shifting from passive output to interactive systems: the renderer becomes controllable by actions, the world generated by the simulator becomes more controllable and editable, and the planner shifts from simple reactions to more reasoning-based decision-making.

The logical end goal is a unified world model: a foundation model that can render photo-realistic views, generate physically accurate structures, plan action sequences, and switch between different output modes according to downstream needs. Of course, we will still face many challenges. The data distribution is extremely unbalanced: the renderer has a large amount of Internet videos, while the simulator and planner severely lack three-dimensional assets and robot demonstration data. Optimizing for visual aesthetics may sacrifice the precision required for robots or high-fidelity simulations. Reconciling these tensions in the same architecture is the central open problem in current world model research and the one World Labs is working on as it advances Marble.

The direction is very clear. Since the late 1980s, the field has been betting that as long as there are sufficiently rich world models, agents will be able to observe the world, build the world, and act in it. Today, this "big bet" is driving a new generation of research, and its power comes from the ongoing integration: three originally independent research paths, each of which has supported multi-billion-dollar industries, are starting to act as a whole. When their boundaries disappear, they will jointly reshape something more grand: the relationship between machine intelligence and the physical world it is in, the long-term evolution trajectory of spatial intelligence.

Language enables machines to talk about the world. And world models will enable machines to finally understand, imagine, reason, and interact with the world.

Reference Link

https://x.com/drfeifei/status/2062247238143996275

This article is from the WeChat official account "AI Frontline", author: Hua Wei. Republished by 36Kr with authorization.