Fei-Fei Li's Manifesto for World Models
"The world is all that is the case."
In 1921, Ludwig Wittgenstein opened his Tractatus Logico-Philosophicus with this famous line. A century later, Fei-Fei Li, one of the leading figures in AI, quoted it at the start of her latest technical blog post.
In deep learning, people have grown used to AI's dominance in language over the past three years, ever since ChatGPT gave machines abilities in expression, programming, and reasoning that on some tasks rival or exceed those of humans.
However, behind this digital miracle lies a blind spot that is often overlooked: machines can talk about the world, but they know nothing about its physical nature. Fei-Fei Li's blog post reads like a sobering wake-up call.
Today, with generative AI an essential tool worldwide, the industry's definition of the "world model" is becoming increasingly chaotic. From video generation to embodied intelligence, companies are vying for the right to interpret the concept.
After Fei-Fei Li published this blog post, many people thought she was trying to reclaim the right to define the "world model." I think the opposite: what Fei-Fei Li really wants is to issue a declaration that the world is not made of language, but of strict laws of physical space and time.
If machines are to truly enter the human physical world, they must leave the comfort zone of text statistics and learn to understand the refraction of light, the inertia of objects, and the logic of collisions. This is not only a shift in technical paradigm but also the unavoidable path for AI toward embodied intelligence.
01
People Need a Classification Scheme
It must be admitted that in the AI vocabulary, "world model" has become a catch-all term: any project that generates images or simulates environments seems to be associated with it. This ambiguity stems from the many different things people want the word "world" to mean.
When a technology is just starting out, there is naturally no unified rule that confines it within a clear boundary. This kind of confusion over definitions is not uncommon in history. When ancient Greek philosophers debated whether the essence of the world was water, fire, or indivisible atoms, they were really looking for a foundation for their reasoning.
AI faces the same problem: when the output of a video-generation model looks extremely realistic but is completely impossible under physical laws, how should people define it? Fei-Fei Li's blog points to an old and robust basis for a definition: the Partially Observable Markov Decision Process (POMDP).
The POMDP is also a standard formalism in reinforcement learning. It describes the closed loop of an agent's interaction with the physical world: the agent takes an action, which changes the state of the world. However, the agent has no god's-eye view of that state and can only build a partial picture of reality through observations.
The so-called world model is essentially an abstract model of the world that the machine builds in its "brain" in order to survive inside this closed loop. If any link in the loop is not clearly defined, the so-called world model is still just a blind stack of pixels.
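To make the action, state, observation loop concrete, here is a minimal sketch in Python. It is not taken from Fei-Fei Li's blog: the `World`, `Belief`, and `run_episode` names, the one-dimensional motion, and the noise-free position sensor are all assumptions chosen for illustration. The world keeps a hidden velocity that the agent never observes; the agent's internal model has to infer it from successive position readings.

```python
from dataclasses import dataclass


@dataclass
class World:
    """True state of the toy world; the agent never reads it directly."""
    position: float = 0.0
    velocity: float = 0.0

    def step(self, action: float, dt: float = 1.0) -> float:
        """Apply an action (an acceleration), advance the state, return an observation."""
        self.velocity += action * dt
        self.position += self.velocity * dt
        return self.position  # partial observation: velocity stays hidden


@dataclass
class Belief:
    """The agent's internal estimate of the hidden state (its toy 'world model')."""
    position: float = 0.0
    velocity: float = 0.0

    def update(self, observation: float, dt: float = 1.0) -> None:
        # Infer the unobserved velocity from two consecutive position readings.
        self.velocity = (observation - self.position) / dt
        self.position = observation


def run_episode(actions):
    """Run the action -> state -> observation -> belief loop once per action."""
    world, belief = World(), Belief()
    for action in actions:
        observation = world.step(action)
        belief.update(observation)
    return world, belief


world, belief = run_episode([1.0, 0.0, -0.5])
print(world.velocity, belief.velocity)
```

In this idealized setting the belief tracks the hidden velocity exactly. With a noisy sensor or unmodeled forces the two would drift apart, which is exactly where a better world model earns its keep.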
02
The Three Pillars of Building Intelligence
This closed loop sounds simple, and the function of each link is easy to understand. Look closely, however, and it hides countless details that are poorly defined. To cut through the confusion, Fei-Fei Li breaks the world model down into three core components. They are not only technical categories but also the three pillars on which AI can reach embodied intelligence.
1. Renderer
The core logic of the renderer is visual plausibility. Its output is pixels, and its job is to make the picture look natural, coherent, and beautiful to the human eye.
This is also the most commercially mature field today. Well-known video-generation models such as OpenAI's Sora and ByteDance's Seedance 2.0, and image-generation models such as OpenAI's GPT-image-2 and Google's Nano Banana 2, are essentially the most sophisticated visual probability machines available. They have learned the statistical regularities of light, shadow, and form from hundreds of millions of images and videos on the Internet.
This beautiful surface comes at a price, Fei-Fei Li points out. These top-tier models can generate magnificent buildings, but if one tried to interact physically with the generated structures, the buildings would most likely collapse at once for lack of any supporting structure. In other words, the models do not understand what "support" means. What they generate is only what the audience "sees," not what the world "is."
2. Simulator
What the simulator pursues is precisely the structural fidelity that the renderer lacks. It does not care whether the video looks good. The only thing it cares about is whether the world obeys physical laws. When a simulator outputs an ordinary cup, it must also specify the cup's mass distribution, the friction coefficient of its material, its response to gravity, and its physical boundaries in collisions.
Only with a simulator can the content of a video be considered physically real. Yet in today's AI wave the simulator is seriously underestimated and often ignored altogether.
As the cup example shows, the simulator turns "discussing art" into "studying physics." Building a simulator that strictly conforms to physical laws requires enormous computing resources and annotation effort. But for robots, visual beauty is almost useless, and physical accuracy determines everything.
If the simulator is not accurate enough, robots trained in it will never make it into the real world. The Sim-to-Real gap is real: an action that passes 100% of tests in the laboratory can be completely derailed by a slight difference in friction in the real world. It is a reminder of what is often called Moravec's paradox: the perception and motor skills that come effortlessly to humans are among the hardest for machines.
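The cup and the friction problem can be sketched in a few lines of Python. This is an illustrative toy, not how any production simulator works: the `RigidBody` class, the 0.30 kg mass, both friction coefficients, and the 1.25 N push are invented numbers, and the only physics used is the textbook rule that an object at rest on a level surface starts to slide when the push exceeds the coefficient of static friction times its weight.

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass(frozen=True)
class RigidBody:
    """The physical attributes a simulator tracks that a renderer ignores."""
    name: str
    mass_kg: float
    static_friction: float  # coefficient against the supporting surface

    def max_static_friction_n(self) -> float:
        """Largest horizontal force the contact can resist on a level surface."""
        return self.static_friction * self.mass_kg * G

    def slides_under(self, push_n: float) -> bool:
        """True if a horizontal push of push_n newtons starts the body moving."""
        return push_n > self.max_static_friction_n()


# Hypothetical numbers: the simulator assumes slightly less friction than reality.
sim_cup = RigidBody("cup", mass_kg=0.30, static_friction=0.40)
real_cup = RigidBody("cup", mass_kg=0.30, static_friction=0.45)

push_n = 1.25  # a push tuned in simulation
print(sim_cup.slides_under(push_n), real_cup.slides_under(push_n))
```

With these made-up values the tuned push moves the cup in simulation but not in reality, even though the friction coefficients differ by only 0.05.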
3. Planner
The planner is responsible for producing actions. As the link between perception and feedback, it has to answer the core question of "what to do next," a question that never has a standard answer. In Fei-Fei Li's framework, this is the last link in the "perception-action" closed loop and the most challenging frontier of the three.
Today's Vision-Language-Action (VLA) models are all attempts to let a system make decisions in an unstructured, complex world. The planner does not just predict the future; it also selects, among countless possibilities, the most effective path to the goal. It is the key to machines evolving from "observers" into "practitioners."
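The idea of predicting several possible futures and picking the best one can be shown with a deliberately tiny planner. The sketch below is an assumption-laden toy rather than a description of any VLA model: the `predict` and `plan` functions, the three discrete actions, and the cost weights are all made up. It enumerates every action sequence over a short horizon, rolls each one forward through a predictive model, and keeps the cheapest.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # move left, stay, move right on a number line


def predict(state: int, action: int) -> int:
    """Toy predictive model: where the agent ends up after one action."""
    return state + action


def plan(state: int, goal: int, horizon: int = 3):
    """Exhaustively search all action sequences and return the cheapest one."""
    best_plan, best_cost = None, float("inf")
    for candidate in product(ACTIONS, repeat=horizon):
        predicted = state
        effort = 0
        for action in candidate:
            predicted = predict(predicted, action)
            effort += abs(action)  # every move costs one unit of effort
        # Missing the goal is penalized far more heavily than spending effort.
        cost = effort + 10 * abs(goal - predicted)
        if cost < best_cost:
            best_plan, best_cost = candidate, cost
    return best_plan, best_cost


best_plan, best_cost = plan(state=0, goal=2)
print(best_plan, best_cost)
```

Exhaustive search only works because this toy has 27 candidate sequences. Real planners face continuous actions and long horizons, so they have to sample, learn a policy, or otherwise prune the space of possibilities.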
03
The Hub Worth Billions of Dollars
Of Fei-Fei Li's three categories, models corresponding to the renderer and the planner are already relatively common. The remaining one, the simulator, is naturally the hardest to build. Fei-Fei Li also offers an insightful judgment: the simulator is the link that connects rendering and planning, and it is the core hub of the entire system.
The company that has gone furthest in simulators is not OpenAI, Anthropic, or Google, but NVIDIA, led by Jensen Huang.
NVIDIA's Omniverse claims it can support the dream of trillion-scale digital twins because it has grasped the essence of the simulator. On NVIDIA's platform, the operations of factories, supply chains, and warehouses become complete digital mirrors. For industry, this is no longer just a visual demo but core productivity infrastructure.
This is not an exaggeration but a trillion-dollar market opportunity in front of everyone.
The opportunity runs from virtual visualization in construction engineering, to molecular dynamics simulation in the pharmaceutical industry, to scenario testing for autonomous driving. What these industries lack is not vivid image and video generation but a high-fidelity simulator. It is no exaggeration to say that mastering simulation of the physical world is equivalent to holding a priority ticket to AI industrialization.
However, the practical difficulties are such that this field has almost no technological optimists. Fei-Fei Li also admits that a huge gap remains.
First, there is the embodied-intelligence data problem that we have mentioned repeatedly before. The Internet holds an enormous amount of video, but 3D data annotated with clear geometric structure, material properties, and physical feedback is extremely scarce.
Second, applying generative AI always carries hidden risks. AI-generated geometry can look visually perfect yet is often physically unreasonable: a cup intersects the table it sits on, or objects lose volume when they collide. In plain terms, the single word "clipping" sums up these strange phenomena, but in real industrial applications it means disaster.
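A clipping check of the kind a physics pipeline would need can be sketched as follows. It is a simplified illustration under stated assumptions: objects are approximated by axis-aligned bounding boxes, the `Box` class and `penetration_depth` function are hypothetical names, and the table and cup dimensions (in metres) are invented. Real engines work on meshes or convex hulls, but the principle is the same: two solid objects may touch, but their volumes must not overlap.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box given by its minimum and maximum corners."""
    min_corner: Vec3
    max_corner: Vec3


def penetration_depth(a: Box, b: Box) -> float:
    """Smallest overlap along any axis, or 0.0 if the boxes do not interpenetrate."""
    overlaps = [
        min(a.max_corner[i], b.max_corner[i]) - max(a.min_corner[i], b.min_corner[i])
        for i in range(3)
    ]
    # The boxes intersect only if they overlap on all three axes at once.
    return min(overlaps) if all(o > 0 for o in overlaps) else 0.0


table = Box((0.0, 0.0, 0.0), (1.0, 1.0, 0.75))
resting_cup = Box((0.4, 0.4, 0.75), (0.5, 0.5, 0.85))  # sits exactly on the tabletop
clipped_cup = Box((0.4, 0.4, 0.73), (0.5, 0.5, 0.83))  # sunk 2 cm into the table

print(penetration_depth(table, resting_cup))
print(penetration_depth(table, clipped_cup))
```

A generated scene that looks fine on screen can still fail a check like this, which is why visually convincing geometry is not automatically usable in a physics engine.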
04
Towards a Unified World Model
Despite the many difficulties, Fei-Fei Li still offers an optimistic reading of where the industry is heading: the boundaries between rendering, simulation, and planning are becoming increasingly blurred.
This is not merely a beautiful vision but something that is already happening. Based on its own exploration, Fei-Fei Li's World Labs team believes the field is moving toward a unified foundation model, an architecture in which imagination and logic can be combined.
The future model will no longer be a patchwork of single-purpose functions stacked together but a unified neural network foundation. On one hand it can render a realistic scene through Gaussian splatting; on the other it can generate, in real time, the collision mesh that a physics engine requires. Simply put, the unified foundation model will switch seamlessly between the visual mode humans need and the state mode a physics engine needs.
Seen another way, traditional models are static, while future world models will be far more interactive. The renderer will no longer be a passive video generator but will gradually begin to accept action instructions. The simulator will become more editable and controllable. The planner will be able to reason logically and adjust its strategy automatically as the environment changes.
05
The Long Arc of Spatial Intelligence
Finally, from a macro perspective, why is all this about the "world model" important?
In Fei-Fei Li's view, AI research over the past few decades has been a continuous search for the key that lets machines enter the physical world. We now have language models that are good at handling logic; what we need next is a model that handles space. The core of spatial intelligence is how machines interact with the physical world they inhabit.
In this battle, it's not about who has more computing power, but about who can define the digital standards of the physical world.
The world model is by no means a simple algorithm optimization but a great feat of AI evolution.
"Language endows machines with the ability to talk about the world, while the world model is the way for machines to finally understand, imagine, reason, and interact with the physical world."
Everyone in this era is moving from merely talking about the world toward truly understanding and reconstructing it.
Nevertheless, the world model is only an intermediate waypoint on the road to AGI, and the AI humans have built still has a long way to go before it reaches a "world model" in the true sense. Here, the somewhat extreme view of Yann LeCun, another leading figure in world models, is worth sharing:
Optimistically, it will take at least five to ten years for machine intelligence to barely approach that of a puppy.
This article is from the WeChat official account "Silicon-based Stardust", author: Siqi, published by 36Kr with authorization.