One world is not enough for all the world models.
The world model has become as chaotic as the real world.
OpenAI pointed at the videos generated by Sora and called it a "world simulator". Yann LeCun pointed at the same Sora and dismissed it as pixel hallucination, arguing that a real world model should be an "abstract brain that predicts the future". Google DeepMind declared Genie 3 an "interactive general world model". Fei-Fei Li, meanwhile, believes "spatial intelligence" is the right answer.
The real world is unique and objective, but in AI circles, everyone seems to be building a "world model" of their own.
For all their divergent definitions, these bickering bigwigs have converged on one basic judgment: the era of large language models will eventually end, and the world model is the only road to AGI.
After GPT-3.5, large language models went through an explosion of parameters. The world model, before its technical routes could converge, has first gone through an inflation of concepts.
The world model is a basket: anything can be thrown in
The chaos around the "world model" stems from the fact that it names a goal, getting AI to understand the laws of the external world and predict how it changes, rather than any specific technical path.
The confusion starts with the concept itself.
The idea of the world model traces back to the "mental model" proposed by cognitive scientist Kenneth Craik in 1943: the brain makes predictions by constructing a scaled-down model of the external world. In other words, we carry a mental model that not only processes what we currently see but also predicts "what the world will look like if I do this".
Although the theory entered reinforcement learning in the 1990s, it was the foundational 2018 paper "Recurrent World Models Facilitate Policy Evolution" by David Ha and Jürgen Schmidhuber that gave it a name in modern AI. The paper was the first to systematically define a neural-network world model framework: a concrete architecture composed of a vision component (a VAE), a memory component (an RNN), and a controller, trained on a simple car-racing game and a Doom-style 2D shooter.
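That three-part layout is compact enough to sketch in code. The sketch below is a heavily simplified, illustrative rendering of the V-M-C split, assuming PyTorch; the paper's actual V is a variational autoencoder and its M a mixture-density RNN, details omitted here.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Minimal V-M-C layout after Ha & Schmidhuber (2018), heavily simplified."""
    def __init__(self, z_dim=32, h_dim=256, action_dim=3):
        super().__init__()
        # V: compresses each observed frame into a small latent code z
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(z_dim),
        )
        # M: recurrent memory predicting how z evolves given an action
        self.memory = nn.LSTM(z_dim + action_dim, h_dim, batch_first=True)
        # C: a deliberately tiny controller reading [z, h], emitting an action
        self.controller = nn.Linear(z_dim + h_dim, action_dim)

    def act(self, frame, action_prev, hidden=None):
        z = self.vision(frame)                            # frame -> latent code
        x = torch.cat([z, action_prev], dim=-1).unsqueeze(1)
        out, hidden = self.memory(x, hidden)              # advance the "mental model"
        action = torch.tanh(self.controller(
            torch.cat([z, out.squeeze(1)], dim=-1)))      # decide from z and memory
        return action, hidden
```

The design point has survived every later variant: perception compresses, memory predicts, and the controller stays deliberately small.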
Seven years on, amid the explosion of large language models, the longing for general artificial intelligence has made this concept proliferate in every direction over the past two years.
Yann LeCun proposed the concept of "autonomous machine intelligence" centered on the world model in 2022, emphasizing abstract representations obtained through modular design and self-supervised learning. He then released the I-JEPA and V-JEPA predictive models in 2023 and 2024 respectively.
Fei-Fei Li proposed "spatial intelligence" in 2024, founded World Labs, and has just released Marble. She holds that a world model must be able to generate physically consistent, interactive 3D environments: "For me, spatial intelligence is the ability to create, reason, interact, and understand the profound spatial world, whether it is two-dimensional, three-dimensional, or four-dimensional, including dynamics and all of these."
Even the "Compression is Intelligence" mentioned by Ilya Sutskever, the former chief scientist of OpenAI, essentially believes that as long as the next token (whether text or pixels) can be losslessly compressed and predicted, the model has built a mapping of the world inside.
An abstract concept has given rise to more abstract concepts.
Strip away the definitional disputes and look purely at technical direction, and today's world models divide into two major schools embodying two entirely different worldviews: the Representation school and the Generation school.
Yann LeCun belongs to the Representation school, a minimalist route that generates no images at all.
The analogy is the mental model in the human brain: our predictions about the world, and our actions in it, are usually intuitive, mediated neither by physical formulas nor by concrete images. On this view, LeCun's world model is a "brain" hidden deep in the system's backend. It operates only in the latent space produced by representation learning, predicting "abstract states".
In a post on X, LeCun defined a world model as taking four inputs simultaneously: an estimate of the previous world state s(t), the current observation x(t), the current action a(t), and a latent variable z(t). Combining the four, it predicts the world state at the next moment, s(t+1).
This definition has two key points: the world model predicts the next moment's state rather than an image, and it supports causal inference over continued action interactions.
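Written out as a transition function (the symbol f here is our notation, not LeCun's), the definition reads:

```latex
s(t+1) \;=\; f_\theta\big(s(t),\ x(t),\ a(t),\ z(t)\big)
```

The latent z(t) absorbs everything the observation leaves undetermined, which is how the model can entertain multiple plausible futures for the same scene.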
For example, as a car approaches, the brain does not render the license plate and the reflections on the paint; it only computes the state "an obstacle is closing in". Such a model is not for humans to watch but for machines to decide with: it pursues causal deduction, not visual realism. LeCun's I-JEPA and V-JEPA (Joint-Embedding Predictive Architectures) both abandon generative AI's habit of "predicting every pixel", because the real world is full of unpredictable noise, such as the texture of leaves, and AI should not burn compute generating those details.
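A minimal sketch of that idea, with module names ours and the real I-JEPA recipe (masking strategy, EMA target encoder) simplified away. The point is where the loss lives: it compares embeddings, so pixel-level noise never enters the objective.

```python
import torch
import torch.nn.functional as F

def jepa_loss(context, target, encoder, target_encoder, predictor):
    """Predict the target's *representation*, never its pixels.

    encoder and predictor are trained; target_encoder is typically an
    EMA copy of encoder (an I-JEPA detail, simplified to no_grad here).
    """
    s_ctx = encoder(context)                 # abstract state of visible patches
    with torch.no_grad():
        s_tgt = target_encoder(target)       # abstract state of masked patches
    s_pred = predictor(s_ctx)                # prediction made in latent space
    # Distance between embeddings: leaf textures and sensor grain
    # carry no gradient, so no compute is spent reproducing them.
    return F.mse_loss(s_pred, s_tgt)
```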
The second major school, currently the loudest, is the Generation school. Its core difference from LeCun: it aims to reconstruct and simulate the visual world.
This school likes to quote the physicist Richard Feynman: "What I cannot create, I do not understand." The implication is that if a model can generate the world correctly, it has understood the world's physical laws.
When OpenAI introduced Sora at the beginning of 2024, it called the model a world simulator. The bet was that with enough data, a model can learn physical laws just by predicting the pixels of the next frame: from billions of video clips it absorbs the probability distributions behind "legs alternate when a person walks" and "a glass shatters when it falls".
As a world model, Sora is highly controversial. The most direct objection is that it cannot satisfy LeCun's causal requirement linking actions to world states. If the model can only play videos forward like a film reel and cannot answer "how will the ball fly if I kick it", then it may have merely memorized the probable trajectories of balls rather than understood the laws of mechanics.
So what if video generation could predict the next frame in real time, conditioned on the user's action input?
Thus the Generation school spawned a further form: Interactive Generative Video (IGV), exemplified by Genie 3.
What distinguishes IGV from Sora is real-time interactivity, that is, the presence of actions. Google DeepMind positions Genie 3 squarely as a "general-purpose world model". It lets users step inside the scene and interact with it, generating real-time imagery at 720p and 24fps. Users can navigate freely, for example driving or exploring complex terrain from a first-person perspective. The model thus understands not just the imagery but the causal relationship between actions and environmental change, although for now the actions are limited to the arrow keys' up, down, left, and right.
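The structural difference from Sora-style offline generation fits in a few lines: an interactive generator runs a loop that consumes a user action every tick and must return the next frame inside the frame budget, roughly 42 ms at 24fps. Everything below is a hypothetical skeleton; Genie 3's actual architecture is not public.

```python
import time
from typing import Callable

def interactive_rollout(step: Callable, render: Callable, get_action: Callable,
                        state, fps: int = 24, steps: int = 240):
    """The loop separating IGV from offline video generation:
    an action comes *in* every tick, a frame must go *out* within budget."""
    frame_budget = 1.0 / fps                  # ~42 ms per frame at 24 fps
    for _ in range(steps):
        t0 = time.monotonic()
        action = get_action()                 # e.g. one of the arrow keys
        # Causality requirement: the next frame must depend on this action,
        # not only on the frames so far (which is all Sora conditions on).
        frame, state = step(state, action)
        render(frame)
        time.sleep(max(0.0, frame_budget - (time.monotonic() - t0)))
```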
Finally, there is the 3D "spatial intelligence" advocated by Fei-Fei Li, whose latest representative is World Labs' newly released Marble.
Where the previous two schools deal in video streams, Marble attempts to build a persistent, downloadable 3D environment from the ground up.
The technical foundation of this route is closer to 3D Gaussian Splatting. Instead of traditional mesh modeling, it represents the world as thousands of colorful fuzzy blobs (3D Gaussians) floating in space. By aggregating these primitives, the model renders convincing three-dimensional imagery, lets users generate scenes from prompts and reshape them freely in the built-in editor, and supports one-click export to engines such as Unity.
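"Colorful fuzzy blobs" has a precise reading. Each primitive is an anisotropic 3D Gaussian carrying a position, a covariance (its size and stretch), a color, and an opacity; a pixel's color comes from blending, front to back, the Gaussians that project onto it. A minimal data sketch, with field names ours and the camera-projection step omitted:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """One 'fuzzy blob' in a 3D Gaussian Splatting scene."""
    mean: np.ndarray      # (3,) position in space
    cov: np.ndarray       # (3, 3) covariance: size, stretch, orientation
    color: np.ndarray     # (3,) RGB
    opacity: float        # base alpha in [0, 1]

def blend_pixel(sorted_gaussians, alphas):
    """Front-to-back blending: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).
    `alphas` are the per-pixel alphas after projecting each Gaussian."""
    color = np.zeros(3)
    transmittance = 1.0                   # how much light still gets through
    for g, a in zip(sorted_gaussians, alphas):
        color += g.color * a * transmittance
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:          # early exit once nearly opaque
            break
    return color
```

Because the scene is an explicit set of primitives rather than frames of video, it can be saved, edited, and re-rendered from any viewpoint, which is what makes Marble's worlds persistent and exportable.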
Marble is still far from the spatial intelligence Fei-Fei Li describes, but her first step is visible in it: build a high-precision, physically accurate 3D space. To summarize the contrasts: unlike Sora, Marble generates a 3D world that obeys physical laws; unlike Genie 3, Marble's world is not generated in real time, but it offers higher precision and fidelity.
Yet none of these routes has delivered what a world model is expected to be. Their champions argue with one another, each backed by a camp of supporters, and the concept of the "world model" keeps inflating without limit.
Today, every project upstream or downstream of environmental understanding and simulation gets tied to the world model, actively or passively: structured vertical fields such as embodied intelligence, autonomous driving, and game video; technologies such as generative video, multimodal models, video understanding, and 3D models; even DeepSeek-OCR, which compresses visual information.
The world model looks more and more like a basket into which anything can be thrown.
With bubbles and ambitions alike, the world model is an "anti-LLM-centric" narrative
Differences in technical routes alone cannot explain why the "world model" exploded this year. Behind the surge lie intertwined capital anxiety, technological bottlenecks, and the longing for AGI.
We must first admit that there is a huge bubble in it.
In venture capital, narrative is often worth more than code. With the competitive landscape of large language models settled, and OpenAI, Google, and the rest having carved up the world of foundation models, latecomers and vertical application builders urgently need a new story to impress investors.
A "video generation model" sounds like a tool software with a limited ceiling. But once it is renamed the "world model", it instantly rises to the height of AGI.
This is another curiosity of the current AI era: researchers are flooding into business to found companies, and academia and industry now overlap.
In pure research, every innovation rests on rigorous axioms: to solve a problem (say, achieving AGI), you must first define it precisely. But when a lab becomes a company and academic giants become CEOs, a "definition dispute" once confined to journals is thrown into the marketplace.
In research, different routes can coexist. In a startup, resources are finite: if definition A is correct, Company B's billions in investment may be wasted. A difference of definition maps onto billions in compute spending, onto inventory along the industrial chain, and onto the repricing of investors' bets.
Set aside the definitional disputes and the hype, and the rise of the world model also looks like a movement of "anti-LLM-centrism".
The entire AI industry carries a collective technological anxiety about large language models (LLMs), rooted in an innate defect: they are disembodied. An LLM is trained inside a pure system of text symbols. It knows the word "apple" keeps company with "red" and "sweet", but it has never actually seen an apple and cannot grasp the gravitational acceleration of one falling to the ground. And as data scales up, the marginal gains keep shrinking.
Whether it is Ilya Sutskever stressing "going beyond large models" after leaving OpenAI or Fei-Fei Li's "spatial intelligence", the core is the same: AI must shift from learning "what humans say" to learning "what happens in the world". The industry is moving from text processing toward simulating and interacting with physical reality, because everyone has realized that the last piece of the AGI puzzle lies not in the Internet's text data but in the real physical world.
We can only hope the term "world model" is not ruined before the real thing arrives.
This article is from the WeChat public account "Silicon Star GenAI", author: Huang Xiaoyi, published by 36Kr with authorization.