The rise of world models has reignited a fierce debate over AI development paths.
The evolutionary code of the human brain, still undeciphered, may hold the key to the future of AI.
Recently, it was reported that Yann LeCun, Turing Award winner and Chief AI Scientist at Meta, is leaving to found his own startup, with "World Models" as its core technology, continuing his long-standing line of exploration. The move quickly caught the attention of the global AI community.
Fei-Fei Li, known as the "Godmother of AI", published a long essay on her social platform pointing out the computing-power ceiling and cognitive limitations of current large language models (LLMs). She argued that the future of AI lies not in endlessly expanding model parameters but in instilling "Spatial Intelligence": a basic cognitive ability that humans are born with and that awakens in infancy, and the necessary path to artificial general intelligence (AGI).
Meanwhile, World Labs, founded by Fei-Fei Li, launched its first product, Marble, on November 13th. With a multimodal world model as its core engine, it can generate a persistent three-dimensional digital-twin space from a single image, video clip, or text description, building a key three-dimensional cognitive foundation for spatial intelligence.
As AI moves from a purely virtual context into physical reality, the complex constraints and dynamic interactions of the real world are calling for a disruptive cognitive model to break the deadlock.
A Disagreement over AI's Essential Route
Yann LeCun worked at Meta for 12 years. It is no secret that his technical vision clashed with the large language model path championed by Mark Zuckerberg.
He once stated publicly, "Large language models will never achieve human reasoning ability." The remark points straight at the core contradiction in AI development: should we use text data to train machines to chat ever more fluently, or should we let AI learn physical laws through visual observation, the way infants do?
Large language models have always been constrained by the quality and scale of their data; their cognitive boundaries remain confined within the "invisible wall" of the training set.
Data bias hardens into the model's own cognitive bias, noisy data directly dilutes reasoning accuracy, and stale data traps the model in an "information time lag" that makes it hard to track the dynamic evolution of the real world. Even with ever-larger datasets, parameter stacking slides into the "scale curse": computing-power consumption and performance gains fall out of proportion, and marginal returns keep shrinking.
The more fundamental limitation is that a large language model's cognition is confined to linear associations among text symbols. It lacks the ability to model the three-dimensional space of the physical world or to perform dynamic causal reasoning: it cannot accurately map real-world spatial topology, object attributes, and laws of motion, nor understand the real-time "action-feedback" logic of interaction, so its cognition frequently breaks down when applied across scenarios.
For example, Yann LeCun has pointed out that large language models can neither accurately reconstruct a three-dimensional scene from a text description nor make decisions that conform to physical common sense under real-world constraints.
A model fed solely on text will ultimately struggle to break out of the "symbol cage", and it cannot replicate the human cognitive path of extracting abstract knowledge from concrete experience.
When AI needs to move from virtual interaction to practical applications in the physical world, and to upgrade from single-task response to autonomous decision-making in complex scenarios, a purely text-driven architecture can no longer meet the evolutionary needs of general artificial intelligence. Only by stepping out of the data-scale race and turning to a structured understanding of the world's essence can the next technological leap begin.
The "World Model School" generally believes that large language models have fundamental limitations. Fei - Fei Li emphasized that language is an abstract signal created by humans for communication. There are no words in nature. If AI only relies on text, it cannot truly understand the laws of the physical world and is likely to become a "master of words in the dark".
Yann LeCun has repeatedly criticized large language models as merely powerful text databases that lack any understanding of the real world. World models, by contrast, aim to model the world directly from high-dimensional perceptual data: bypassing language, inferring physical laws in latent space, and outputting action commands, thereby achieving an internal understanding of the environment and active reasoning about it.
Human infants understand gravity without reading encyclopedias: they build their grasp of the physical world by watching a cup fall and touching the table with their hands. This is the key reason LeCun advocates world models: the spatiotemporal information carried in dynamic video is far closer to the essence of intelligence than abstract text.
For example, the moment a ball knocks over a stack of building blocks carries information about material hardness and also encodes the laws of mechanics, whereas the "Newton's laws" a large language model learns from Wikipedia are just statistical associations among symbols. Research at MIT has further shown that specific neural circuits are activated when the brain processes spatial cognition; this biological instinct is precisely the underlying ability that today's pure-text AI lacks.
The term "Word Models" first appeared in a 2018 article titled "Recurrent World Models Facilitate Policy Evolution" published by Jurgen at the top machine - learning conference NeurPS. The article analogized world models to the mental models of the human brain in cognitive science, believing that mental models are involved in human cognition, reasoning, and decision - making processes, and the most core ability lies in counterfactual reasoning.
Such a model gives AI the ability to predict and plan, for example understanding why an object shatters or anticipating a vehicle's turning trajectory, providing foundational support for embodied intelligence, autonomous driving, and collaborative robots. Fei-Fei Li summarized it as upgrading "seeing" into "reasoning", turning "perception" into "action", and "imagination" into "creation".
In recent years, with the continuous development of deep-learning technology and the increase in computing resources, significant progress has been made in research on world models.
For example, the MuZero algorithm published by DeepMind in 2019, the JEPA representation model proposed by Yann LeCun in 2022, the video-generation model Sora in 2024, and the urban-environment generation model UrbanWorld have all advanced the application of world models in different fields.
Overall, a world model is a generative AI model that can simulate a real-world environment, generating video and predicting future states from inputs such as text, images, video, and motion. It integrates semantic information across modalities, vision, hearing, and language, and uses machine learning, deep learning, and other mathematical models to understand and predict real-world phenomena, behaviors, and causal relationships.
Put simply, a world model is an AI system's "internal understanding" and "mental simulation" of the real world. It can not only process input data but also estimate states that are not directly observed and predict how future states will change.
This model enables AI to have cognitive and reasoning abilities similar to humans, allowing it to conduct simulations and planning in a virtual "mind" to better cope with the complexity of the real world.
Unlike large language models in the broad sense, world models do not understand real-world scenes merely through existing language, images, and video. Instead, they learn the physical rules of the real world from large amounts of data and perform causal reasoning, predicting and generating futures that conform to real-world laws. The ultimate goal is to train AI to adapt to the real world rather than a theoretical one, evolving it into physical AI.
World models have three core characteristics:
Firstly, internal representation and prediction. A world model encodes high-dimensional raw observations (such as images, sounds, and text) into low-dimensional latent states, forming a concise and effective representation of the world. On this basis, it can predict the distribution of the next state given the current state and an action, achieving forward-looking prediction of future events.
Secondly, physical cognition and causal relationships. A world model has basic physical cognition and can understand and simulate laws of the physical world such as gravity, friction, and motion trajectories. This lets it provide more accurate, realistic predictions and decision support for problems grounded in the physical world.
Thirdly, counterfactual reasoning. A world model can not only make predictions from existing data but also engage in hypothetical thinking, that is, counterfactual reasoning. It can answer questions like "What would happen if the conditions changed?", opening up more possibilities for solving complex problems (see the toy sketch after this list).
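To make these characteristics concrete, here is a toy sketch in Python. It is purely illustrative: a hand-written `step` function stands in for a learned dynamics model, and every name and number is invented for the example. The same start state is rolled out under a factual action sequence and a hypothetical one, and the two imagined futures are compared, which is counterfactual reasoning in miniature.

```python
def step(state, action, dt=0.1, g=9.8):
    """Toy hand-written dynamics: a ball's (height, velocity) under gravity,
    where `action` is an upward thrust. A world model learns this from data."""
    pos, vel = state
    vel = vel + (action - g) * dt
    pos = max(pos + vel * dt, 0.0)  # the floor is at height 0
    return (pos, vel)

def rollout(state, actions):
    """Simulate an action sequence in the 'mind' and return every state visited."""
    trajectory = [state]
    for a in actions:
        state = step(state, a)
        trajectory.append(state)
    return trajectory

start = (1.0, 0.0)                            # 1 m above the floor, at rest
factual = rollout(start, [0.0] * 10)          # do nothing: the ball falls
counterfactual = rollout(start, [12.0] * 10)  # "what if we kept pushing up?"
print(f"do nothing -> final height {factual[-1][0]:.2f} m")
print(f"push up    -> final height {counterfactual[-1][0]:.2f} m")
```

A real world model would replace `step` with a learned latent dynamics network, but the pattern of "simulate, compare, decide" is the same.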
Generally, a complete world model consists of three components: a state representation model, a dynamics model, and a decision-making model.
The state representation model compresses raw observations (such as high-dimensional images and sensor readings) into low-dimensional latent states, retaining key information while filtering out noise. A common implementation uses variational autoencoders (VAEs). This compression lets the model process and understand complex inputs far more efficiently.
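As a rough sketch of this component, assuming PyTorch, with invented layer sizes and a fully connected backbone where real systems would use convolutional encoders (the reconstruction-plus-KL training objective of an actual VAE is omitted):

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """VAE-style encoder: compresses a raw observation into a
    low-dimensional latent state (all sizes are illustrative)."""
    def __init__(self, obs_dim=64 * 64 * 3, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)       # mean of the latent state
        self.log_var = nn.Linear(256, latent_dim)  # log-variance of the latent state

    def forward(self, obs):
        h = self.backbone(obs)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: sample a latent state differentiably.
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var

encoder = StateEncoder()
obs = torch.rand(1, 3, 64, 64)   # one fake 64x64 RGB frame
z, mu, log_var = encoder(obs)
print(z.shape)                   # torch.Size([1, 32]): a compact world state
```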
The dynamics model is the core of a world model: given the current latent state and an action, it predicts the distribution of the environment's next state. Recurrent neural networks (RNNs), long short-term memory networks (LSTMs), or stochastic state-space models (SSMs) are typically used to learn the state-transition laws, building an implicit understanding of the world's physics.
The dynamics model provides the agent with a virtual "sandbox" in which it can run simulations and experiments without costly trial and error in the real environment.
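A minimal sketch of such a dynamics model, again assuming PyTorch and invented dimensions: an LSTM cell predicts the next latent state from the current latent state and action, and the loop at the bottom "dreams" five steps ahead, which is the sandbox idea in code.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Recurrent dynamics sketch: predict the next latent state from
    the current latent state and action (all sizes are illustrative)."""
    def __init__(self, latent_dim=32, action_dim=4, hidden_dim=128):
        super().__init__()
        self.rnn = nn.LSTMCell(latent_dim + action_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z, action, hc=None):
        h, c = self.rnn(torch.cat([z, action], dim=-1), hc)
        return self.head(h), (h, c)

# Imagine five steps ahead without touching a real environment.
model = DynamicsModel()
z, hc = torch.zeros(1, 32), None
for t in range(5):
    action = torch.rand(1, 4)    # a stand-in for a planned action
    z, hc = model(z, action, hc)
print(z.shape)                   # torch.Size([1, 32]): the imagined state
```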
On top of state prediction, the decision-making model uses methods such as model predictive control (MPC) or deep reinforcement learning to plan the action sequence that best achieves the goal. It scores different actions by the value or reward of the futures predicted for them, guiding the agent to act sensibly in the environment.
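Continuing the sketch, and reusing the hypothetical `DynamicsModel` above, the simplest form of MPC, random shooting, might look like this: sample many candidate action sequences, roll each out through the learned dynamics, score the imagined futures with a reward function, and execute only the first action of the best plan before replanning.

```python
import torch

def plan_with_mpc(dynamics, reward_fn, z0, horizon=10, n_candidates=256, action_dim=4):
    """Random-shooting MPC sketch: pick the first action of the
    candidate plan whose imagined rollout earns the highest reward."""
    candidates = torch.rand(n_candidates, horizon, action_dim)  # random plans
    z = z0.expand(n_candidates, -1)                             # batch the start state
    total_reward = torch.zeros(n_candidates)
    hc = None
    for t in range(horizon):
        z, hc = dynamics(z, candidates[:, t], hc)  # imagine one step for every plan
        total_reward += reward_fn(z)               # score the imagined states
    best = total_reward.argmax()
    return candidates[best, 0]                     # execute one action, then replan

# Illustrative reward: prefer latent states close to the origin.
first_action = plan_with_mpc(DynamicsModel(), lambda z: -z.norm(dim=-1), torch.zeros(1, 32))
print(first_action)
```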
The Trigger Point for AI's Next Leap
In the past decade, every leap in AI has stemmed from changes in input methods: text brought language intelligence, and images gave rise to visual intelligence. Now, world models are enabling AI to understand the real world, a dynamic system with time, space, and causality.
Almost all AI pioneers agree that world models are crucial to building next-generation AI, and technology giants likewise regard them as the key at this juncture in AI's development.
In recent months, several technology companies have announced progress on world models in quick succession, underscoring how rapidly this track is heating up.
Google DeepMind's Genie series has advanced from 2D to Genie 3 in a year and a half. The model generates interactive 3D environments in real time: from a single sentence of input, users can create a dynamic world they can freely explore at 720p resolution, with scene details staying coherent over a memory of up to a minute. Shlomi Fruchter, co-lead of the Genie 3 project, said that building environments that simulate the real world lets AI be trained in a more scalable way, with "no need to bear the consequences of making mistakes in the real world."
Meta released the Code World Model (CWM), exploring how world models can improve AI code generation. The model is meant to think like a programmer, not just write code. Trained on 5T tokens of execution-trajectory data, CWM can simulate a program's execution line by line, from variable initialization to loop iteration, from function calls to exception throwing, accurately predicting every state change and pushing AI programming from static text generation toward dynamic execution and reasoning.
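To make "simulating the code-running process line by line" concrete, the sketch below (plain Python, not Meta's actual pipeline) uses the standard `sys.settrace` hook to record the kind of ground-truth execution trace, a line number plus a snapshot of local variables at each step, that a model like CWM is described as learning to predict.

```python
import sys

def record_trace(fn, *args):
    """Record a (line number, local variables) pair for every executed
    line of `fn`: one step of an execution trajectory."""
    trace = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)  # always restore normal execution
    return trace

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

for lineno, state in record_trace(demo, 3):
    print(lineno, state)  # e.g. a line number and {'n': 3, 'total': 0, 'i': 0}
```

Predicting the next `(line, state)` pair from the ones before it is the dynamics-model problem restated over program states.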
Meanwhile, Jensen Huang, the CEO of chip giant NVIDIA, asserted that the company's next major growth stage will come from "physical AI", and these new models will revolutionize the robotics field. NVIDIA is using its Omniverse platform to create and run such simulations to support its expansion into the robotics field.
Elon Musk, CEO of Tesla, was arguably among the earliest to put the "world model" concept into practice. To achieve autonomous driving across all road conditions worldwide, Tesla embedded an AI model between perception and decision-making, chiefly to build a virtual environment for learning and verifying autonomous-driving capabilities.
This world-model approach is already having a potentially enormous impact on the real world. Moritz Baier-Lentz, a partner and investor at venture-capital firm Lightspeed, said that drone warfare, new types of robots, and autonomous vehicles safer than human drivers are all benefiting from it.
Gary Marcus, the former head of AI at Uber, pointed out that no matter how much data today's generative AI is trained on, it can only build probabilistic models of how the world works. In essence, current AI learns associations among its inputs, whether text, images, or molecules and their functions. This fuzzy, approximate understanding of the world appears to be encoded confusingly in the AI "brain", containing both the data itself and a mass of complex data-processing rules, rules that are often incomplete or mutually contradictory.
A telling example: an Atari 2600 console running a 1979 program can beat the most advanced chatbots at chess. The chatbots frequently attempt illegal moves and quickly forget where the pieces stand. In essence, today's Transformer-based AI is making predictions, not performing logical reasoning, no matter how many rule books it has been trained on.
Although world models show great potential, they also face many challenges.
Firstly, there are challenges at the technical and ecosystem levels. Building world models requires vast amounts of multimodal data, including video, audio, and sensor readings, and collecting, annotating, and organizing it is costly and time-consuming. The quality and diversity of that data directly determine the model's performance and generalization ability.
Moreover, world models lack a supporting engineering system for cross-platform collaboration. There are as yet no standards: no unified training corpora, no comparable evaluation metrics, no public experimental platforms. Companies largely work in isolation, and without cross-model verifiability and reusability, the world-model ecosystem will struggle to produce innovation at scale.
Secondly, there are challenges at the cognitive level. The strength of world models lies in internal deduction and prediction, but this also makes their decision-making ever harder for humans to follow. When a model can simulate thousands of outcomes in latent space, can we still trace its decision logic?
The concerns run from liability attribution in autonomous driving, to possible goal drift among autonomous agents, to whether AI's goals remain aligned with human ones. Once AI shifts from passive execution to active learning, questions of safety and ethics rise from the technical level to the level of values.
Thirdly, there are challenges at the industrial and ethical levels. The further development of world models will inevitably redefine industrial boundaries. AI may not only reconstruct the decision-making systems in fields such as transportation, manufacturing, healthcare, and finance but also