The debate on world models between Fei-Fei Li and LeCun
The path to AGI has finally converged on the battlefield of world models.
Fei-Fei Li has released her company's first commercial world model, Marble.
Almost at the same time, Yann LeCun left Meta and is preparing to establish his own world model company.
Before that, Google's world model, Genie 3, also caused a stir in the industry.
Although all three major forces in AI are entering the world-model arena, they represent bets on three completely different technological routes:
The Battle of World Models
Right after Fei-Fei Li published a long article advocating for spatial intelligence, her startup, World Labs, wasted no time in launching the first commercial world model, Marble.
The industry generally believes that Marble has commercial potential because it generates persistent and downloadable 3D environments.
The team says this approach significantly reduces problems such as scene warping and inconsistent details. Moreover, the generated world can be exported as Gaussian splats, meshes, or even directly as video.
Marble also ships with a built-in native AI world editor called Chisel, which lets users reshape the generated world with a single prompt.
For developers working on VR or games, the workflow of "one prompt → a generated 3D world → one-click export to Unity" is genuinely useful.
However, a machine learning engineer on Hacker News pointed out that, despite its billing as a world model, Marble looks more like a plain 3D rendering model:
Isn't this just a Gaussian Splat model? I've been in the AI industry for so long, and I still don't understand what the "world" in the "world model" really means.
A Reddit user put it even more bluntly:
Converting pictures into 3D environments using Gaussian splatting, depth, and image inpainting is really cool, but this is just a 3D Gaussian generation pipeline, not the brain of a robot.
The Gaussian splatting mentioned here is a technique that has become popular in 3D modeling in recent years.
It represents a scene as thousands of colorful, fuzzy little blobs (i.e., Gaussians) floating in space, then "splats" them onto the screen, where they blend naturally into an image.
You can think of it this way: a Gaussian is like a small, translucent, soft-edged bubble with a faint halo, floating in three-dimensional space.
Of course, a single bubble can't form a shape, but thousands of them gathered together and rendered from different angles combine into a beautiful three-dimensional picture.
Unlike traditional photogrammetry, this method doesn't require a complex modeling process. It sacrifices some accuracy, but it is extremely fast and much easier to work with.
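The "bubbles blending into an image" idea can be sketched in a few lines. This is a toy 2D illustration of splatting, not the real 3D Gaussian Splatting pipeline; all numbers and the front-to-back ordering are hypothetical:

```python
import numpy as np

H, W = 64, 64
ys, xs = np.mgrid[0:H, 0:W]  # pixel coordinate grids

# Hypothetical scene: three Gaussians ("bubbles"), sorted front-to-back.
# Each has a center, a spread (sigma), an RGB color, and an opacity.
gaussians = [
    (20.0, 24.0, 6.0,  np.array([1.0, 0.2, 0.2]), 0.8),
    (36.0, 32.0, 9.0,  np.array([0.2, 1.0, 0.2]), 0.6),
    (44.0, 44.0, 12.0, np.array([0.2, 0.2, 1.0]), 0.5),
]

image = np.zeros((H, W, 3))
transmittance = np.ones((H, W))  # how much light still passes through

for cx, cy, sigma, color, opacity in gaussians:
    # Soft, translucent falloff around the center: the "bubble".
    alpha = opacity * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    # Composite this splat on top of what's already been drawn.
    image += (transmittance * alpha)[..., None] * color
    transmittance *= 1.0 - alpha

print(image.shape)  # a rendered 64x64 RGB picture
```

Each bubble alone is just a blurry spot; the front-to-back alpha compositing is what makes many of them merge into a coherent picture, with no mesh or photogrammetry step involved.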
Marble adopts exactly such an approach.
However, this also means that Marble may not be the "world model" that people expect, which can be directly used for robot training.
Marble does build a complete world, but what we actually see is just a view that can be directly converted into pixels by a renderer.
In other words, it captures "what the surface looks like" without any built-in physical laws explaining "why the world behaves this way".
This is entirely sufficient for humans, but for robots, what matters is not this visual information but the underlying causal structure:
For example, a ball placed on a slope will roll down, which is obvious to humans at a glance.
But for a robot to make the same judgment, it needs information such as mass, friction, and velocity, which simply doesn't exist in Marble.
Perhaps for this reason, on Marble's own blog, although it often mentions "world model" and "exporting Gaussian scatterers, meshes, and videos", it hardly mentions robots at all.
However, in terms of commercialization, Marble clearly has more advantages.
Compared with the type of world model that can give birth to embodied intelligence, which is hotly debated in the AI circle, Marble is not a distant concept but a practical tool that can be immediately integrated into the daily work process of game developers.
But this also makes people a little disappointed. Is the "world model" path leading to AGI just a gimmick?
Of course not.
There are indeed world models that can truly interact with robots, such as LeCun's JEPA.
LeCun's understanding of the "world model" is not rooted in 3D graphics but in control theory and cognitive science.
It doesn't need to output beautiful pictures because you can't "see" this kind of world model at all.
The task of this type of world model is not to render beautiful pixels but to enable robots to think a few steps ahead and learn to predict changes in the world before taking action.
JEPA follows this path -
LeCun believes that for AI, only the middle abstract representation is important. The model doesn't need to waste computing power generating pixels but should focus on capturing the world states that can be used for AI decision - making.
So, although this type of model can't generate delicate 3D images like Marble and doesn't seem as "amazing", it is more like training the "brain" of a robot.
Its advantage lies in a more fundamental understanding of the world, so it is more suitable as a training ground for robots.
In comparison, Fei - Fei Li and LeCun's paths in the "world model" are almost opposite -
The former creates a front - end asset generator; the latter is more like a back - end prediction system.
Between these two titans, there stands a tech giant - Google.
In August this year, Google DeepMind launched a new version of the world model, Genie 3.
With just one prompt, the model can generate an interactive video environment where users can freely explore for several minutes.
What's most impressive is that Genie 3 is the first in this type of model to solve the problem of long - term consistency - there won't be situations like "turn around and the whole building disappears".
At the same time, it also supports triggering world events, such as "it starts to rain" or "night falls". The whole process is like an electronic game driven by a model rather than a traditional engine.
However, Genie should be more like a "world - model - style video generator".
Although Genie 3 makes the "world move", its core is still video logic, not the physics - and - causality - based logic like JEPA.
That is to say, although it can generate dynamic pictures, it can't fully "understand" the physical laws behind these pictures.
It can still be used for robot training, but it doesn't get to the essence as well as JEPA.
Meanwhile, the picture quality and resolution are also limited, and it can't be compared with Marble's high - precision, exportable 3D assets.
In summary, although the three "world models" all depict the "world", their understanding paths are completely different, and thus they each have their own advantages -
Marble renders "what the world looks like", Genie 3 shows "how the world changes", and JEPA explores "what the structure of the world is".
Almost all the "world models" on the market can generally be classified into these three paradigms:
The Pyramid of World Models
First type: The world model is the interface
Represented by Marble, it enables people to directly generate editable and shareable three - dimensional environments from text or two - dimensional materials.
In this mode, the "world" is the space presented on VR headsets, monitors, or computer screens where people can view and move around.
Second type: The world model is the simulator
Represented by Genie 3, this type of model can generate a continuous and controllable video - style world, allowing agents to repeatedly try, fail, and try again.
Agents like SIMA 2 can use this type of world as a "virtual gym".
Third type: The world model is the cognitive framework
Represented by JEPA, this is a highly abstract form without the pictures for people to appreciate like the first two types.
Here, the focus is not on rendering. The "world" is presented in the form of latent variables and state - transition functions, which can be said to be a perfect training base for robots.
In the view of Zhao Hao, a scholar from Beijing Academy of Artificial Intelligence, these three can actually be assembled into a "pyramid of world models" -
From bottom to top are Fei - Fei Li, Genie 3, and LeCun.
Looking up at this pyramid from the ground:
The higher up, the more abstract the model is and the closer it is to the way AI thinks, so it is more suitable for robot training and reasoning.
The lower down, the more real the model is for humans in terms of appearance, interaction, and visualization, but the more difficult it is for robots to understand.
Reference links:
[1]https://entropytown.com/articles/2025-11-13-world-model-lecun-feifei-li/
[2]https://mp.weixin.qq.com/s/D7G3S_AIfzQfITgqXIKQAg
This article is from the WeChat official account "QbitAI". Author: Jay. Republished by 36Kr with permission.