Five major groups "besiege" large models
Yann LeCun's new company, AMI, has secured $1.03 billion in financing, setting a record for the seed round of a European AI company. Just a few weeks before AMI's funding, Fei-Fei Li's World Labs also announced a $1 billion financing round.
Both companies are in the same field: world models.
Yann LeCun once said, "Large language models are a dead end on the road to superintelligence." At first glance this seems to dismiss large language models outright, but the qualifier matters: he is talking specifically about the road to AGI. On reflection, there is some truth to it.
We can understand it simply like this: ChatGPT can write code and solve problems, but it can't figure out the basic laws of the physical world. If you ask it to describe "an apple falling to the ground," it can talk about it in detail. But if you ask it why the apple falls, it's actually just reciting from memory and may not really understand gravity.
The root of the problem is that the training data for large language models is internet text, while the real world is three-dimensional, continuous, and full of physical laws.
This is why world models have become the next research direction for scientific elites.
However, although everyone is talking about world models, there is no consensus on what the term actually means. Some teams work on video prediction, some build 3D scenes, some construct simulation platforms, and others start directly from neuroscience...
Zhuokai Zhao, a research scientist at Meta, shared on his X account what he believes are the "five major schools" of world models.
What are these five major schools? This article is an expanded version based on Zhao's tweets and references from multiple sources. We hope it will be helpful to you if you're interested in understanding world models.
JEPA School: Yann LeCun's "Abstract Philosophy"
JEPA stands for Joint-Embedding Predictive Architecture. It is a new type of AI architecture proposed by Yann LeCun and others.
To put it simply, the core idea of JEPA is to let AI learn the operating laws of the world through "observation" like humans, rather than rote-memorizing pixels or words: the AI doesn't need to remember the position of each leaf; it just needs to know that the wind will blow the leaves off.
In Yann LeCun's view, models like Sora essentially "predict the next frame pixel by pixel." He believes this is physically impossible because in a world full of randomness, you can't accurately predict the falling trajectory of each leaf.
JEPA's solution is to make predictions in an abstract "representation space" rather than predicting pixels.
The specific approach is to first use an encoder to convert the video into an abstract mathematical representation, and then predict "what will happen" in this latent space. For example, predict the more long-term, physically consistent result that "the ball will roll off the table" instead of predicting frame after frame of the ball rolling.
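The contrast between pixel-space and representation-space prediction can be sketched minimally. Everything below is a toy illustration, not Meta's implementation: the "encoder" is a fixed random projection and the "predictor" is the identity map, whereas a real JEPA learns both as deep networks. The point is only that the prediction error lives in latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a real JEPA learns both networks end to end; here the
# "encoder" is a fixed random projection and the "predictor" is identity.
D_PIXELS, D_LATENT = 4096, 32                 # a 64x64 frame, flattened

W_enc = rng.normal(size=(D_PIXELS, D_LATENT)) / np.sqrt(D_PIXELS)

def encode(frame):
    """Map raw pixels into an abstract representation space."""
    return frame @ W_enc

def jepa_loss(frame_t, frame_next):
    """Score a prediction of the NEXT frame's embedding, not its pixels:
    the error is measured entirely in latent space."""
    z_hat = encode(frame_t)                   # identity "predictor"
    z_target = encode(frame_next)
    return float(np.mean((z_hat - z_target) ** 2))

frame_a = rng.normal(size=D_PIXELS)
frame_b = frame_a + 0.01 * rng.normal(size=D_PIXELS)   # tiny scene change
frame_c = rng.normal(size=D_PIXELS)                    # unrelated scene

# Nearby frames are close in latent space; unrelated frames are far.
print(jepa_loss(frame_a, frame_b), jepa_loss(frame_a, frame_c))
```

The latent distance between consecutive frames stays tiny even though thousands of pixels changed, which is the sense in which the model can ignore "each leaf" while still tracking what happens next.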
V-JEPA 2 is currently the representative work of this route. The model has 1.2 billion parameters and is pre-trained on 1 million hours of unlabeled video. Most strikingly, it needs only 62 hours of robot data to plan actions zero-shot: dealing with unfamiliar objects in unfamiliar environments, its success rate reaches 65-80%.
Traditional robot learning methods may require thousands of hours of demonstration data; V-JEPA 2 compresses that requirement dramatically.
Yann LeCun's exact words were: If the representation is good enough, you don't need to train from scratch for each task.
However, after founding AMI, the Turing Award winner and research giant has also had to yield to reality: he has said AMI's commercial products may not appear for several years.
This is a long-term bet, but capital is willing to take it. AMI's first round raised over $1 billion, with investors including nearly every well-known giant inside and outside the industry.
Spatial Intelligence School: Fei-Fei Li's "Architect" Route
If JEPA focuses on "temporal prediction," World Labs, founded by another giant of fundamental AI research, Fei-Fei Li, targets a different dimension: "spatial reconstruction."
The divergence between these two routes starts from the underlying logic.
JEPA believes that the core of intelligence is to predict "what will happen next" at an abstract level, so it doesn't care about pixel - level details and pursues efficient causal reasoning.
Fei-Fei Li's starting point is different. She believes that true intelligence requires an explicit understanding of the three-dimensional world, including geometric structures, depth relationships, and the relative positions of objects.
Put another way: JEPA wants to teach AI to understand the law that "the ball will roll off the table," while World Labs wants to teach AI to understand "how high the table is, where the ball is on the table, and what the distance between the floor and the table is."
The former cares about the logical chain of events, while the latter cares about the physical structure of space.
This difference directly determines the product form. In November 2025, World Labs released its first product, Marble. Given a text description, a photo, a video, or even a rough 3D sketch, Marble outputs not a video but a 3D world that is editable, navigable, and exportable.
You can rotate the view, move objects, and change lighting conditions inside it, then export the result as Gaussian splats, a triangle mesh, or video, and import it directly into Unreal Engine or Unity.
There is also a technical detail that is easily overlooked: many video-generation models can create beautiful pictures, but in essence, they are "telling stories" frame by frame without a unified 3D structure to support the frames.
The 3D scenes generated by Marble have "spatial consistency." There is a real spatial representation maintained at the bottom, so when you turn back, the world remains the same.
The team at World Labs is also worth noting: co-founder Ben Mildenhall is one of the inventors of NeRF (Neural Radiance Fields), which redefined how computer vision approaches 3D reconstruction; another co-founder, Christoph Lassner, is an expert in 3D graphics.
The knowledge structure of this team determines that World Labs has followed an "explicit 3D" route from the beginning, rather than implicitly inferring three-dimensional relationships from 2D video.
In February 2026, World Labs announced the completion of a $1 billion financing round, with investors including NVIDIA, AMD, and Autodesk.
The product Marble mentioned earlier has also been launched for ordinary users and commercial scenarios, and is being used by film studios and game developers.
However, Marble currently has obvious limitations: move more than a few steps through a generated 3D world and visual distortions, the so-called "hallucinations," begin to appear.
This contrasts with the "understanding of physical laws" pursued by the JEPA route: World Labs is good at reconstructing the "appearance" of space, but its understanding of "what will happen" in space is relatively weak.
Fei - Fei Li herself admitted that Marble is just the first step. She defines the ultimate goal as "spatial intelligence," which means that AI should not only understand the structure of a scene but also be able to reason, plan, and interact within it. This is a long - term goal, but the direction is clear: starting from explicit 3D spatial modeling and gradually adding an understanding of physics and causality.
Learning - based Simulation School: DeepMind's "Dream Maker"
DeepMind's Genie 3 may be the world-model effort closest to "magic" at present.
Google's route differs from the previous two schools. Rather than stopping at "understanding the world" or "reconstructing space," it goes a step further and more directly: create a virtual environment that is real enough and allows real-time interaction, so that AI can train practical skills directly inside it.
By inputting the phrase "rowing a boat in the Venice Canal during a storm," it can generate a 720p, 24fps 3D environment. You can control the character to move, operate props, and even change the weather in it.
If you break a vase, the fragments will stay on the ground. If you leave and come back, the fragments will still be there. That is to say, the "persistence" of Genie 3 is further refined from environmental persistence to "object permanence."
However, this places high requirements on the computing architecture. Shlomi Fruchter, the research director at DeepMind, said that to achieve real-time interaction, the model needs to query information from a minute ago multiple times per second.
Genie 3 behaves very much like a running game engine. However, after being hyped by online commentators, a common misconception has emerged: that Genie 3 is a replacement for game engines.
In fact, it is not. It has no real hard-coded physics engine; all behaviors are "learned" by the model from the training data.
This is both an advantage and a disadvantage. The advantage lies in flexibility: the model can infer physical properties and collision rules by itself. The disadvantage is that its physical simulation is still not as accurate as that of traditional, hard-coded engines.
As for persistence: limited by the computing-architecture pressure described above, Genie 3 can currently maintain coherence only for a few minutes, after which the picture starts to distort. For games, that is unacceptable.
So far, DeepMind has only solved the problem of "creating an environment." What about training the AI? That's where another Google project, Dreamer, comes in.
DreamerV4, published in October 2025, is a world-model framework that learns entirely in "imagination," without interacting with the real environment.
It became the first AI to mine diamonds in Minecraft purely from offline data. For context, mining a diamond from scratch requires more than 20,000 consecutive, precise mouse and keyboard operations: chopping trees, crafting tools, mining, smelting, all while avoiding monsters and handling various emergencies along the way.
Previously, OpenAI's VPT model needed 270,000 hours of labeled video plus 194,000 hours of online reinforcement learning to complete a similar task. DreamerV4 uses only one-hundredth of that amount of data.
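The "learning in imagination" idea can be sketched in a toy form: fit a world model from offline experience, then evaluate policies by rolling them out inside the model instead of the real environment. Everything below, the three-state chain, the two hand-written policies, the trivially "fitted" model, is an illustrative assumption, not DreamerV4's actual architecture.

```python
import numpy as np

# Offline "logged" experience summarized as (state, action) -> next_state, reward.
# In DreamerV4 this would be video plus actions; here it is a 3-state toy chain.
N_STATES, N_ACTIONS = 3, 2
true_next = np.array([[1, 0], [2, 0], [2, 2]])          # action 0 moves right
true_reward = np.array([[0., 0.], [0., 0.], [1., 0.]])  # reward only at state 2

# Step 1: fit a world model from offline data alone (here: a trivial copy;
# a real system learns a neural dynamics and reward model).
model_next, model_reward = true_next.copy(), true_reward.copy()

# Step 2: evaluate policies purely in "imagination", i.e. rollouts in the
# learned model, never touching the real environment.
def imagined_return(policy, horizon=5):
    s, total = 0, 0.0
    for _ in range(horizon):
        a = policy[s]
        total += model_reward[s, a]
        s = model_next[s, a]
    return total

go_right = np.zeros(N_STATES, dtype=int)   # always take action 0
stay_put = np.ones(N_STATES, dtype=int)    # always take action 1
best = max([go_right, stay_put], key=imagined_return)
print(imagined_return(best))               # → 3.0: "go right" reaches the reward
```

The same loop scaled up, a learned model of Minecraft plus policy improvement on imagined rollouts, is what lets Dreamer-style agents squeeze so much more out of a fixed offline dataset than behavior cloning alone.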
DeepMind is now pushing to combine the "generated environment" with the "virtual agent," training agents end to end in a fully virtual, closed-loop environment.
The core bet of Google's route is that although pixel - level generation doesn't equal physical understanding, if the generated environment is real and diverse enough, the agents trained in it may be able to generalize to the real world. This is an unproven hypothesis and also the biggest risk of this route.
Selling Water and Shovels: NVIDIA as an Infrastructure Supplier
The previous three routes each have their own technological ideals, but they all face the same real - world problem: Training world models requires an extremely large amount of data and computing power. Who will provide these basic conditions?
NVIDIA's Cosmos platform is answering this question. Its positioning is clear: You're all building world models? I'll provide the tools for building world models...
Cosmos includes several core components. First is the data-processing pipeline, Cosmos Curator, which can process 20 million hours of video in 14 days, accelerating the training of world models. In contrast, traditional CPU-based solutions would take more than 3 years to process the same amount of data.
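As a quick back-of-the-envelope check of that claim, taking "more than 3 years" as a lower bound for the CPU baseline:

```python
# Sanity check on the Cosmos Curator throughput comparison.
GPU_DAYS = 14
CPU_DAYS = 3 * 365              # "more than 3 years", taken as a lower bound

speedup = CPU_DAYS / GPU_DAYS
print(round(speedup))           # → 78: at least ~78x faster than the CPU baseline
```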
Second is the visual Tokenizer. Just as large language models split text into "tokens" for processing, world models need to split video frames into computable representations. Cosmos' Tokenizer has a compression ratio 8 times higher than industry solutions, supports various video ratios and durations, and can handle formats ranging from the first-person view of robots to the fisheye lenses of autonomous vehicles.
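The tokenization idea can be illustrated with the simplest possible scheme, splitting a frame into non-overlapping patches, one "token" per patch. A real tokenizer like Cosmos' then learns a much stronger compression on top of this grouping, so the code below is only a structural sketch.

```python
import numpy as np

# A 64x64 grayscale frame split into 8x8 patches -> 64 patch "tokens".
FRAME, PATCH = 64, 8
frame = np.arange(FRAME * FRAME, dtype=float).reshape(FRAME, FRAME)

def patchify(img, p):
    """Split an HxW image into (H/p * W/p) flattened patch vectors,
    one row per token."""
    h, w = img.shape
    patches = img.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

tokens = patchify(frame, PATCH)
print(tokens.shape)    # → (64, 64): 64 tokens, each covering an 8x8 patch
# A learned tokenizer would further compress each patch vector into a short
# (often discrete) code; here the only "compression" is the patch grouping.
```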
Finally, there are three crucial pre-trained model families: the prediction model Cosmos Predict, which is responsible for predicting the future state of the environment; the simulation model Cosmos Transfer, which transfers simulations to the real world; and the reasoning model Cosmos Reason, which enables robots to make plans. These pre-trained models are all released under open licenses, and developers can download them for free.