
The five major sects of world models besiege the Bright Summit.

爱范儿 · 2026-04-17 16:01
Read ten thousand books, then travel ten thousand miles.
  • After the Spring Festival, AMI, the new company founded by Turing Award winner Yann LeCun, secured $1.03 billion in financing, a record seed round for a European AI company.
  • A few weeks before AMI received its funding, World Labs, founded by Fei-Fei Li, also announced $1 billion in financing.
  • Earlier this week, Jijia Vision raised several billion yuan, at a valuation in the tens of billions.
  • Yesterday, Alibaba released the world model "Happy Oyster" (HappyOyster).
  • Today, Qunhe Technology rang the bell at the Hong Kong Stock Exchange.

These companies are all competing in the same field: world models.

Yann LeCun once said, "Large language models are a dead end on the road to superintelligence." At first glance this seems to deny the value of large language models, but note the qualification: the dead end is specifically the road to AGI. On reflection, there is some truth to it.

We can understand it simply: ChatGPT can write code and solve problems, but it cannot work out the basic laws of the physical world. Ask it to describe "an apple falling to the ground" and it can talk eloquently; ask it why the apple falls, and it is really just reciting from memory; it may not truly understand gravity.

The root of the problem is that the training data for large language models is internet text, while the real world is three-dimensional, continuous, and full of physical laws.

This is why world models have become the next focus for the research elite.

However, although everyone is talking about world models, there is no consensus on what the term means. Some routes do video prediction, some build 3D scenes, some construct simulation platforms, and some start directly from neuroscience...

Zhuokai Zhao, a research scientist at Meta, shared on his X account what he believes are the "five major sects" of world models.

What are these five major sects?

This article is based on Zhao's tweets and expands on them with research from multiple sources. We hope it is helpful if you are interested in understanding world models.

JEPA School: Yann LeCun's "Abstract Philosophy"

JEPA stands for Joint-Embedding Predictive Architecture. It is a new type of AI architecture proposed by Yann LeCun and others.

Simply put, the core idea of JEPA is to let AI learn how the world works through "observation," the way humans do, rather than rote-learning pixels or words: the AI doesn't need to remember the position of every leaf; it just needs to know that the wind will blow leaves off.

In Yann LeCun's view, models like Sora are essentially "predicting the next frame pixel by pixel." He believes this is physically impossible because in a world full of randomness, you can't accurately predict the falling trajectory of every leaf.

JEPA's solution is to make predictions in an abstract "representation space" rather than predicting pixels.

The specific approach is to first use an encoder to convert the video into an abstract mathematical representation, and then predict "what will happen" in this latent space. For example, predict the longer-term, physically consistent outcome "the ball will roll off the table" instead of predicting, frame by frame, the ball rolling.
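To make the contrast with pixel prediction concrete, here is a minimal PyTorch sketch of joint-embedding prediction. Everything here (the module sizes, the single shared encoder, the MSE objective) is illustrative, not V-JEPA's actual architecture; the point is only that the prediction and the loss live in latent space, never in pixels.

```python
import torch
import torch.nn as nn

# Illustrative joint-embedding predictive setup (not V-JEPA's real design).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256))  # frame -> embedding
predictor = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))

context_frame = torch.randn(8, 3, 64, 64)  # what the model observes now
future_frame = torch.randn(8, 3, 64, 64)   # what actually happens next

z_context = encoder(context_frame)
with torch.no_grad():                      # target branch gets no gradient,
    z_target = encoder(future_frame)       # as in stop-gradient/EMA schemes

z_pred = predictor(z_context)              # predict the future *representation*
loss = nn.functional.mse_loss(z_pred, z_target)  # loss in latent space, not pixels
loss.backward()
```

If the encoder discards unpredictable detail (the exact flutter of each leaf) while keeping predictable structure (leaves fall when the wind blows), the predictor's job becomes tractable, which is exactly the bet this school is making.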

V-JEPA 2 is currently the representative work of this route. The model has 1.2 billion parameters and is pre-trained on 1 million hours of unlabeled video. Most impressively, it needs only 62 hours of robot data to achieve zero-shot action planning: in an unfamiliar environment, handling unfamiliar objects, its success rate reaches 65-80%.

Compared with traditional robot-learning methods, which may require thousands of hours of demonstration data, V-JEPA 2 drastically reduces the data requirement.

In Yann LeCun's words: if the representation is good enough, you don't need to train from scratch for each task.

However, after founding AMI, this Turing Award-winning research giant also has to face reality. He has said that AMI's commercial products may be several years away.

This is a long-term bet, but capital is willing to take the risk. AMI has already received over $1 billion in its first round of financing, with investors including nearly every well-known giant, inside the industry and beyond, that you can name.

Spatial Intelligence School: Fei-Fei Li's "Architect" Route

If JEPA focuses on "temporal prediction," World Labs, founded by another giant of fundamental AI research, Fei-Fei Li, has targeted a different dimension: "spatial reconstruction."

The divergence between these two routes starts from the underlying logic.

JEPA believes that the core of intelligence is to predict "what will happen next" at an abstract level, so it doesn't care about pixel-level details and pursues efficient causal reasoning.

Fei-Fei Li's starting point is different. She believes that true intelligence requires an explicit understanding of the three-dimensional world, including geometric structures, depth relationships, and the relative positions of objects.

Put another way: JEPA wants to teach AI the rule that "the ball will roll off the table," while World Labs wants to teach AI how high the table is, where the ball sits on the table, and how far the floor is below it.

The former cares about the logical chain of events, while the latter cares about the physical structure of space.

This difference directly determines the product form. In November 2025, World Labs released its first product, Marble. By inputting a text description, a photo, a video, or even a rough 3D sketch, Marble outputs not a video but an editable, navigable, and exportable 3D world.

You can rotate the camera, move objects, and change the lighting, then export the result as Gaussian splats, a triangle mesh, or video, ready to be dragged straight into Unreal Engine or Unity.

There is also an easily overlooked technical detail: many video-generation models can produce beautiful images, but they are essentially "telling a story" frame by frame, with no unified 3D structure supporting the frames.

The 3D scenes Marble generates, by contrast, have "spatial consistency." A genuine spatial representation is maintained underneath, so when you turn back, the world is still the same.

World Labs' team lineup is also worth mentioning: co-founder Ben Mildenhall is an inventor of NeRF (Neural Radiance Fields), which redefined 3D reconstruction in computer vision; another co-founder, Christoph Lassner, is an expert in 3D graphics.

This knowledge base determines that World Labs has followed an "explicit 3D" route from the beginning, rather than inferring 3D relationships implicitly from 2D video.

In February 2026, World Labs announced the completion of $1 billion in financing, with investors including NVIDIA, AMD, and Autodesk.

Marble, mentioned above, has also launched for ordinary users and commercial scenarios, and is already used by film and television studios and game developers.

However, Marble currently has obvious limitations. After you walk a few steps in a generated 3D world, visual distortions, so-called "hallucinations," start to appear.

This is in contrast to the "understanding of physical laws" pursued by the JEPA route: World Labs is good at reconstructing the "appearance" of space, but its understanding of "what will happen" in space is relatively weak.

Fei-Fei Li herself admits that Marble is only the first step. She defines the ultimate goal as "spatial intelligence": AI that can not only understand the structure of a scene but also reason, plan, and interact within it. The road is still long, but the direction is clear: start from explicit modeling of 3D space and gradually add an understanding of physics and causality.

Learning-based Simulation School: DeepMind's "Dream Maker"

DeepMind's Genie 3 may be the world model approach closest to "magic" at present.

Google's route differs from the previous two. Rather than "understanding the world" or "reconstructing space," it aims at something more forward-looking and direct: create a virtual environment real enough, and interactive in real time, for AI to develop real-world skills directly inside it.

Feed it the sentence "rowing a boat on a Venetian canal during a storm," and it generates a 720p, 24fps 3D environment in which you can move a character, handle props, and even change the weather.

If you break a vase, the fragments stay on the ground; if you leave and come back, they are still there. In other words, Genie 3's "persistence" has been refined from environment persistence into "object permanence."

However, this places high demands on the computing architecture. Shlomi Fruchter, a research director at DeepMind, said that to achieve real-time interaction, the model needs to query information from a minute ago multiple times per second.

Put that way, Genie 3 sounds very much like a running game engine. And after exaggeration by self-media accounts, a common misunderstanding took hold: that Genie 3 is a substitute for game engines.

Actually, it is not. Genie 3 has no real hard-coded physics engine; all of its behaviors are "learned" by the model from training data.

This is both an advantage and a disadvantage. The advantage is flexibility: the model can infer physical properties and collision rules on its own. The disadvantage is that its physical simulation is still not as accurate as a traditional, hard-coded engine.

As for persistence: limited by the computing architecture and compute mentioned above, Genie 3 can currently maintain coherence for only a few minutes before the picture starts to distort, which is unacceptable for games.

So far, DeepMind has only solved the problem of "creating an environment." What about training AI in it? That is where another Google project, Dreamer, comes in.

DreamerV4, published in October 2025, is a world-model framework that can learn entirely in "imagination," without interacting with the real environment.
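As a rough illustration of "learning in imagination," here is a toy sketch in the same spirit; the module names, sizes, and single-step reward model are all placeholders, not DreamerV4's actual design. The key point is that the policy gradient comes entirely from rollouts inside the learned world model, with zero environment steps.

```python
import torch
import torch.nn as nn

LATENT, ACTIONS, HORIZON = 32, 4, 15

# Toy stand-ins for Dreamer-style components (illustrative only).
world_model = nn.Linear(LATENT + ACTIONS, LATENT)  # predicts the next latent state
reward_head = nn.Linear(LATENT, 1)                 # predicts reward from a latent state
policy = nn.Linear(LATENT, ACTIONS)                # the actor being trained

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(16, LATENT)       # latents encoded from offline data
imagined_return = torch.zeros(16, 1)

# Roll out entirely inside the world model: no environment interaction at all.
for _ in range(HORIZON):
    action = torch.softmax(policy(state), dim=-1)
    state = world_model(torch.cat([state, action], dim=-1))
    imagined_return = imagined_return + reward_head(state)

loss = -imagined_return.mean()        # maximize the imagined return
opt.zero_grad()
loss.backward()
opt.step()
```

In the real framework the world model itself is first trained on logged experience; that part is skipped here, which is why this can only hint at the data efficiency described below.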

DreamerV4 became the first AI to mine diamonds in Minecraft purely from offline data. Bear in mind that mining diamonds from scratch requires a chain of more than 20,000 precise mouse and keyboard operations: chopping trees, crafting tools, mining, smelting, and, along the way, dodging monsters and handling all kinds of emergencies.

Previously, OpenAI's VPT model needed 270,000 hours of labeled video and 194,000 hours of online reinforcement learning to complete a similar task. DreamerV4 uses only about one-hundredth of that data.

DeepMind is now pushing to combine the "generated environment" with the "virtual agent," training in a fully virtual but closed-loop setting.

The core bet of Google's route is that although pixel-level generation doesn't equal physical understanding, if the generated environments are real and diverse enough, agents trained in them may generalize to the real world. This is an unproven assumption, and the biggest risk of this route.

Selling Water and Shovels: NVIDIA as an Infrastructure Supplier

The three routes above each have their own technological ideals, but they all face the same practical problem: training world models requires enormous amounts of data and compute. Who provides this foundation?

NVIDIA's Cosmos platform is answering this question. Its positioning is clear: You're all building world models? I'll provide the tools for building world models...

Cosmos contains several core components. First is the data-processing pipeline, Cosmos Curator, which can process 20 million hours of video in 14 days, accelerating world-model training. By contrast, a traditional CPU-based pipeline would take more than three years to process the same volume.
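The implied speedup is easy to sanity-check, taking the article's "more than three years" as a lower bound of exactly three years (an assumption, since no precise baseline is given):

```python
# Rough speedup implied by the article's numbers.
# Assumption: the CPU baseline is exactly 3 years (the article only says "more than three").
cpu_days = 3 * 365
cosmos_days = 14
print(f"speedup >= {cpu_days / cosmos_days:.0f}x")  # speedup >= 78x
```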

Second is the visual tokenizer. Just as large language models break text into "tokens," world models need to break video frames into computable representations. Cosmos' tokenizer compresses 8x more than comparable industry solutions, supports various aspect ratios and durations, and handles formats ranging from robots' first-person views to the fisheye lenses of autonomous driving.
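To make "breaking frames into computable representations" concrete, here is a toy patch-based tokenizer. Real tokenizers like Cosmos' are learned neural codecs with far higher compression, so treat this only as an illustration of the interface.

```python
import torch
import torch.nn as nn

PATCH = 16

# Toy visual tokenizer: carve one RGB frame into 16x16 patches,
# then embed each patch as a 64-dim "token" (illustrative only).
frame = torch.randn(1, 3, 64, 64)
patches = frame.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * PATCH * PATCH)

embed = nn.Linear(3 * PATCH * PATCH, 64)  # one linear map per flattened patch
tokens = embed(patches)
print(tokens.shape)  # torch.Size([1, 16, 64]): 16 tokens for a 64x64 frame
```

Downstream world models then predict over these token sequences much as a language model predicts over text tokens.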

Finally, there are three key families of pre-trained models: the prediction model Cosmos Predict, responsible for predicting the future state of an environment; the simulation model Cosmos Transfer, which migrates simulations to the real world; and the reasoning model Cosmos Reason, which lets robots make plans. These pre-trained models