
When the world model arrives, how should AI trainers re-understand their work?

人人都是产品经理 (Everyone Is a Product Manager) · 2026-03-18 09:36
There are some fundamental changes taking place in AI.

From the 'library-style intelligence' of large language models, to the 'visual mapping' of multimodal models, to world models that give AI the ability to predict physical laws, this paradigm shift is not only reshaping the technical roadmap but also pushing trainers from data annotators toward 'world-rule designers'. Writing from the perspective of a frontline AI trainer, this article analyzes the logic behind the sensation Sora caused and the quieter front lines where the major tech companies are positioning themselves.

To be honest, when I first entered the field as an AI trainer, my understanding of this position was very vague.

Annotating data, writing prompts, providing RLHF feedback, and evaluating the output quality of models... The daily work seemed like a series of scattered tasks. It was hard to say exactly what I was training and in which direction I was heading.

It wasn't until the concept of world models began to appear frequently in my field of vision that I truly felt for the first time that something fundamental was changing in AI. It wasn't about getting smarter or faster; it was about starting to understand the world.

In this article, from the perspective of an AI trainer, I'd like to talk about what world models are, what the relationship is between them and the large language models and multimodal models we are already familiar with, and what this paradigm shift means for those of us working on the front line of AI training.

LLMs Misled Us about "Intelligence"

Before talking about world models, I'd like to first discuss an important misunderstanding brought to us by large language models.

After the emergence of ChatGPT, many people, including myself, really thought for a while that AGI was coming soon. GPT-4 can pass the bar exam, write articles comparable to those written by humans, explain quantum mechanics, and help you debug code... The combination of these abilities makes it hard not to have the illusion that this thing "understands" a lot.

However, in actual work, you will gradually notice some strange gaps.

When I was doing RLHF annotation, I once gave the model a very simple spatial reasoning question: There is an apple on the table. There is a book next to the apple, and there is a glass of water on the left side of the book. Question: What is the relative position between the apple and the water?

The model's answers were unstable: sometimes correct, sometimes wrong. Worse, when you asked it why it made that judgment, it could produce a perfectly plausible-sounding explanation regardless of whether the answer was right.
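The contrast is easy to see in code. If the scene is represented as explicit spatial structure rather than text, the question has a single deterministic answer. The coordinates below are my own illustrative choice, with smaller x meaning further left:

```python
# The scene from the prompt, encoded as positions on a 1-D axis.
# These coordinates are hypothetical; "left" means a smaller x value.
scene = {"apple": 2.0, "book": 1.0, "water": 0.0}

def relative_position(scene, a, b):
    """Return where object `a` sits relative to object `b` on the x axis."""
    if scene[a] > scene[b]:
        return f"{a} is to the right of {b}"
    if scene[a] < scene[b]:
        return f"{a} is to the left of {b}"
    return f"{a} and {b} are at the same position"

print(relative_position(scene, "apple", "water"))
# -> apple is to the right of water
```

Given an explicit model of the scene, the answer never wavers; an LLM instead reconstructs the scene from word statistics on every query, which is why its answers fluctuate.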

This made me realize one thing: LLMs master "the language description of the world" rather than "the operating laws of the world itself."

This difference may sound subtle, but in fact, it is very fundamental.

For example, imagine a person who has never left the library. He has read all the books about swimming, can recite the technical essentials of the butterfly stroke, can analyze the details of Phelps' movements, and can write a professional swimming teaching article. However, if you throw him into the swimming pool, he will probably sink.

LLMs are like this person in the library.

An LLM's training goal is to predict the probability distribution of the next word given all the previous words. In mathematical terms, it maximizes P(token_t | token_1, ..., token_{t-1}). This objective lets it learn the statistical patterns of human language, but the statistical patterns of language do not equal the causal laws of the world.
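As a minimal illustration of that objective, here is the per-step cross-entropy loss on a made-up next-token distribution (the probabilities are invented, not taken from any real model):

```python
import math

# Toy illustration of the LLM objective: maximizing P(token_t | previous tokens)
# is equivalent to minimizing the cross-entropy of the correct next token.
def next_token_loss(predicted_probs, target_token):
    """Cross-entropy loss for a single prediction step."""
    return -math.log(predicted_probs[target_token])

# Hypothetical model distribution over the next token after "fire is":
probs = {"hot": 0.90, "cold": 0.02, "bright": 0.08}

loss = next_token_loss(probs, "hot")
print(round(loss, 3))  # -> 0.105 (low loss: "hot" is statistically very likely)
```

Nothing in this loss ever asks *why* "hot" follows "fire is"; frequency alone drives the gradient.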

The sentence "Fire is hot" appears countless times in the training data, so an LLM "knows" that fire is hot. What it doesn't know is what happens if you reach your hand toward the fire: how heat is conducted to your skin, at what temperature the proteins in your skin start to denature, and whether that process is reversible.

LLMs have always been absent in understanding the "whys" behind these "knowings."

Multimodal Models Enable AI to "See" but Not "Experience"

The emergence of multimodal models is an important step forward.

When models like GPT-4V and Gemini can understand images, we gain a new dimension of ability: AI begins to be able to perceive the visual world. OCR, image description, visual question answering... These abilities have great value in many practical application scenarios.

However, the essence of multimodal models is to establish a mapping relationship between visual features and language descriptions.

What it learns is: this visual pattern corresponds to that language description. A picture of a cat corresponds to the word "cat" and all the language knowledge about cats. The more accurately this correspondence is learned, the stronger the model's multimodal ability.
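This "mapping between visual features and language descriptions" can be pictured as embedding both into a shared space and comparing them, in the spirit of contrastive models like CLIP. The vectors below are invented purely for illustration:

```python
# Minimal sketch of cross-modal alignment: image and caption embeddings of a
# matching pair should be more similar than those of a mismatched pair.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot(u, v) / (norm(u) * norm(v))

image_cat = [0.9, 0.1, 0.0]  # hypothetical embedding of a cat photo
text_cat  = [0.8, 0.2, 0.1]  # hypothetical embedding of the caption "a cat"
text_car  = [0.0, 0.1, 0.9]  # hypothetical embedding of the caption "a car"

# A well-trained multimodal model scores the matching pair highest:
print(cosine(image_cat, text_cat) > cosine(image_cat, text_car))  # -> True
```

Note what is absent from this picture: time, actions, and consequences. The mapping is frozen, which is exactly the limitation the next paragraph describes.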

The problem is that this is still a static and superficial understanding.

For example, if you show a multimodal model a photo of a billiard table and then a photo of the moment when the cue hits the ball, it can tell you that this is a billiards game and describe the color and position of the balls. But if you ask it: Where will the ball move after being hit? Will it bounce off the border? Where will it finally stop? These questions involve the prediction of physical trajectories, and the performance of the multimodal model will become very unstable.

The reason is simple: The multimodal model has seen countless pictures of billiards, but it has never "acted" in the billiards world.

Seeing and experiencing are two fundamentally different sources of intelligence.

Humans have intuition and physical common sense because we have been exploring the real world since infancy. Through countless actions and feedback, we have established a set of operating models of the world in our brains. When you see a cup placed on the edge of the table, you instinctively feel worried. This intuition doesn't come from books but from the experience of accidentally breaking a cup.

LLMs have not experienced, and multimodal models still have not experienced.

And world models are exactly designed to address this lack of "experience."

World Models: AI Begins to "Predict the World" for the First Time

The concept of world models is not new.

In 2018, AI researcher David Ha and deep-learning pioneer Jürgen Schmidhuber published a paper titled "World Models," systematically proposing this framework. Their core idea is that in order for an agent to act in the world, it must build an internal model of the world to predict the consequences of actions and then decide what actions to take.

This idea is actually very similar to the human cognitive way.

When you are driving, your brain doesn't process all sensor data in real time to make decisions. Instead, based on your understanding of road rules, you continuously predict what will happen ahead and make judgments on the basis of those predictions. This "understanding of road rules" is the world model in your brain.

In more technical terms, the core training goal of world models is:

Given the current state S and an action A, predict the next state S'.
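The (S, A) → S' objective can be illustrated with a hand-written toy simulator. A learned world model would approximate a transition function like this one; the mass, timestep, and simple Euler integration here are my own illustrative choices:

```python
# Toy world model: given state S = (position, velocity) and action A = force,
# predict the next state S'. This is hand-coded physics, not a learned model,
# but it shows the (S, A) -> S' interface the text describes.
def predict_next_state(state, action, mass=1.0, dt=0.1):
    position, velocity = state
    acceleration = action / mass              # Newton's second law: a = F / m
    new_velocity = velocity + acceleration * dt
    new_position = position + new_velocity * dt
    return (new_position, new_velocity)

s = (0.0, 0.0)                     # at rest at the origin
s_next = predict_next_state(s, action=10.0)
print(s_next)                      # -> (0.1, 1.0): the push sets the object moving
```

The crucial difference from the LLM objective: the action A is an explicit input, so the model must learn how *doing something* changes the world, not just how descriptions co-occur.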

Compared with the training goal of LLMs, this simple formula has three fundamental differences:

First, it introduces the "action" dimension. LLMs predict words, multimodal models predict content, while world models predict "what will happen to the world after an action." This means that for the first time, AI truly combines "doing things" and "understanding."

Second, it establishes a causal relationship rather than a statistical relationship. If I push this cup, the cup will fall - this is a causal relationship. LLMs know that "the cup fell" and "push" often appear together in language, but it doesn't understand the causal chain between thrust, center of gravity, and friction. What world models need to learn is exactly this causal chain.

Third, it supports "counterfactual reasoning." This is the most exciting point for me. Counterfactual reasoning means: if I don't do this but do that instead, how will the result differ? This "imaginary trial-and-error" ability is the basis of planning and decision-making and an important part of human intelligence. A real world model should be able to simulate multiple possible futures internally and choose the optimal action path.
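Counterfactual planning can be sketched in a few lines, under heavy simplification: the agent never acts in the real world; it only "imagines" each candidate action inside its internal model and compares the outcomes. The 1-D simulator and all numbers below are invented:

```python
# Minimal sketch of counterfactual planning: simulate each candidate action
# inside an internal model, compare the imagined futures, pick the best one.
def imagine(position, action, steps=3):
    """Roll the internal model forward: each step moves the agent by `action`."""
    for _ in range(steps):
        position += action
    return position

def plan(start, goal, candidate_actions):
    """Choose the action whose imagined final position lands closest to the goal."""
    return min(candidate_actions, key=lambda a: abs(imagine(start, a) - goal))

best = plan(start=0.0, goal=2.9, candidate_actions=[-1.0, 0.5, 1.0, 2.0])
print(best)  # -> 1.0: three imagined steps of 1.0 reach 3.0, closest to 2.9
```

The actions that were *not* chosen are the counterfactuals: the model evaluated their consequences without ever paying the cost of executing them.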

Why Sora Shocked the Entire Industry

In February 2024, when OpenAI released Sora, it was actually the first time that the concept of world models was presented in a way that everyone could understand.

On the surface, Sora is a video-generation model. But what really shocked industry insiders is not how beautiful the videos it generates are, but the physical consistency shown in those videos.

Water flows downhill instead of randomly dispersing. When a collision occurs, the deformation direction of the object conforms to the force analysis. When the camera switches from one angle to another, the lighting relationship in the scene remains correct. When a ball is kicked, its movement trajectory conforms to a parabola instead of random movement.

These details were not explicitly taught to Sora. No one marked in the training data that "the water here should flow to the left," and no one wrote code to specify how the lighting should be calculated. These physical laws emerged spontaneously after the model was trained on a large amount of video data.

There is a passage in OpenAI's technical report on Sora that I think is the most important part of the whole article:

"We believe that video generation models are a promising path to a general simulator of the physical world."

This sentence contains a lot of information. It means that when you train a large - enough model to predict the next frame of a video, it will be forced to learn the physical laws of the world because only by understanding the physical laws can it correctly predict what the next frame should look like.

This is a very elegant design of training signal. A video is itself a causal sequence: each frame is the result of the previous frame evolving according to physical laws. By predicting that sequence, the model quietly learns physics while learning to generate video.
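To make the "next frame as training signal" idea concrete, here is a deliberately tiny sketch (the 1-D "frames" and the motion rule are invented): a ball drifts one cell to the right per frame, and a predictor that has internalized that rule reproduces every next frame.

```python
# Toy next-frame objective: predict frame t+1 from frame t.
# A "frame" is a 1-D list of cells; 1 marks the ball's position.
def predict_next_frame(frame):
    """A predictor that has 'learned' the physics: shift the ball right one cell."""
    ball = frame.index(1)
    nxt = [0] * len(frame)
    nxt[min(ball + 1, len(frame) - 1)] = 1
    return nxt

video = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

# Training signal: each true next frame supervises the prediction before it.
for t in range(len(video) - 1):
    assert predict_next_frame(video[t]) == video[t + 1]
print("predictions match the clip's physics")
```

A predictor with the wrong internal rule (say, shifting left) would incur loss on every frame, which is exactly the pressure that forces a large video model toward the real dynamics.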

Of course, Sora is far from a perfect world model. Sometimes it generates physically absurd content: a person stands up from a chair but the chair vanishes into thin air, the direction of a water-surface reflection is inconsistent with the light source, object properties drift over a long video... These errors show precisely that its world model is incomplete and fragmented.

But it shows a feasible direction.

The Essential Differences among the Three Routes: A Trainer's Understanding

In my work, I have gradually formed a framework for understanding these three types of models, and I think the phrase "cognitive levels" describes it most accurately.

Large language models solve the problem of "what to know."

Their core abilities are knowledge storage and retrieval, plus language generation and understanding. They know what has happened in history, the formulations of scientific laws, and how to explain things clearly. This is very valuable, but the limitation is that the model knows "the description of the world" rather than "the world itself."

Multimodal models solve the problem of "what to see."

Their core ability is perception: converting sensory signals such as vision and hearing into semantic understanding. They can interpret pictures, understand what is happening in videos, and associate information across modalities. This extends AI's cognitive scope from language to perception. However, it is still a static, screenshot-like understanding, lacking any modeling of temporal dynamics and action consequences.

World models solve the problems of "what will happen" and "how to do."

Their core abilities are prediction and planning. They need to understand not the static attributes of things but the dynamic causal chains. They should be able to answer: if I do this, what will the world become? Which path gets me to my goal? What happens when this thing hits that thing?

From the perspective of a trainer, the data requirements of these three types of models are completely different.

LLMs need a large amount of high-quality text, with the core being wide coverage and accurate language. Multimodal models need high-quality image-text or video-text pairs, with the core being accurate alignment between modalities. And world models need interaction sequences with action annotations: not only "what happened" but also "what actions led to what happened."
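As a hypothetical sketch, the three kinds of training samples contrasted above might look like this (every field name and file name is invented for illustration):

```python
# Hypothetical examples of the three data formats the text contrasts.

# 1. LLM training data: just text.
llm_sample = {"text": "Fire is hot."}

# 2. Multimodal training data: an observation paired with its description.
multimodal_sample = {
    "image": "cat_photo.jpg",                 # placeholder file name
    "caption": "A cat sitting on a windowsill.",
}

# 3. World-model training data: state, ACTION, and resulting state.
world_model_sample = {
    "state_t": "frame_0041.png",              # what the world looked like
    "action": {"type": "push", "target": "cup", "force_newtons": 2.0},
    "state_t_plus_1": "frame_0042.png",       # what the action made it become
}

# Only the world-model sample records *what was done*, not just what was seen.
print("action" in world_model_sample)  # -> True
```

Collecting and annotating that middle field, the action and its parameters, is precisely where the annotation burden explodes.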

This represents an order-of-magnitude leap in the requirements for data collection and annotation.

Why World Models Are Booming Now

I have thought about this question for a long time. The concept of world models is not new; the foundational paper appeared in 2018. So why did it suddenly become the hottest topic in the industry around 2024?

I think several factors have combined to cause this boom.

The first factor is that the scaling law of LLMs has started to hit the ceiling.

From GPT-3 to GPT-4, each significant increase in parameter count brought amazing leaps in capability. After GPT-4, however, those leaps have narrowed noticeably. Training costs have climbed from tens of millions of dollars to hundreds of millions, yet the capability gains are increasingly hard for users to perceive as revolutionary.

What's more troublesome is the data problem. Some research institutions estimate that the high-quality text available on the Internet will be essentially "exhausted" by mainstream models between 2026 and 2028. The route of simply piling up more data is hitting a physical limit.

The industry has begun to realize that simply working in the language space may really have reached its limit.

The second factor is that the demand for embodied intelligence has suddenly become very urgent.

In 2024, financing in the humanoid-robot sector reached a historical high. Companies focused on general-purpose robots, such as Figure AI, 1X Technologies, and Physical Intelligence, raised large rounds. At the same time, Tesla's Optimus has started performing real tasks in the factory, and Boston Dynamics' robots are also accelerating toward commercialization.

Robots need to work in the real physical world. They must understand physical laws, be able to predict the consequences of actions, and be able to plan in real - time in an uncertain environment. These requirements cannot be directly met by LLMs and multimodal models.

And world models are exactly the core infrastructure of the robot brain.

The third factor is that Sora has proven the feasibility of this route.

Before Sora, world models were more of an academic concept, with many engineering problems still unsolved. Sora's emergence demonstrated that large-scale video pre-training can give a model physical-understanding capabilities, and that this route is feasible.

This has given a very strong signal to the entire industry: The next important paradigm has its first convincing engineering case.

The fourth factor is that Yann LeCun of Meta has been continuously "leading the trend."

As one of the three recipients of the 2018 Turing Award for deep learning, Yann LeCun has spent the past two years publicly arguing that the existing LLM route can never achieve AGI, and that true general intelligence must be built on world models. The JEPA family of architectures he leads at Meta is currently one of the most influential academic routes in world-model research.

When a researcher of this level continuously and publicly supports a direction, the flow of capital and talent will change accordingly.

The combination of these four factors has created the background for the sudden popularity of world models in 2024.

What Are the Big Tech Companies Doing?

Understanding the layout of big tech companies is very helpful for understanding the development direction of this field.

OpenAI's route is the most ambiguous and the most intriguing. Sora is currently the commercial product closest to the concept of a world model, but OpenAI has not explicitly defined it as one, calling it instead a "simulator of the physical world." At the same time, the o1 and o3 series are pushing toward deeper reasoning ability, letting the model "think" longer before answering. How these two lines will eventually be integrated is a question the entire industry is speculating about.

Meta's route is the clearest. The JEPA architecture led by LeCun, combined with the video-dynamics modeling of V-JEPA 2.0, is currently the most systematic world-model research program in academia. Meta's strategy is to open-source these results, build an advantage in academic influence, and at the same time stockpile technology for its AR-glasses and robot projects.

Google DeepMind is taking a multi-pronged approach. Gemini is responsible for general multimodal capabilities, the Genie series focuses on learning an interactive world model from videos, and there