The Revelation of the First Year of World Models: Motivations, Turmoil, and Hidden Reefs
On April 16th, Alibaba released the open-world model Happy Oyster, and Tencent open-sourced the 3D world model HY-World 2.0.
With these same-day releases, China's two internet giants asserted their presence in the world model race.
Less than a month prior, Fei-Fei Li's World Labs completed a $1 billion financing round, and Yann LeCun's AMI Labs shocked Silicon Valley with a $1.03 billion seed round.
Capital, giants, and entrepreneurs are flocking in, and a resounding slogan has quickly spread through the industry: World models are the most important track after large language models.
But if you really ask these players "What exactly is a world model?", you're likely to get a bunch of contradictory answers.
Some say it's an "interactive 3D world", some say it's a "causal model that understands physical laws", some say it's a "digital simulator for robot training", and some simply say it's "more advanced video generation".
This isn't a disagreement in academic discussions; it's the cognitive chaos the entire track is going through.
This article attempts to sort out this chaos. We'll start with three progressive questions: Why are all the big tech companies suddenly betting on world models? What exactly are their products doing, and which aspects are real and which are illusory? And, how deep are the dilemmas and ambiguous areas hidden behind the glory?
I. Why the sudden all-in on world models?
To understand why world models have suddenly become so popular, we need to go back to an awkward fact about large language models.
In the past two years, ChatGPT and similar models have demonstrated amazing language abilities, but they've also exposed a fatal flaw: They don't understand the physical world.
If you ask an LLM "What will happen if you push a cup off the edge of a table?", it can answer "The cup will fall to the ground", but it doesn't really understand gravity, acceleration, or collision. It just remembers similar sentences from the training data.
A study in early 2026 pointed out that hallucinations aren't a data problem or a training problem; they're an inherent flaw in the LLM architecture.
This flaw might be tolerable in pure text tasks, but when AI enters the real world - controlling robots, driving cars, or working in factories - it becomes an insurmountable obstacle. You can't have an autonomous driving model make "approximately correct" judgments about obstacles ahead, and you can't have an industrial robot "roughly" predict the movement trajectory of parts.
So, a more fundamental need has emerged: We need an AI that can understand the causal laws of the physical world.
It not only needs to be able to talk but also to act; not only to see but also to predict. This is the fundamental reason why world models have been pushed into the spotlight.
Large language models have changed the relationship between humans and information, while world models aim to change the relationship between humans and reality.
In the past two years, the commercialization of AI has mainly been limited to information processing, such as writing copy, doing translations, and generating code. But the next wave of growth engines is clearly in the physical world: embodied intelligence, autonomous driving, and intelligent manufacturing.
The common requirement for these scenarios is that AI must understand space, predict dynamics, and plan actions.
So, when big tech companies bet on world models, they're essentially competing for the technological high ground in the "post-LLM era". Whoever enables AI to truly understand the physical world first will dominate the next industrial cycle.
The approaches of players at home and abroad are very different.
In the United States, DeepMind, World Labs, and AMI Labs are more like doing basic science.
They're concerned with how to give AI physical intuition and causal reasoning abilities like humans, and commercialization is a long-term goal. Yann LeCun himself admitted that AMI's products might not be available for several years.
In China, it's a different story. Alibaba and Tencent almost immediately tied their models to commercial scenarios when they released them: Happy Oyster targets paying users in film and television production and game development, and HY-World 2.0 directly outputs 3D assets that can be imported into Unity/UE, starting a business in AI world-building.
There's also VidMuse from Sand.ai, which focuses on the niche scenario of generating videos from music and reached annualized revenue in the tens of millions of dollars within a few months of launch.
The logic of Chinese teams is very practical: A world model must first be a profitable product.
These two routes aren't superior or inferior to each other, but they determine their respective rhythms and risks. US teams are willing to bet on breakthroughs a decade from now, while Chinese teams must see returns within a year.
The problem is that when everyone is shouting slogans under the same hot term, it's hard for outsiders to tell what each is doing.
II. Interrogating the technical standards
If you take the time to read the introductions of various products, you're likely to be even more confused. Because each world model looks different, and their underlying logics are even contradictory.
Let's first look at the most counterintuitive faction. Yann LeCun's AMI Labs has taken a path that few dare to follow. They don't think AI needs to generate realistic images.
LeCun's JEPA architecture deliberately discards pixel details and only makes predictions in the abstract latent space. The newly released LeWorldModel has only 15 million parameters and can be trained in a few hours on a single GPU, but its planning speed is 48 times faster than traditional methods.
The drawback is that its output is incomprehensible to humans. You can't "see" the future it predicts; you can only trust that it calculated correctly.
This is a purely academic route, far from ordinary users, but LeCun is betting that true intelligence doesn't need to simulate the fall of every leaf; it only needs to understand the causality of "the wind will blow the leaves off".
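To make the latent-space idea concrete, here is a deliberately toy sketch, not LeCun's actual JEPA architecture: an encoder maps a high-dimensional observation into a small abstract latent, and the predictor forecasts the next latent directly, so no pixels are ever generated. All dimensions, weights, and function names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a large "pixel" observation vs. a much smaller latent.
OBS_DIM, LATENT_DIM, ACTION_DIM = 4096, 32, 4

# Toy linear weights (in a real system these would be learned).
W_enc = rng.normal(scale=0.01, size=(LATENT_DIM, OBS_DIM))
W_pred = rng.normal(scale=0.01, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))

def encode(obs):
    """Map a high-dimensional observation to an abstract latent state."""
    return np.tanh(W_enc @ obs)

def predict_next_latent(z, action):
    """Predict the NEXT latent directly -- pixel detail is discarded."""
    return np.tanh(W_pred @ np.concatenate([z, action]))

obs_t = rng.normal(size=OBS_DIM)    # observation now
obs_t1 = rng.normal(size=OBS_DIM)   # observation after acting
action = rng.normal(size=ACTION_DIM)

z_pred = predict_next_latent(encode(obs_t), action)
z_true = encode(obs_t1)

# Training would minimize error in latent space, never pixel reconstruction.
loss = float(np.mean((z_pred - z_true) ** 2))
print(loss >= 0.0)  # True
```

The point of the sketch is the asymmetry: the prediction target lives in a 32-dimensional latent, not a 4096-dimensional image, which is also why such a model's output is "incomprehensible to humans" in the way described above.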
Another route comes from Fei-Fei Li's World Labs. Fei-Fei Li believes that intelligence must be based on an explicit understanding of three-dimensional space. Her Marble model can generate an editable and navigable 3D world from a photo or a piece of text, and users can freely move the perspective inside.
World Labs also open-sourced the rendering engine Spark 2.0, enabling ordinary browsers to smoothly load hundreds of millions of 3D points.
An honest evaluation is that Marble is good at reconstructing the appearance of space, but its understanding of what will happen in space is relatively weak.
You can enter the room it generates, but you can't push the chair inside or knock over the cup on the table. It's a reproducer of a static world, not a simulator of dynamic physics.
The most bustling camp is the generative faction. Google's Genie 3, Alibaba's Happy Oyster, and Tencent's HY-World 2.0 all belong to this category.
Their logic is that as long as the generated images are realistic enough and the interaction is smooth enough, the physical laws will naturally be learned.
Alibaba has an interesting feature in Happy Oyster called the director mode, where users can input text commands at any time during video playback to change the plot direction and switch the camera angle. Tencent is more practical, directly outputting 3D assets that can be edited again, allowing game developers to import them directly into the Unity or UE engine.
But these products have a common weakness: Long-term consistency and physical accuracy are still unstable.
Genie 3's demos are stunning, but the picture starts to distort after a few minutes. Alibaba's roaming mode currently supports only one minute of continuous movement; what happens beyond that, the company hasn't said.
Tencent's 3D assets look good within a single scene, but their strengths lie mainly in scene completeness and fidelity to the input images. These are "looks right" metrics, not "physically correct" ones.
Finally, there's a special player: NVIDIA. The Cosmos platform doesn't produce world models; it produces "tools for producing world models".
The data processing pipeline, video tokenizer, and pre-trained basic model are all available for free download. Jensen Huang has a clear plan: No matter which route ultimately wins, training and inference will require NVIDIA's GPUs.
This is the smartest business, not betting on the direction but on computing power.
So, which of these world models live up to their names? A key technical standard is that a true world model must be "action-conditioned", that is, when an action is input, the model should be able to output the change in the world state.
If you press "W" on the keyboard, the perspective in the picture should move forward; if you give a robot a grasping instruction, the model should predict the change in the object's position.
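The "action-conditioned" criterion can be stated as an interface: given a state and an action, the model must return the next world state. The grid-world below is a deliberately trivial stand-in for a learned model; the class name, method name, and hard-coded dynamics are our own illustration, not any product's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    x: int
    y: int

class ToyWorldModel:
    """Minimal action-conditioned model: step(state, action) -> next state.

    A real world model would learn this transition function from data;
    here the dynamics are hard-coded just to make the interface concrete.
    """
    MOVES = {"W": (0, 1), "S": (0, -1), "A": (-1, 0), "D": (1, 0)}

    def step(self, state: State, action: str) -> State:
        dx, dy = self.MOVES.get(action, (0, 0))
        return State(state.x + dx, state.y + dy)

model = ToyWorldModel()
s = State(0, 0)
s = model.step(s, "W")  # pressing "W" moves the viewpoint forward
print(s)                # State(x=0, y=1)
```

Anything that exposes only a render-and-look interface, with no `step`-like transition conditioned on an action, fails this test no matter how good it looks.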
By this standard, Fei-Fei Li's Marble doesn't quite qualify: users can only watch, not act. It's more a 3D reconstruction tool than a world simulator.
Although Google's Genie 3 and Alibaba's Happy Oyster support interaction, their physical accuracy is questionable. Tencent's HY-World 2.0 outputs static assets and doesn't involve dynamic prediction.
In other words, almost no one in the current market meets the standard of a "perfect physical world simulator". Each company has chosen a presentable and commercializable entry point within its capabilities.
This isn't wrong in itself, but the problem is that everyone is using the vague term "world model" to package themselves, making the outside world think they've solved all the problems.
III. The deliberately avoided ambiguous areas
Just reading the press releases of various companies, you'd think that world models are on the verge of large-scale implementation, but some overlooked details paint a very different picture.
The data problem is the most prominent. Training a true world model requires a massive amount of "observation, action, result" triples, but there's no such ready-made dataset in reality.
Some use game data, which has perfect action labels, but the physics in games is simulated by the engine, not real physics.
Some use first-person human videos, which are closest to the real world, but there are no action labels in the videos, and the head movements and hand actions of humans are entangled, making it impossible for the model to tell who is moving.
Some use real robot teleoperation data, which has the highest fidelity, but it might cost tens of thousands of dollars to collect one hour of data, and it's impossible to scale up.
This means that each world model has an inherent "ability boundary".
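The training unit the article describes can be written down as a simple record type. The field names and the `source` tag below are illustrative, not any real dataset's schema; the three entries mirror the three data sources and their trade-offs discussed above.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Transition:
    """One (observation, action, result) triple for world-model training."""
    observation: Any        # e.g. an image frame or sensor reading
    action: Optional[Any]   # e.g. a keypress, joint command, steering angle
    result: Any             # the observation after the action took effect
    source: str             # "game", "human_video", or "teleoperation"

dataset = [
    # Game data: perfect action labels, but engine physics, not real physics.
    Transition("frame_t", "jump", "frame_t+1", source="game"),
    # First-person video: real physics, but no action label is recorded.
    Transition("ego_frame_t", None, "ego_frame_t+1", source="human_video"),
    # Teleoperation: highest fidelity, but very expensive to collect at scale.
    Transition("camera_t", "gripper_close", "camera_t+1", source="teleoperation"),
]

labeled = [t for t in dataset if t.action is not None]
print(len(labeled))  # 2 -- the human-video triple lacks an action label
```

The filter at the end is the whole data problem in miniature: the source closest to real physics is exactly the one missing the action field a world model needs.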
The evaluation vacuum is another problem. If you open the official website of any world model company, you can almost always see the slogan "Ranked first on the global authoritative evaluation list".
The problem is that these evaluation lists themselves are immature. Some focus on visual realism, some on physical accuracy, and some on task completion rate. A model that ranks first on the visual list might rank last on the physical list.
This lack of unified standards lets each company tell its own story. Outsiders can't tell whether these "first place" claims refer to different categories of the same leaderboard or are carefully framed marketing slogans.
There's also a deliberately avoided "impossible triangle".
World models face three mutually restrictive indicators: spatial scale, visual fidelity, and real-time interactivity.
You can't simultaneously achieve "a large world, clear pictures, and smooth interaction". Fei-Fei Li's Marble is the best example: Version 1.1 has good image quality but a limited spatial range, while Version 1.1-Plus can generate large scenes but the image quality is blurry.
Matrix-Game 3.0 from Kunlun Wanwei can achieve real-time generation at 40 FPS at 720p, but the style and complexity of its demo scenes are very limited.
Almost no product will actively admit its shortcomings. They prefer to show demonstration videos under optimal working conditions and hide the failures under extreme conditions. This selective display is creating a dangerous bubble.
Finally, the capital frenzy has also brought new speculative risks.
A notable phenomenon is that capital has shifted from chasing "veterans from big tech companies" to betting on young scholars from top universities. The two founders of Inverse Matrix Technology, one born in 1998 and the other in 2004, are from Peking University, and their first-round financing exceeded $10 million.
Their technical route is "reinforcement learning + world model", and currently, they only have papers, no products. This doesn't mean that young people can't do it, but in the chaotic paradigm period, capital is willing to pay a very high premium for the possibility of "defining the next-generation technology".
But most of these laboratory projects ultimately fail to cross the gap from paper to product. Even Turing Award winner Yann LeCun admits that commercialization will take years; how much harder will it be for newly graduated doctoral students?
IV. Conclusion
The goal of world models is to enable AI to predict and even intervene in the physical world. So, if AI's prediction is wrong, who will take the responsibility?
Imagine a scenario: The world model of an autonomous driving car "imagines" a non-existent obstacle in the simulation, causing the vehicle to brake suddenly and be rear-ended by the car behind.
Should the blame fall on the algorithm engineer, or on the provider of the simulation data?
Imagine another scenario: The world model of an industrial robot wrongly predicts the movement trajectory of a part, crashing the entire production line. What is the insurance company's claim settlement standard?
In an even more extreme scenario: Someone uses a world model to generate a realistic fake 3D disaster video, causing panic on social media. Does the platform have a review obligation? How does the law define this "confusion between virtual and real" harm?
Currently, no company or country has given a clear answer to these questions. The ethical framework and legal boundaries of world models are far behind the development speed of technology.
When capital and the media focus on "who can create the most realistic virtual world", a more fundamental question has been put on hold: Are we really ready?
This might be the most underestimated variable in the world model track. It's not computing power, not data, not algorithms, but responsibility.
This article is from the WeChat official account "Intelligent Machine Island", author: Huo Rujun, published by 36Kr with authorization.