Is the world model the ultimate answer for autonomous driving?
Text | Xiao Man
Editor | Li Qin
In the past two or three years, when automobile manufacturers talked about intelligent driving, they would always mention various novel technical terms.
The world model is the trendiest term in intelligent driving after end-to-end and VLA. Different companies have given it their own names: XPeng has launched the "world base model", NIO calls its version the "end-to-end world model", and Huawei calls its version the "World Action model" (WA). Horizon, Li Auto, DeepRoute.ai, and Momenta are also working on world models.
However, judging from their press conferences alone, it's hard to tell whether the world models they describe are the same thing. What problem does a world model actually solve, and where does it sit in the intelligent driving architecture?
From a broader perspective, the "world model" essentially aims to recreate the real world in the virtual world. It's a technology that enables artificial intelligence to understand the real world like humans, recognize physical laws, causal relationships between things, and environmental dynamics.
The world model is regarded by most scientists and technology companies as the key piece in the push toward "AI for the physical world". Fei-Fei Li, a professor at Stanford University, once pointed out that spatial intelligence will be the focus of AI in the next decade, and that the world model is the key technology for building spatial intelligence.
Scientists and technology companies at the forefront of the field are still exploring, but the Chinese automotive industry has already staked out the territory with a string of newly coined terms.
In fact, the "world models" discussed in the intelligent driving industry today differ mainly in terminology; the technical paths are much the same. They are a paradigm upgrade of the industry's existing simulation tools: a virtual world with higher fidelity, finer granularity, richer scenarios, and more freedom, used to test and verify end-to-end models. The goal is to train an end-to-end intelligent driving model that performs better and drives more like a human.
In other words, intelligent driving suppliers and automakers aren't really creating a complete digital physical world. They're building simulators under the banner of the world model.
Different companies may have different expectations for the world model. However, as far as we know, the world models in the intelligent driving industry are so far applied only in the cloud and haven't been deployed in vehicles.
The popularization of end-to-end highlights the shortcomings of simulators
In the past two or three years, the intelligent driving solutions of the leading players have shifted from rule-based stacks to AI-driven ones and have become unified, at least in form. Perception, prediction, and planning have been merged into a single network as far as possible, backed by larger models and more computing power. As automakers often say at their press conferences, "After end-to-end, intelligent driving is more like a human driving."
In practice, however, a counterintuitive phenomenon has emerged: a new OTA release after the move to end-to-end isn't necessarily better. It may even regress.
The core of the problem isn't that the model has become worse. It's that AI-driven models make evaluation and regression testing more difficult.
At that time, many intelligent driving practitioners believed that as long as the end-to-end model was trained well enough, the vehicle would drive like a human. The approach wasn't ineffective: the early performance of end-to-end models amazed many in the field. However, the "black box" nature of end-to-end models also brought side effects. When the model makes a mistake, it's hard for R&D staff to know why it happened, let alone prove that it won't happen again.
Whether a model is good isn't just a matter of whether it's large enough and trained on enough data. It depends more on how you discover, define, and verify problems. Manufacturers have gradually realized that they need a better simulator to evaluate performance during the model verification stage.
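To make the regression problem concrete, here is a minimal sketch of a per-scenario regression check between two model versions. The scenario IDs, scores, and the `tolerance` threshold are hypothetical values chosen for illustration, not figures from any manufacturer.

```python
from typing import Dict, List


def find_regressions(old_scores: Dict[str, float],
                     new_scores: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return IDs of scenarios where the new version scores worse than the
    old version by more than `tolerance` (an assumed noise margin)."""
    regressed = []
    for scenario_id, old in old_scores.items():
        new = new_scores.get(scenario_id)
        if new is not None and new < old - tolerance:
            regressed.append(scenario_id)
    return sorted(regressed)


# Hypothetical per-scenario scores in [0, 1] for two OTA versions.
old_scores = {"cut_in_01": 0.92, "jaywalk_03": 0.88, "unprotected_left_07": 0.75}
new_scores = {"cut_in_01": 0.95, "jaywalk_03": 0.80, "unprotected_left_07": 0.76}

regressions = find_regressions(old_scores, new_scores)
print(regressions)  # ['jaywalk_03']
```

A check like this is only as good as the scenarios behind it, which is why the quality of the simulator matters so much.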
Most of the leading players have built world models as simulators. In 2025, to let its VLA run reinforcement learning in a simulation environment, Li Auto proposed a driving world model that includes the trajectories of the ego vehicle and other vehicles and acts as a scoring teacher. XPeng has publicly mentioned only the "world base model", a technical term that's essentially unrelated to the world model; according to 36Kr Auto, however, XPeng is also using a world model in simulation testing to evaluate the algorithmic capabilities of new model versions.
The popularization of end-to-end has exposed the shortcomings of traditional simulators. "Before end-to-end became popular, the verification cost wasn't that high, and we could verify the system in segments. Now that it's end-to-end, we can't verify the system in segments anymore, and that's when the problems with the simulator become prominent," said an R&D engineer in the industry.
In the rule-based era, automakers usually ran simulations for two purposes. One was to reproduce the problems behind mid-drive takeovers by replaying the relevant segments from road tests. The other was to enrich corner-case data by scripting scenarios in the simulator, such as typical intersections, jaywalking pedestrians, and vehicles cutting in, and running the system through them.
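The scripted-scenario approach can be sketched as a simple parameter sweep. The `CutInScenario` fields and the specific values below are hypothetical and only illustrate how a hand-written scenario template gets expanded into variants.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class CutInScenario:
    """Hypothetical parameters for a scripted cut-in scenario."""
    ego_speed_kph: float           # speed of the vehicle under test
    cut_in_gap_m: float            # gap at the moment the other car cuts in
    cut_in_speed_delta_kph: float  # other car's speed relative to the ego vehicle


def build_cut_in_sweep(ego_speeds: Sequence[float],
                       gaps: Sequence[float],
                       speed_deltas: Sequence[float]) -> List[CutInScenario]:
    """Expand every combination of the given parameter values into a scenario."""
    return [CutInScenario(v, g, d)
            for v, g, d in product(ego_speeds, gaps, speed_deltas)]


scenarios = build_cut_in_sweep([40.0, 80.0], [5.0, 10.0, 20.0], [-20.0, 0.0])
print(len(scenarios))  # 12
```

Every variant here is still bounded by what an engineer thought to parameterize, which is the limitation the following paragraphs describe.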
At that time, simulators mainly played the role of a "magnifying glass". After end-to-end, however, it's difficult to attribute responsibility to individual parts of the model, and it's also difficult to systematically generate more detailed and controllable corner cases. It's even more difficult to support the large-scale closed-loop verification that end-to-end models require, and that's why the world model was introduced.
In the end-to-end era, the world model is the "coach" for intelligent driving models
"Currently, there's a certain gap between the level of world models of domestic automobile companies and that of Tesla, but the gap is less than a year," said an industry insider.
Tesla doesn't use the term "world model" but rather "world simulator" (first mentioned by Ashok Elluswamy, vice president of Tesla's Autopilot, at ICCV last year). The simulator is trained on Tesla's massive self-built dataset and generates future states from the current state and the next action. It thus forms a closed loop with the vehicle-side end-to-end foundation model, which allows the model's real-world behavior to be evaluated.
Figure: Tesla's neural network closed-loop simulation
An industry insider pointed out that Tesla's approach is more like using neural networks to "fit" the world. Rendering is produced by the network's own computation, with as little explicit stacking of physical rules as possible. The asset library isn't fully predefined by humans but retains some probabilistic weighting and room for combination. The advantage of this approach is that the model can generalize better.
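The closed loop described above, in which a simulator produces the next state from the current state and the policy's action, can be sketched as follows. This is a toy illustration under stated assumptions: `toy_world_step` is a hand-written kinematic stand-in for a learned simulator, and `toy_policy` is a hypothetical rule, not anything Tesla or another company has published.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    gap_m: float           # distance to the lead vehicle
    ego_speed_mps: float   # ego vehicle speed
    lead_speed_mps: float  # lead vehicle speed (constant in this toy)


def toy_world_step(state: State, accel_mps2: float, dt: float = 0.1) -> State:
    """Stand-in for a learned simulator: next state from state and action."""
    ego_speed = max(0.0, state.ego_speed_mps + accel_mps2 * dt)
    gap = state.gap_m + (state.lead_speed_mps - ego_speed) * dt
    return State(gap, ego_speed, state.lead_speed_mps)


def toy_policy(state: State) -> float:
    """Hypothetical policy: brake when the time gap drops below 2 seconds."""
    time_gap_s = state.gap_m / max(state.ego_speed_mps, 0.1)
    return -3.0 if time_gap_s < 2.0 else 0.0


def closed_loop_rollout(initial: State,
                        policy: Callable[[State], float],
                        world_step: Callable[[State, float], State],
                        steps: int) -> List[State]:
    """Alternate policy and simulator: each action changes the next state."""
    states = [initial]
    for _ in range(steps):
        action = policy(states[-1])
        states.append(world_step(states[-1], action))
    return states


trajectory = closed_loop_rollout(State(30.0, 20.0, 10.0),
                                 toy_policy, toy_world_step, steps=100)
min_gap = min(s.gap_m for s in trajectory)
print(f"minimum gap over rollout: {min_gap:.1f} m")
```

The point of the loop is that the policy's own actions shape what it sees next, which replaying recorded data cannot reproduce.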
Most domestic automakers are taking a more "controllable" path. A supplier who spoke with 36Kr Auto said that Li Auto uses 3D Gaussian reconstruction, which is also one of the methods most automakers currently adopt.
Whatever the approach, in engineering terms the world model ends up in the same position: automakers use it as a "verification and counter-evidence system" for the end-to-end era. It replays, rewrites, and expands possible real-world driving scenarios in the cloud, checks whether the output of the vehicle-side large model is stable and reproducible, and turns "where it went wrong and why" into a traceable chain of evidence.
The world model plays the role of a coach, and a good coach can train excellent athletes. "As the cloud-based world model becomes stronger, the capabilities of the vehicle-side model it trains should, in theory, become stronger too," said an R&D engineer.
The core capabilities of the world model lie mainly in two aspects: one is digitally modeling and abstracting the physical world; the other is generating plausible imagination and prediction of the physical world on top of that modeling, such as predicting from a given image how the world will change.
The quality of the world model depends on whether it can generate sufficiently realistic and diverse data in the cloud. "If automakers only use collected real-world data for simulation, they're obviously not building a world model. They're just creating a process for replaying data," said a product manager at a supplier.
The world model needs to learn how the world operates from data about the physical world, so the quality of its training data significantly affects the quality of its output. Mao Jiming, head of the Jijia Vision product line, noted that "for generative models like the world model, the generated results will ultimately follow the feature distribution of the input data. In commercializing the world model, we found that if the data quality is only 60 points, the quality of the data the world model generates from it may only be 55 points."
With a world model, automakers running simulations in the cloud can generate the scenarios they need along any dimension, without limit, and generate videos as training data on instruction. "The efficiency is much higher than collecting real-world data and then training. The model iteration speed will also be far ahead," said an R&D engineer at a supplier.
However, these are all ideal results. "Compared with the simulators previously used in intelligent driving, or with having no simulation at all and verifying only on offline-collected data, the world model is already a significant upgrade, but it's still far from the ideal simulator."
The world model algorithm isn't mature yet, and there are many "hallucinations"
The industry is generally in the "early stage" now.
An R&D engineer at an automaker told 36Kr Auto that domestic manufacturers can generate video clips of up to 30 to 60 seconds with the world model, but the consistency of dynamic objects is poor. There are significant problems in both spatio-temporal consistency and multi-view consistency.
The underlying layer of the world model is a generative model, and generative models inherently carry the risk of "hallucinations". "The most difficult part of the world model currently is how to ensure that the generated things are real. If it generates a person, how to ensure that their behavior and trajectory are possible in the real world," said a product manager of a supplier. "If the world model generates incorrect results, the model will learn wrong things, which will lead to very poor performance of the model deployed on the vehicle side."
An extreme example: if the cars generated in the cloud all move sideways, the model will learn that a car at the front left may suddenly shift to the front right. During actual driving, the model may then brake in response.
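One simple way to catch the sideways-car failure is a kinematic plausibility check on generated trajectories. The sketch below assumes each generated vehicle comes with a pose sequence of position and heading; the `lateral_limit_mps` threshold is a hypothetical value, and a real filter would need many more checks than this one.

```python
import math
from typing import List, Tuple

Pose = Tuple[float, float, float]  # (x in m, y in m, yaw in rad)


def max_lateral_speed(poses: List[Pose], dt: float) -> float:
    """Largest speed component perpendicular to the vehicle's heading."""
    worst = 0.0
    for (x0, y0, yaw0), (x1, y1, _) in zip(poses, poses[1:]):
        vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
        # Project the velocity onto the vehicle's left-pointing axis.
        lateral = -vx * math.sin(yaw0) + vy * math.cos(yaw0)
        worst = max(worst, abs(lateral))
    return worst


def is_plausible(poses: List[Pose], dt: float = 0.1,
                 lateral_limit_mps: float = 1.0) -> bool:
    """Reject trajectories where a car slides sideways faster than the limit."""
    return max_lateral_speed(poses, dt) <= lateral_limit_mps


# Both cars face along +x (yaw = 0) and move 1 m per 0.1 s step.
forward = [(i * 1.0, 0.0, 0.0) for i in range(10)]   # moves along its heading
sideways = [(0.0, i * 1.0, 0.0) for i in range(10)]  # slides across its heading

print(is_plausible(forward), is_plausible(sideways))  # True False
```

Filtering of this kind only removes obvious violations; subtler errors in behavior and interaction are much harder to detect.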
If a simulator can't approximate the key causal relationships of the real world, such as the effect of slippery roads on braking distance, the probability of falsely detecting stationary objects under backlight, and the way other vehicles negotiate lane changes, the "corner cases" it generates may be false. Optimizing against false problems wastes R&D resources on illusions.
Many people believe the bottleneck of the world model lies in data and computing power. However, Xia Zhongpu, former head of the end-to-end model for assisted driving at Li Auto, agrees more with Yann LeCun's view: "There hasn't been a major breakthrough in the world model algorithm. The self-supervised training of image models hasn't found a smooth paradigm like that of language models."
One reason why language models can be rapidly scaled up is that language itself has a high information density, and each word carries clear semantic constraints. In contrast, the information density of images is low, and for "driving decisions", only a very small part of the information in an image is useful.
For example, the model doesn't need to predict the trajectory of a car far behind or the changes in distant buildings. These are all noise. However, it must predict whether the car ahead in the current lane will suddenly brake hard, whether the adjacent car will cut in, and whether a pedestrian will suddenly cross the road. The model first needs to know "where to focus its attention".
"Currently, intelligent driving algorithms can't extract enough image information useful for driving," said Xia Zhongpu. An image may have millions of pixels, but only about 20 pixels are related to decision-making, and the rest are noise. The model first needs to learn to extract that 1‰ or even 1‱ of effective signal from the noise before it can talk about how to organize the signal into a structure for reasoning and prediction.
In Xia Zhongpu's view, since the world model algorithm hasn't made a breakthrough, it's too early to talk about whether there's enough data or how much computing power is needed. And because there hasn't been a clear breakthrough in the underlying technology, automakers' investment is mostly exploratory, and even some automaker executives are unsure about it.
Once the world model matures and computing power can support it, it could be deployed on the vehicle side. "Currently in China, the world model is basically used as a simulation system, and the understanding of the technology at the intelligent driving decision-making level isn't deep enough," said Xia Zhongpu.
This also explains an apparent contradiction: why every company is talking about the world model while users don't feel much difference. Most companies' world models are still in the first stage, "training and verification", and haven't entered the second stage, "supporting decision-making and planning".
"Deploying the world model on the vehicle side is the most difficult," said Xia Zhongpu.
Currently, no company has deployed the world model on the vehicle side. He also pointed out: "Use large-model methods to model the physical world, predict how the world will develop and change through interaction with it, and then steer the world in a favorable direction through decision-making. If the world model can reach this level, the problems of autonomous driving and robotics can be solved."