
It's already 2026. Is it really necessary to argue about which is better, VLA or the world model?

飞说智行 (Feishuo Zhixing) · 2026-03-20 17:14
VLA + world model may well be the path that leads to the realization of physical AI.

The intelligent driving industry has been in an uproar recently.

In the past few days, at SAIC Volkswagen's ID. ERA technology launch event, Momenta CEO Cao Xudong officially announced that the Momenta R7 Reinforcement Learning World Model is about to launch and will make its global debut on SAIC Volkswagen's new flagship SUV, the ID. ERA 9X.

After launching a one-stage end-to-end large model based on reinforcement learning last year, Cao Xudong has chosen the world model + reinforcement learning route for the new-generation model. Thus, alongside players represented by Huawei Qiankun, the world model route has gained another participant.

Meanwhile, at the recent GTC conference, Li Auto released its new-generation autonomous driving base model, MindVLA-o1.

According to Zhan Kun, head of Li Auto's base model team, the model builds an autonomous driving foundation for physical-world intelligence through six major technological innovations, enabling autonomous driving to see farther, think deeper, drive more steadily, evolve faster, and deploy more efficiently.

Over the past year or two, Li Auto has iterated its intelligent driving technology at a remarkable pace: from the end-to-end + VLM dual-system model launched in 2024, to last year's VLA driver large model that unified spatial understanding, language understanding, and action decision-making in a single framework, to this year's MindVLA-o1. That amounts to a new model generation every year.

XPeng Motors, also in the VLA camp, recently put its second-generation VLA into mass production, only four months after first unveiling it. Compared with the traditional VLA architecture, XPeng was the first to propose a new VLA architecture that removes the two explicit translation steps.

In the past two years, players such as Li Auto, XPeng, and DeepRoute.ai have gradually evolved their algorithm architectures from end-to-end to the VLA model architecture, while players such as Huawei Qiankun have chosen the world model architecture, which focuses more on understanding the real world.

Thus, the entire intelligent driving industry has begun to debate the pros and cons of VLA versus the world model, with supporters of each camp convinced that their route will become the industry's ultimate solution. After all, in theory each route has its own shortcomings.

With Momenta betting on the world model and Li Auto, XPeng, and DeepRoute.ai accelerating their optimization of VLA models, the debate has only intensified. In the view of Feishuo Zhixing, however, the two routes may not be in opposition at all.

1. Only Different Divisions of Labor, No Absolute Opposition

The traditional VLA faces obvious challenges.

First, aligning 3D spatial understanding, language-based reasoning, and the output of concrete driving trajectories is inefficient. Second, long-tail scenarios remain a problem. Finally, a VLA model typically embeds LLM capabilities, which brings high compute and memory costs.

To address these problems, Li Auto proposed MindVLA-o1, a natively multi-modal MoE Transformer. This gives the model the ability to train and align vision, language, and action in a unified way, along with strong generalization.

In terms of perception, they introduced a 3D ViT Encoder, which fuses LiDAR and visual data earlier and constructs a 3D spatial representation directly at the encoding stage, letting the model understand the physical spatial structure of the real world more naturally.

They also introduced a feed-forward 3DGS (3D Gaussian Splatting) representation to improve the model's understanding of the environment.

For high-level intelligent driving, and eventually autonomous driving, understanding the current environment is not enough; the system must also predict how the world will evolve. The industry's usual answer is a world model with tens of billions of parameters, but a model of that scale is hard to run on the vehicle, so the vehicle cannot obtain this "prediction" ability directly.

In response, Li Auto adopted a Predictive Latent World Model, while introducing next-state prediction as a self-supervised signal during training and retaining language ability for multi-modal reasoning.

The so-called Predictive Latent World Model is, in short, an extremely compressed world model. Inside it are not real-world images or point clouds, but abstract vectors produced by encoding.

During training, multi-modal perception data is first compressed into a latent representation. The model then interprets the current situation and rolls out future environmental changes entirely in latent space. Finally, these predictions are used to jointly train the model's in-built prediction and driving decision-making. Compared with prediction at the raw-data level, this is far faster and consumes much less compute.

As a result, the model gains the ability to "imagine" (the "Generative Multimodal Thinking" Zhan Kun described), while the compute cost is compressed to a level that on-vehicle hardware can support for real-time calls.
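The idea of predicting in a compressed latent space can be sketched numerically. This is a toy illustration only, not Li Auto's implementation: the random-projection "encoder", the linear toy dynamics, and all dimensions below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, LATENT_DIM, HORIZON = 512, 32, 8   # toy sizes: the latent is 16x smaller

# Frozen "encoder": a random projection standing in for a learned perception encoder.
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) / np.sqrt(OBS_DIM)
encode = lambda obs: W_enc @ obs

# Toy world dynamics in observation space (a slow linear decay).
A = np.eye(OBS_DIM) * 0.99

# Collect (z_t, z_{t+1}) pairs and fit a latent predictor by least squares --
# this plays the role of the "next-state prediction" self-supervised signal.
obs = rng.normal(size=(1000, OBS_DIM))
Z_t = obs @ W_enc.T
Z_next = (obs @ A.T) @ W_enc.T
W_pred, *_ = np.linalg.lstsq(Z_t, Z_next, rcond=None)

# "Imagination": roll the world forward entirely in latent space,
# never decoding back to images or point clouds.
z = encode(rng.normal(size=OBS_DIM))
trajectory = [z]
for _ in range(HORIZON):
    z = z @ W_pred
    trajectory.append(z)

print(len(trajectory), trajectory[0].shape)
```

Each rollout step costs a 32x32 multiply instead of simulating 512-dimensional observations, which is the whole point of predicting in latent space.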

To turn this on-vehicle thinking and imagination into a final driving trajectory, Li Auto designed the Unified Action Generation module, which rests on three main capabilities:

First, it adopts a mixture-of-experts model (VLA-MoE), with a dedicated Action Expert focused on generating high-precision driving trajectories. Second, it adopts Parallel Decoding, generating all trajectory points simultaneously rather than one by one; this matters greatly for long-horizon trajectory prediction.

To ensure the quality of the trajectories generated in parallel, they also introduced a Discrete Diffusion refinement step, letting the model iteratively optimize the trajectories over multiple rounds until they are spatially continuous, temporally stable, and consistent with the vehicle's dynamic constraints.
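The draft-then-refine pattern can be shown in a few lines. Everything here is an assumption for illustration: the noisy straight line stands in for the action expert's one-shot parallel output, and the neighbour-averaging loop is only loosely analogous to a diffusion-style refinement, not Li Auto's actual Discrete Diffusion method.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20  # trajectory horizon: number of future waypoints

# Step 1 -- "parallel decoding": draft all T waypoints in one shot instead of
# autoregressively. A straight line plus noise stands in for the model output.
target = np.stack([np.linspace(0.0, 10.0, T), np.zeros(T)], axis=1)
draft = target + rng.normal(scale=0.5, size=(T, 2))

def refine(traj, steps=8, alpha=0.5):
    """Step 2 -- iterative refinement: each round pulls every interior point
    toward the average of its neighbours, enforcing spatial continuity."""
    traj = traj.copy()
    for _ in range(steps):
        neighbours = (traj[:-2] + traj[2:]) / 2
        traj[1:-1] = (1 - alpha) * traj[1:-1] + alpha * neighbours
    return traj

smoothed = refine(draft)

# Roughness metric: total squared second difference (discrete acceleration).
roughness = lambda t: np.sum(np.diff(t, n=2, axis=0) ** 2)
print(roughness(draft), roughness(smoothed))
```

The key property is that the draft covers the whole horizon at once (fast), while the refinement rounds restore smoothness (stable), mirroring the fast/stable trade-off described above.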

In complex traffic environments, trajectory generation must be fast, stable, and able to cover different decision paths. Meeting all three requirements at once used to be difficult; Li Auto's combined approach addresses them together.

The model is built, and the next question is where to find the data.

In previous years, training intelligent driving algorithms relied mostly on real human driving data, but such data cannot cover all long-tail scenarios (corner cases).

Li Auto therefore built an extensible world simulator, upgrading the traditional per-scene, optimization-based reconstruction to a feed-forward reconstruction method, so the model can near-instantly generate large-scale, high-fidelity driving scenarios and support massively parallel training.

They also combined this feed-forward scene reconstruction with generative models, so the simulated environment can not only reconstruct real scenarios but also expand, edit, and generate new ones.

This extensible world simulator sits inside a closed-loop RL framework. The model can thus train not only on real data but also on scenarios rarely encountered in reality, exploring, optimizing, and iterating inside the simulator and progressively exhausting the long-tail problem space.

To get the model onto the vehicle, Li Auto adopted a Roofline-style analysis framework, establishing an accurate mapping between model accuracy and inference latency. By testing roughly 2,000 model configurations and verifying them on the NVIDIA Drive Orin and Thor platforms, they found the optimal balance between accuracy and latency.
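A configuration sweep of this kind can be mimicked in a few lines. Everything here is hypothetical: the search space, the analytic latency and accuracy formulas, and the 6 ms budget are invented for illustration, not measured on Orin or Thor.

```python
import itertools

# Hypothetical search space standing in for the ~2,000 tested configurations.
depths, widths, experts = [12, 16, 24, 32], [512, 768, 1024], [4, 8, 16]

def predict_latency_ms(d, w, e):
    """Made-up analytic cost model: latency grows with depth, width^2, experts."""
    return 0.04 * d * (w / 512) ** 2 + 0.3 * e

def predict_accuracy(d, w, e):
    """Made-up accuracy model: bigger is monotonically (but mildly) better."""
    return 0.80 + 0.02 * (d / 32) + 0.05 * (w / 1024) + 0.01 * (e / 16)

BUDGET_MS = 6.0  # assumed real-time latency budget on the target chip

# Enumerate every configuration, keep those within budget, take the most
# accurate survivor -- the essence of an accuracy/latency mapping search.
candidates = [
    (predict_accuracy(d, w, e), predict_latency_ms(d, w, e), (d, w, e))
    for d, w, e in itertools.product(depths, widths, experts)
    if predict_latency_ms(d, w, e) <= BUDGET_MS
]
best = max(candidates)
print(best)
```

With an analytic cost model in place of real hardware runs, sweeping thousands of configurations takes milliseconds, which is how an architecture search that once took months can collapse to days once the accuracy-latency mapping is trusted.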

According to Zhan Kun, this software-hardware co-design approach shortened model architecture exploration from several months to a few days, greatly improving the design efficiency and deployment speed of the on-vehicle VLA model.

From the above, it is clear that the MindVLA-o1 architecture largely resolves the challenges of the traditional VLA model. MindVLA-o1, together with the VLA data engine MindData, the multi-modal world model MindSim, and the reinforcement learning module RL Infra, forms the full picture of Li Auto's base model for physical-world intelligence.

It is worth noting that Li Auto's base model architecture includes both the world model and the VLA architecture, with the two playing different, complementary roles within the overall framework. And Li Auto is far from the only such case.

At the ICCV conference last November, Ashok Elluswamy, Tesla's vice president of autonomous driving, shared the latest progress on Tesla's algorithm architecture. Tesla has not only adopted techniques such as 3D Gaussian features and Chain of Thought (CoT) to improve data quality and model interpretability, but also built a closed-loop simulation system called the "Neural World Simulator".

Tesla's closed-loop simulation neural network model. Image source: Tesla AI

Evidently, Tesla's current algorithm R&D system also includes both a VLA model and a world model. Feishuo Zhixing has elaborated on this in the article "In-depth | Tesla is No Longer the 'Standard Answer' in the Intelligent Driving Industry", so it will not be repeated here.

In addition, both XPeng's mass-produced second-generation VLA model and the VLA base model DeepRoute.ai revealed at the recent GTC rely on the data closed-loop capabilities a world model provides, such as large-scale simulation training, scenario data generation, and driving behavior prediction and deduction.

These players integrate the two technological routes, VLA and the world model, to build their own algorithm base models, which is also preparation for a greater goal.

2. Physical AI, Autonomous Driving is Just the Starting Point

In the research and development of intelligent or autonomous driving, the industry's goal is the same: how to build a "digital brain" for machines that can operate in the real world?

Take driving a car as an example. After systematic training at a driving school and long driving practice, a person can judge the traffic scene ahead within a few hundred milliseconds and steer the vehicle through a safe evasive maneuver. For "veteran drivers", this ability even becomes muscle memory.

By contrast, as intelligent assisted driving has gradually spread, it has moved from merely "usable" to nearly "easy to use". But there is still a long way to go before it is a system people genuinely love to use, at the level of a human veteran driver.

At the recent GTC, Li Auto showed some of its progress on this "digital brain" journey.

Consider Li Auto's base model for physical-world intelligence described above: from visual perception, to world understanding and reasoning, to action decision-making, to continuous optimization through reinforcement learning, and finally to system efficiency and hardware collaboration. In Zhan Kun's view, it closely resembles an animal's brain.

Panoramic view of Li Auto's physical AI framework

Visual information first enters the visual cortex, is then reasoned over and planned in the prefrontal cortex, after which the motor cortex generates specific actions. Reinforcement learning proceeds through the dopamine system, and an efficient nervous system finally guides the muscles to complete the actions. It, too, is a closed loop.

This "digital brain" capability has already been demonstrated in Li Auto's intelligent assisted driving. Give the car an instruction such as "Park next to that orange car ahead," and the on-board system quickly parses the environmental semantics and generates the corresponding driving trajectory to complete the task.

In another demonstration by Zhan Kun, the same base model drove a robot to pick up a bottle and pour Yakult into a cup on the table. In other words, one base model can control not only vehicles but also robots.

XPeng and Tesla also have the same strategy.

According to He Xiaopeng, CEO of XPeng Motors, the second-generation VLA will not only bridge the path from L2 intelligent assisted driving to L4 autonomous driving but also be applied to the development of robots and flying cars.

In his view, robots and flying cars, like autonomous driving, belong to physical AI. Connecting these physical AIs at the algorithm level requires a shared base model and a complete set of AI infrastructure.

Accordingly, at the beginning of this year XPeng merged its autonomous driving center and intelligent cockpit center into a general intelligence center, led by Liu Xianming, formerly head of autonomous driving. He Xiaopeng regards this center as a new AI organization for cars plus robots.

As for Tesla, news that their humanoid robot and FSD have been