Tesla, Huawei, and New Entrants in a Decisive Battle: The World Model War

汽车公社 (Auto Commune), 2025-09-12 10:42
The intelligent driving circle is in an uproar over the "world model".

Blame the "World Model" (WM). As a reader, can you tell WM, WEWA, VLM, and VLA apart?

Indeed, after the "end-to-end" large model became popular, the emergence of the "World Model" has enriched and complicated the concept of high-end intelligent driving. However, problems have arisen. What exactly counts as a real "World Model"? And what's the relationship with VLA?

Recently, someone has been "exposing fakes". "I don't know which Chinese peers have truly developed VLA instead of a deformed version. From what I've seen, some companies have created a grafted VLA. As far as I know, our company is the only one in China that has truly developed VLA."

These were the words spoken by He Xiaopeng during a group interview after the launch of the new XPeng P7 on August 27th. Although he didn't specify the target, Li Auto was the only company that had announced the mass production of VLA in vehicles before XPeng.

Moreover, some are dissatisfied with both VLA and the World Model and say they are developing WEWA instead.

On the same day, Jin Yuzhi, the CEO of Huawei's Intelligent Automotive Solution Business Unit, stated, "We won't follow the VLA path. We believe this path may seem like a shortcut, but it's not the path to true autonomous driving."

The reason is that "Huawei places more emphasis on WA, which is World Action, skipping the Language (L) part... It directly controls the vehicle through information input from Vision instead of converting various information into language and then using a large language model to control the vehicle."

So, we need to answer two questions: Why did XPeng criticize its peers' VLA? And why did Huawei also speak negatively about VLA?

On the vehicle side or in the cloud?

Let's first talk about VLA (Vision-Language-Action Large Model). Before VLA became popular, end-to-end + VLM was one of the mainstream technical solutions in the intelligent driving industry. VLA can be understood as an evolution of end-to-end + VLM that addresses some of its limitations.

By the same token, the more advanced "native integration" mode of VLA builds, to some extent, on the experience accumulated with the "plug-in" mode of end-to-end + VLM.

However, although VLA has good interpretability, its spatial perception ability is weak, which is why Huawei skipped the Language (L) part and directly adopted WA (World Action).

While some oppose it, some support it. Yuanrong Qixing is very supportive of VLA. When Zhou Guang, the CEO of Yuanrong Qixing, released his company's VLA model, he said that "the lower limit of the VLA model has exceeded the upper limit of the end-to-end model." Zhou Guang also said that "voice control of the vehicle is just the basic ability of VLA. The most difficult parts are the Chain of Thought (CoT) and long-term sequential reasoning. These are the real core capabilities of VLA."

Now, let's talk about XPeng Motors. Why did it criticize its peers and claim that only it has a real VLA? The reason is hard to pin down, but the new XPeng P7 has a significantly upgraded hardware configuration, with three Turing chips, and XPeng plans to deliver VLA through an OTA update in September.

Of course, Li Auto is different from XPeng. It uses a dual-speed system on the vehicle side. The fast system is an end-to-end (E2E) model, and the slow system is a VLM (Vision-Language Model) deployed with 2 billion (2B) parameters.

Subsequently, Li Auto upgraded its system based on end-to-end + VLM. At NVIDIA's Spring GTC Conference in 2025, Jia Peng, the head of Li Auto's autonomous driving technology R&D, said that Li Auto had designed and trained a base model from scratch, which will support the mass production of the MindVLA (Vision-Language-Action) intelligent driving algorithm model in vehicles.

In fact, Li Auto's so-called base model is the World Model, which is deployed in the cloud. It takes "simulation training + scenario verification" as the core and serves as an "examination system" for the end-to-end + VLM system. Li Auto just uses the concept of VLA on the vehicle side for marketing.

Obviously, XPeng looks down on deploying VLA directly on the vehicle side because the parameter count that fits there is far from sufficient. XPeng's Li Liyun believes that the "end-to-end" model on the vehicle side is too small to learn some things effectively, but that a large cloud model can produce real intelligent "emergence".

Earlier, at the launch of the 2025 XPeng X9, Li Liyun, XPeng's vice president of autonomous driving, said that XPeng is developing an ultra-large-scale autonomous driving model with 72 billion (72B) parameters, namely the "XPeng World Base Model".

The XPeng World Base Model is a multimodal large model with a large language model (LLM) as its backbone network, trained on a large amount of driving data. It has the capabilities of visual understanding, chain reasoning, and action generation. XPeng's approach is to distill the cloud base model into a smaller model and deploy that smaller model on the vehicle side: from the cloud to the vehicle.
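Distillation, in general, means training a small "student" model to imitate the outputs of a large "teacher". The sketch below shows one common textbook form: a KL-divergence loss between temperature-softened output distributions. It is not XPeng's training code. The function names, the toy logits, and the idea of scoring a discrete set of candidate maneuvers are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical scores for three candidate maneuvers: keep lane, brake, change lane.
teacher = [0.5, 3.0, 1.0]         # large cloud model (assumed values)
student_before = [1.0, 1.0, 1.0]  # untrained small model: no preference
student_after = [0.4, 2.8, 1.1]   # small model after imitating the teacher

print(distillation_loss(teacher, student_before))  # larger loss
print(distillation_loss(teacher, student_after))   # smaller loss
```

Training the student amounts to adjusting its parameters so that this loss shrinks across many driving samples; the student never needs the teacher's 72B parameters at inference time.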

Li Liyun also said, "The truth is that the simplest way is the best. Since we don't need to consider deployment for now, we use the simplest model, the purest architecture, and a large amount of data to achieve an intelligent 'emergence' that exceeds expectations. What may seem like a surprise in the current 'end-to-end' model will become a daily occurrence in the future. This is our biggest difference."

We can't ignore the fact that during the industry's evolution, it took a long time to evolve from two-stage end-to-end to one-stage end-to-end. Whether it's VLA or the World Model, they are all new methods in the trial-and-error stage, and there's no absolute right or wrong. The current disputes are actually due to competition.

Regarding these concepts, a Horizon executive stated at the HSD Experience Day in response to my question, "Whether it's end-to-end + VLM, VLA, or the World Model, in essence, they are all end-to-end. I think in China, people overemphasize new concepts."

The pros and cons of the "plug-in" approach

Speaking of which, who brought the "World Model" into intelligent driving? (The concept itself has been around for a long time.) It was Tesla: Elon Musk proposed the concept of the "World Large Model".

What's the function of this World Model? To achieve autonomous driving on all road conditions globally, Tesla embedded a large AI model between perception and decision-making, mainly to build a virtual environment for learning and verifying autonomous driving capabilities.

The approach is to first convert real-world data into a virtual environment, the so-called "reconstruction". The virtual environment then helps the system verify and optimize its capabilities under different conditions, that is, it "generates" data. This "plug-in" large AI model is closely connected to the decision-making, planning, and control stages.

In China, NIO was the first to propose this concept. At the 2024 NIO IN (NIO Innovation Day), Ren Shaoqing, the vice president of NIO's intelligent driving R&D, released the NIO World Model (NWM) and announced that NIO's intelligent driving has shifted from "perception-driven" to "cognition-driven".

Of course, although they are all called World Models, there are differences between Musk's WM, NIO's NWM, and Huawei's WEWA.

Specifically, NIO's World Model aims to build a parallel world engine directly on the vehicle side in one step. It uses a dual architecture of cloud training + vehicle-side inference and produces trajectory plans through a generative model (such as SORA). That is, it generates control instructions directly from raw sensor data, skipping the intermediate language layer (L).

Here's an aside. According to a more professional view, the World Model is video generation plus prompt control. There are four main types of video generation: those based on Generative Adversarial Networks (GAN), diffusion models, autoregressive models (basically transformers), and masked models.

Among them, well-known diffusion models include Stable Diffusion (SD) and Stable Video Diffusion (SVD). It's said that Tesla uses SVD. The well-known SORA is a composite model. The core components of the SORA model include DiT, VAE, and ViT (this is too technical, so we'll skip it).

NIO's vision is "no manual annotation required". The underlying logic is to integrate "perception-decision-control" into a unified generative model, and everything is completed instantaneously on the vehicle side.

However, this vision has a practical flaw. It requires extremely high computing power on the vehicle side, and the real-time optimization problem of generative models has not been fully solved. NWM was not fully rolled out until the end of May 2025. The revolution is not yet successful, and comrades still need to work hard!

The cloud + vehicle-side WEWA model proposed by Huawei works on the same principle as NIO's World Model. The cloud-based WE (World Engine) is like an "AI driving school", and the vehicle-side WA (World Action Model) is an "AI driving brain" using a one-stage end-to-end architecture.

In terms of computing power, the total parameter scale of Huawei's WA is equivalent to an 8-billion-parameter (8B) model, while the compute actually activated is equivalent to a 2-billion-parameter (2B) model. Huawei claims that vehicle-side computing power consumption is reduced by 75%. Note this figure and compare it with Li Auto's.
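The 75% figure is consistent with simple arithmetic on the two parameter counts. A minimal sketch, assuming per-inference compute scales linearly with the number of activated parameters (an assumption made here for illustration; the article does not say how Huawei arrived at its number), with a hypothetical helper name:

```python
def active_compute_reduction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of compute saved when only part of the model is activated.

    Assumes cost is proportional to activated parameters (illustrative only).
    """
    if total_params_b <= 0 or not 0 <= active_params_b <= total_params_b:
        raise ValueError("expected 0 <= active_params_b <= total_params_b, total > 0")
    return 1.0 - active_params_b / total_params_b


# 8B total parameters, 2B activated per inference (figures from the article).
reduction = active_compute_reduction(8.0, 2.0)
print(f"{reduction:.0%}")  # 75%
```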

Ultimately, the generative World Model is used to solve the data problem in intelligent driving because it can generate Corner Case data, letting the intelligent driving system optimize its perception and decision-making through the "state → action → reward" cycle in a virtual environment. This requires the vehicle side and the cloud to work together. Since each camp argues from a different perspective, the better course is to test these models in practice.
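To make the "state → action → reward" cycle concrete, here is a toy sketch. Everything in it is hypothetical: a random sampler stands in for the generative World Model, the corner case is a slower vehicle cutting in at short range, and the "optimization" is a crude search over candidate braking strengths. No vendor's system works this simply; the point is only the shape of the loop.

```python
import random
from dataclasses import dataclass

LEAD_SPEED_MPS = 10.0  # assumed constant speed of the vehicle that cut in
DT_S = 0.1             # simulation tick in seconds


@dataclass
class State:
    gap_m: float      # distance to the vehicle ahead
    speed_mps: float  # ego vehicle speed


def generate_corner_case(rng):
    """Stand-in for a generative world model: sample a short-range cut-in."""
    return State(gap_m=rng.uniform(8.0, 15.0), speed_mps=15.0)


def step(state, accel_mps2):
    """Advance the toy world one tick; return (next_state, reward, collided)."""
    speed = max(0.0, state.speed_mps + accel_mps2 * DT_S)
    gap = state.gap_m - (speed - LEAD_SPEED_MPS) * DT_S
    collided = gap <= 0.0
    reward = -100.0 if collided else 1.0  # +1 per safe tick, -100 for a crash
    return State(gap, speed), reward, collided


def rollout(brake_mps2, start, horizon=50):
    """Run the state -> action -> reward loop on one scenario; return total reward."""
    state, total = start, 0.0
    for _ in range(horizon):
        # Action: brake while closing in on the lead vehicle, otherwise hold speed.
        accel = brake_mps2 if state.speed_mps > LEAD_SPEED_MPS else 0.0
        state, reward, collided = step(state, accel)
        total += reward
        if collided:
            break
    return total


def pick_best_brake(candidates, n_scenarios=20, seed=0):
    """Keep the braking strength with the highest average reward."""
    rng = random.Random(seed)
    scenarios = [generate_corner_case(rng) for _ in range(n_scenarios)]

    def average_return(brake):
        return sum(rollout(brake, s) for s in scenarios) / n_scenarios

    return max(candidates, key=average_return)


best = pick_best_brake([0.0, -1.0, -4.0])
print(best)
```

In a real pipeline the hand-written braking rule would be a learned policy and the reward would drive gradient updates rather than a three-way comparison, but the cycle is the same: the generated scenario supplies the state, the policy picks the action, and the reward decides what is kept.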

Regarding VLA, a Horizon executive responsible for marketing said, "I may be more optimistic about the World Model. At the same time, we should always come back to one point: the ultimate consideration for adopting new technologies is the product's return. Because all these things ultimately boil down to an end-to-end model. If it can't bring returns, I won't use it."

Another practical point is that "for all new methods, the first 50% of the returns are easy to obtain, but the last 50% are extremely difficult. However, if you haven't fully achieved the returns from end-to-end and try to obtain returns from other methods, there will be many problems. So, in the end, I think there's only one criterion for evaluation: Does this method bring high returns in the product?"

This article is from the WeChat official account "C Dimension". Author: Wang Xiaoxi, Editor-in-Chief: Beian, Editor: Wang Yue. Republished by 36Kr with permission.