The ultimate outcome of intelligent driving: The battle of "powerful brains" between VLA and WA

New Entropy · 2025-09-10 20:14
The "final whistle" in the field of intelligent driving has not yet sounded. The real end-game belongs to those players who can integrate "talking" and "imagining" into "thinking".

When a Li Auto i8 spots a washed-out road shoulder on a mountain road in heavy rain, decelerates smoothly, and plans a detour, and when an XPeng P7 Ultra, relying on visual sensors alone, avoids both a stray cat crossing the road and a truck that suddenly changes lanes, the intelligent driving industry has quietly reached the tipping point of a technological revolution.

The market elimination round for new energy vehicles is more than halfway through. From battery range and charging speed to cockpit intelligence, fierce rivalry has long since turned the market into a red ocean. Even intelligent driving capability is shifting from a contested “bonus item” to a “survival item” that decides the fate of automakers.

In August this year, Li Auto, XPeng, and DeepRoute.ai announced, within two weeks of one another, that they were putting the VLA (Vision-Language-Action) large model into vehicles. The planning frame rate jumped from 10 Hz to 20 Hz, and end-to-end latency was compressed to within 100 ms. Around the same time, at its Songshan Lake laboratory, Huawei was demonstrating its WA (World Action) world model: in a simulator, a lidar-equipped sedan continuously “imagined” the next five seconds and extricated itself from a dead-end scenario combining heavy rain, traffic cones, and a tricycle traveling against traffic.

▲ Image/Screenshot from Xiaohongshu

Two different routes lead to the same endgame. VLA lets cars “speak”; WA lets cars “imagine”. Whoever first turns “speaking” into “thinking” will win the right to serve in the final stage of the new energy vehicle elimination round.

The Post-End-to-End Era

In the early days of the intelligent driving industry, hardware was undoubtedly the core of competition. Automakers knew that to give vehicles intelligent driving capabilities, they first had to make them “see” clearly, “hear” accurately, and “react” quickly. So they poured capital and effort into hardware such as sensors and chips.

Sensors are the “eyes” and “ears” of a vehicle, perceiving the surrounding environment. Lidars, cameras, and millimeter-wave radars each have their own strengths and weaknesses, and automakers select and combine them according to their technological route and cost budget.

Chips are the “brain” of a vehicle, responsible for rapidly processing the data collected by the sensors and making decisions based on it. High-performance chips supply the computing power needed to run complex algorithms, which in turn enables more advanced intelligent driving functions. Chip giants such as NVIDIA and Intel have kept making breakthroughs in the computing power and power consumption of their autonomous driving chips, giving strong support to the development of intelligent driving and naturally becoming automakers' preferred partners.

▲ Image/NVIDIA's flagship intelligent driving chip Thor

Under the earlier logic of the hardware race, however, automakers generally believed that “the number of sensors determines perception ability”. This approach quickly ran into the twin problems of high cost and low efficiency. Take lidar as an example. In 2020, a high-performance lidar cost more than $10,000. For a model equipped with three lidars, the hardware cost alone rose by more than $30,000, which pushed the prices of early intelligent driving models to generally over $500,000 and kept them out of the mainstream market.

The early version of the XPeng P7 carried two lidars and was priced $80,000 above the otherwise identical version without lidar. After launch, its monthly sales stayed below 3,000 units for a long time, and only picked up once a version with a simplified lidar configuration was introduced.

In 2019, as a pioneer of the end-to-end route, Tesla opened up a new path for intelligent driving technology. The core idea is to train models on large amounts of real road-test data so that the vehicle goes directly from sensor input to control output, which allows the autonomous driving technology to iterate rapidly.

Tesla used its large fleet and extensive user base to collect vast amounts of real road-test data covering a wide range of road conditions, weather conditions, and driving scenarios. By analyzing and training on this data, it continuously optimized its autonomous driving model and ultimately delivered advanced functions such as navigation-assisted driving, automatic lane changing, and automatic parking.

After seeing the success of the end - to - end route, domestic automakers immediately followed suit. They increased their investment in data collection and model training, hoping to gain a foothold in this intelligent driving competition.

However, the end-to-end route is not perfect. It has obvious limitations in long-tail scenarios, such as pedestrians appearing suddenly, vehicles violating traffic rules, and road conditions in bad weather. Because these scenarios occur rarely in real road tests, the end-to-end model has too little data to learn them thoroughly, and it often struggles to make accurate judgments and decisions when they arise.

The Lightning Counterattack of VLA

The limitations of the end-to-end route have laid the groundwork for the rise of the VLA route.

At the end of 2023, Li Auto was the first to propose the concept of VLA technology. Its core is to integrate three modalities: vision, language, and action, enabling the intelligent driving system to “observe, reason, and make decisions” like a human being.

Unlike the “data mapping” of the end-to-end approach, a VLA system converts what it perceives visually into language descriptions, performs logical reasoning over them with a language model, and finally outputs concrete action instructions.
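The three-stage flow described above can be pictured with a toy sketch. Everything in it is hypothetical and for illustration only: the `Scene` and `Action` types, the keyword rules in `reason` that stand in for a real language model, and the speed values are assumptions, not any automaker's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Scene:
    """Hypothetical perception output; a real system derives this from camera frames."""
    obstacle: Optional[str]  # e.g. "pedestrian", or None when the lane is clear
    distance_m: float        # distance to the obstacle in metres
    speed_kmh: float         # current speed of the ego vehicle


@dataclass
class Action:
    target_speed_kmh: float
    steer: str               # "keep" or "left" in this toy example


def describe(scene: Scene) -> str:
    """Vision -> language: turn the perceived scene into a sentence."""
    if scene.obstacle is None:
        return f"The lane ahead is clear; ego speed is {scene.speed_kmh:.0f} km/h."
    return (f"A {scene.obstacle} is {scene.distance_m:.0f} m ahead; "
            f"ego speed is {scene.speed_kmh:.0f} km/h.")


def reason(description: str) -> str:
    """Language reasoning: simple keyword rules standing in for a language model."""
    if "clear" in description:
        return "maintain speed"
    if "pedestrian" in description or "cat" in description:
        return "slow down and yield"
    return "slow down and change lane"


def act(decision: str, scene: Scene) -> Action:
    """Language -> action: map the decision to a control target."""
    if decision == "maintain speed":
        return Action(scene.speed_kmh, "keep")
    if decision == "slow down and yield":
        return Action(min(scene.speed_kmh, 10.0), "keep")
    return Action(scene.speed_kmh * 0.5, "left")


def vla_step(scene: Scene) -> Tuple[str, str, Action]:
    """Run one perceive -> describe -> reason -> act cycle."""
    description = describe(scene)
    decision = reason(description)
    return description, decision, act(decision, scene)


if __name__ == "__main__":
    for text in vla_step(Scene("pedestrian", 20.0, 40.0)):
        print(text)
```

The point of the sketch is the intermediate text: unlike a direct sensor-to-control mapping, each decision passes through a readable description that can be inspected.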

In the intelligent driving arena, the “first-mover advantage” was once regarded as an insurmountable barrier. Huawei launched its ADS (Advanced Driving System) as early as 2019, and its combination of lidar and high-precision maps once made it the industry's technological benchmark. Baidu Apollo began working on intelligent driving in 2013 and has invested more than $50 billion in total. Yet the emergence of the VLA route has allowed latecomers such as Li Auto and XPeng to stage a lightning counterattack, completely rewriting the industry's competitive landscape.

In everyday use, Li Auto vehicles continuously collect driving data, including road information, traffic conditions, and driving behavior. This data is not only large in volume but also spans a wide variety of scenarios, providing rich material for training the VLA model. By analyzing and mining it, Li Auto's R&D team gains in-depth insight into users' needs and driving habits and refines the VLA model accordingly to improve its accuracy and adaptability.

XPeng Motors has increased its investment in computing power and built a powerful cloud-based training cluster to support efficient training of the VLA model.

Its R&D team can use the cluster to run multiple model training tasks in parallel, which greatly improves training efficiency. The cluster is also scalable: computing resources and storage capacity can be added at any time as R&D needs grow, meeting the demands of continuous iteration and optimization of the VLA model.

Of course, not every player has the scale of Li Auto and XPeng. DeepRoute.ai, founded in 2019, delivered only 34,000 vehicles in 2024 but chose to “go all-in on VLA”. Its CEO, Zhou Guang, did the math: if 100,000 vehicles are produced, each vehicle runs 50 kilometers per day, and the data upload rate is 20%, 1.8 billion kilometers of data can be accumulated in a year, just enough to cross the “cold-start death valley”.
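The arithmetic behind this estimate can be checked with a short sketch. The function names are hypothetical, and the assumption of 365 driving days per year is ours. Under these assumptions, the quoted 1.8 billion kilometers matches the fleet's total annual mileage; if the 20% upload rate is applied as a simple multiplier, the uploaded share comes to about 365 million kilometers. The source does not spell out how the upload rate enters the calculation, so that reading is an assumption.

```python
def annual_fleet_km(vehicles: int, km_per_day: float, days: int = 365) -> float:
    """Total distance driven by the whole fleet in one year (365 days assumed)."""
    return vehicles * km_per_day * days


def uploaded_km(total_km: float, upload_rate: float) -> float:
    """Distance for which data is uploaded, treating the rate as a plain multiplier."""
    return total_km * upload_rate


if __name__ == "__main__":
    total = annual_fleet_km(vehicles=100_000, km_per_day=50)
    uploaded = uploaded_km(total, upload_rate=0.20)
    print(f"Driven per year:   {total / 1e9:.3f} billion km")
    print(f"Uploaded per year: {uploaded / 1e6:.0f} million km")
```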

To save time, DeepRoute.ai opened its DeepRoute IO 2.0 platform to five OEMs, sharing data and computing power in exchange for “vehicle installation volume”. On August 26 this year, DeepRoute.ai released the mass-production version of its VLA, claiming that the dual-chip “Orin-X + Journey 5” solution can achieve a 20 Hz planning frame rate while cutting the BOM cost to $5,500, 32% lower than Huawei's MDC 810. For companies with low annual sales and limited funds, this is almost the only “ticket” available. Zhou Guang put it bluntly: “VLA gives small and medium-sized automakers their first opportunity to replicate the experience of leading companies at low cost. The window is 18 months, and if you miss it, you're out.”
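As a rough sanity check on the quoted figures, the sketch below derives two numbers they imply: the time available per planning cycle at 20 Hz, and the baseline cost that a 32% reduction to $5,500 would correspond to. The function names are hypothetical, and the baseline is only what the claim implies arithmetically, not an independently sourced price for the MDC 810.

```python
def planning_period_ms(rate_hz: float) -> float:
    """Time budget of one planning cycle in milliseconds."""
    return 1000.0 / rate_hz


def implied_baseline(cost: float, reduction: float) -> float:
    """Baseline price implied by 'cost is `reduction` lower than the baseline'."""
    return cost / (1.0 - reduction)


if __name__ == "__main__":
    print(f"Cycle budget at 20 Hz: {planning_period_ms(20):.0f} ms")
    print(f"Implied baseline BOM:  ${implied_baseline(5500, 0.32):,.0f}")
```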

Is WA the Ultimate Solution?

In contrast to the hype surrounding VLA, Huawei and NIO have chosen a more “radical” technological route: WA (World Action), which is built around a world model.

The core logic of the WA route is to have the intelligent driving system build a “digital twin world” from cloud-based simulation data and thereby gain a deep understanding of the real world. Where VLA goes “from data to decision-making”, WA tries to make the system “understand the world first and then make decisions”. Many experts regard this idea as the “ultimate answer” for intelligent driving.
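The idea of “understanding the world first and then deciding” can be illustrated with a toy planner that imagines the next five seconds for several candidate actions and picks the best one. This is only a sketch under strong assumptions: the hand-written one-dimensional kinematics in `step` stand in for a learned world model, and the names, candidate accelerations, and safety margin are hypothetical. It is not Huawei's or NIO's system.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class State:
    position_m: float
    speed_mps: float


def step(state: State, accel_mps2: float, dt: float = 0.5) -> State:
    """Toy world model: predict the next state under a constant acceleration."""
    speed = max(0.0, state.speed_mps + accel_mps2 * dt)  # the car never reverses
    return State(state.position_m + speed * dt, speed)


def imagine(state: State, accel_mps2: float,
            horizon_s: float = 5.0, dt: float = 0.5) -> List[State]:
    """Roll the world model forward to imagine the next `horizon_s` seconds."""
    trajectory = []
    for _ in range(int(horizon_s / dt)):
        state = step(state, accel_mps2, dt)
        trajectory.append(state)
    return trajectory


def cost(trajectory: Sequence[State], obstacle_m: float,
         safety_margin_m: float = 5.0) -> float:
    """Infinite cost if the imagined path enters the safety margin, else reward progress."""
    if any(s.position_m >= obstacle_m - safety_margin_m for s in trajectory):
        return float("inf")
    return -trajectory[-1].position_m


def plan(state: State, obstacle_m: float,
         candidates: Sequence[float] = (-4.0, -2.0, 0.0, 1.0)) -> float:
    """Pick the acceleration whose imagined future has the lowest cost.

    If every candidate is unsafe, min() returns the first one, which is the
    hardest braking in the default candidate list.
    """
    return min(candidates, key=lambda a: cost(imagine(state, a), obstacle_m))


if __name__ == "__main__":
    start = State(position_m=0.0, speed_mps=10.0)
    print("Chosen acceleration:", plan(start, obstacle_m=40.0), "m/s^2")
```

The decision comes from simulated futures rather than from a previously seen example, which is the property the WA route is betting on for rare scenarios.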

Wang Jun, the R&D leader of Huawei's ADS, once used a vivid metaphor to explain the advantages of WA: “If the intelligent driving system is compared to a student, VLA prepares for exams by doing a large number of exercises and will be at a loss when encountering unseen questions; while WA first understands the knowledge points and can derive answers through rules no matter what new questions it encounters.” Li Bin of NIO also said in an internal email: “WA gives the car ‘imagination’ rather than ‘memory’.”

In theory, a WA system can fundamentally remove the data dependence of a VLA system, and it generalizes and adapts better, especially in long-tail scenarios.

However, these advantages are currently only theoretical. To achieve commercialization, the WA route still needs to overcome the three challenges of capital, data, and the balance between simulation and reality, which makes it temporarily exclusive to industry giants.

Building a digital twin world spans multiple fields, including hardware equipment, software R&D, and scenario modeling. Huawei has invested more than $2 billion in the WA route. The server cluster of its digital twin platform alone cost $500 million, and annual power and maintenance costs run as high as $80 million. NIO established a “World Model Laboratory” dedicated to R&D of the WA system. As of 2024, it had invested more than $1.5 billion in total, accounting for 40% of its total R&D expenses.

▲ Image/Huawei

Capital investment on this scale has shut out most small and medium-sized automakers. The founder of one new-force automaker admitted: “It's not that we don't want to develop WA; we simply can't afford it. Just building a basic digital twin scenario requires at least $500 million, which is equivalent to our three-year R&D budget.” By contrast, the R&D investment required by the VLA route is only one-tenth of that of WA, making it better suited to companies with limited funds.

VLA lets cars “speak” first; WA lets them “imagine” later. The former may be the focus of competition today, while the latter may be the finish line three years from now. For Li Auto and XPeng, VLA is the pass for a counterattack; for Huawei and NIO, WA is the cornerstone of their moat. As for the many brands with annual sales below 100,000 vehicles, all they can do is scramble aboard during the window period, even if that means becoming a “contract manufacturer”.

The “final whistle” in intelligent driving has not yet sounded. The real endgame belongs to the players who can integrate “speaking” and “imagining” into “thinking”. In this war without gunsmoke, only companies that can both meet current market demand and anticipate future technological trends will emerge victorious from the new energy vehicle elimination round.

This article is from the WeChat official account “New Entropy” (ID: baoliaohui), author: Fushen. It is published by 36Kr with authorization.