The Endgame of Intelligent Driving: VLA and WA Battle for the “Strong Brain”

The final whistle for autonomous driving technology has not yet sounded. The real endgame belongs to the players who can fuse “speaking” and “imagining” into “thinking”.

When the Li Auto i8 automatically detects a washed-out shoulder on a mountain road in heavy rain, brakes gently, and plans a detour; when the XPeng P7 Ultra, using only visual sensors, precisely avoids a stray cat crossing the road and a truck that suddenly merges, the autonomous driving industry has quietly reached a technological turning point.

The shakeout in the electric vehicle market is already well underway. From battery range and charging speed to cabin intelligence, competition has long since turned the market into a red ocean. Autonomous driving capability, too, is shifting from a contested “plus point” to a “survival point” that decides an automaker's fate.

In August this year, Li Auto, XPeng, and DeepRoute.ai announced within two weeks of one another that they would integrate VLA (Vision-Language-Action) large models into their vehicles. The planning frequency jumped from 10 Hz to 20 Hz, and end-to-end latency fell below 100 ms. At around the same time, Huawei demonstrated in its Songshanhu laboratory how a lidar-equipped car running the WA (World Action) world model continuously “imagines the next five seconds” in a simulated environment and works its way out of a difficult situation with heavy rain, traffic cones, and an oncoming bicycle.
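The frequency and latency figures above translate into a per-cycle time budget. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and comparing the 100 ms latency bound against the planning period is an illustrative framing rather than something the announcements describe.

```python
def planning_period_ms(frequency_hz: float) -> float:
    """Time available for one planning cycle, in milliseconds."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    return 1000.0 / frequency_hz


def cycles_within(latency_ms: float, frequency_hz: float) -> float:
    """How many planning cycles fit inside a given latency bound."""
    return latency_ms / planning_period_ms(frequency_hz)


if __name__ == "__main__":
    for hz in (10, 20):
        print(f"{hz} Hz -> {planning_period_ms(hz):.0f} ms per planning cycle")
    # A 100 ms end-to-end bound spans two planning cycles at 20 Hz.
    print(f"Cycles within 100 ms at 20 Hz: {cycles_within(100, 20):.0f}")
```

Doubling the planning frequency halves the time between successive plans from 100 ms to 50 ms, so the planner can revise its trajectory twice within the stated latency bound.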

▲ Image/Screenshot from Xiaohongshu

Two different paths, one goal. VLA enables cars to “speak”, while WA enables them to “think”. Whoever gets from “speaking” to “thinking” first will hold the initiative in the final phase of the electric vehicle shakeout.

The Era of End-to-End Solutions

At the beginning of the development of autonomous driving, hardware was undoubtedly the core of the competition. Automakers knew that for vehicles to be capable of autonomous driving, they must first “see”, “hear”, and “react quickly”. Therefore, they invested a lot of money and time in the development of sensors, chips, and other hardware.

Sensors are like the “eyes” and “ears” of a vehicle and collect information about the environment. Different types of sensors, such as lidar, cameras, and millimeter-wave radar, each have their own advantages and disadvantages. Automakers must select and combine them according to their technological strategies and budgets.

The chip is the vehicle's “brain”: it rapidly processes the data collected by the sensors and makes decisions based on it. High-performance chips provide the computing power needed to run complex algorithms and enable advanced autonomous driving functions. Chip giants like NVIDIA and Intel have steadily improved the performance and power consumption of their autonomous driving chips, which has strongly supported the development of autonomous driving and made them sought-after partners for automakers.

▲ Image/NVIDIA's flagship intelligent driving chip Thor

In the early, hardware-centric view of the competition, automakers assumed that “the number of sensors determines the perception ability”. However, this approach quickly ran into the double bind of high costs and low efficiency. Take lidar as an example: in 2020, a high-performance lidar cost over $10,000, so a vehicle with three lidar sensors carried an additional $30,000 in hardware costs alone. As a result, early autonomous driving vehicles generally cost over 500,000 yuan and struggled to enter the mainstream market.

The early version of the XPeng P7 with two lidar sensors cost 80,000 yuan more than the otherwise identical version without lidar. After its market launch, monthly sales remained below 3,000 vehicles for a long time. Sales only picked up after a version with reduced lidar equipment was introduced.

In 2019, Tesla, as a pioneer of the end-to-end strategy, opened a new path for the development of autonomous driving. The core concept of this strategy is to train models on large amounts of real-world driving data so that vehicles go directly from sensor input to control output, allowing the technology to develop rapidly.

Tesla used its large vehicle fleet and wide user base to collect a huge amount of real-world driving data covering a variety of road conditions, weather conditions, and driving scenarios. By analyzing this data and training on it, Tesla continuously optimized its autonomous driving models until a series of advanced functions such as navigation-assisted driving, automatic lane changing, and automatic parking became possible.

After Chinese automakers saw the success of the end-to-end strategy, they quickly followed suit. They increased their investments in data collection and model training to gain a strong position in this competition for autonomous driving.

But the end-to-end strategy is not flawless. It has obvious limitations in rare scenarios, such as pedestrians who appear suddenly, vehicles breaking traffic rules, and road conditions in bad weather. Because these scenarios seldom occur in real-world driving, the end-to-end model has too little data to learn from, and it often struggles to make accurate judgments and decisions in these situations.

The Rapid Catch-Up of VLA

The limitations of the end-to-end strategy laid the foundation for the rise of the VLA strategy.

At the end of 2023, Li Auto first introduced the concept of VLA technology. The core of this technology is to integrate the three modalities of vision, language, and action so that the autonomous driving system can “observe, infer, and decide” like a human being.

In contrast to the “data mapping” of the end-to-end strategy, the VLA system can convert the information captured by visual perception into linguistic descriptions, then draw logical inferences, and finally issue specific action instructions.
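The three stages described above (perception to language, language-level inference, action) can be pictured with a toy pipeline. This is only an illustrative sketch: a production VLA system uses a learned multimodal model, whereas every name here (`Detection`, `describe`, `reason`, the 30 m braking distance) is a hypothetical stand-in with hand-written rules.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical output of the vision stage for one object."""
    label: str
    distance_m: float
    in_path: bool


def describe(detections: List[Detection]) -> str:
    """Vision -> language: turn detections into a textual scene description."""
    if not detections:
        return "The road ahead is clear."
    parts = [
        f"{d.label} {d.distance_m:.0f} m ahead, "
        f"{'in' if d.in_path else 'outside'} our path"
        for d in detections
    ]
    return "; ".join(parts) + "."


def reason(detections: List[Detection],
           braking_distance_m: float = 30.0) -> Tuple[str, str]:
    """Language-level inference -> action: return (rationale, action)."""
    blocking = [d for d in detections
                if d.in_path and d.distance_m <= braking_distance_m]
    if blocking:
        nearest = min(blocking, key=lambda d: d.distance_m)
        return (f"The {nearest.label} is within braking distance.", "brake")
    return ("Nothing blocks the path.", "keep_speed")


if __name__ == "__main__":
    scene = [Detection("stray cat", 12.0, True),
             Detection("truck", 40.0, False)]
    print(describe(scene))
    rationale, action = reason(scene)
    print(rationale, "->", action)
```

The point of the intermediate text is that each decision comes with a human-readable rationale, which a direct sensor-to-control mapping does not provide.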

In the race for autonomous driving, the “first-mover advantage” was long regarded as an insurmountable barrier. Huawei released its ADS (Advanced Driving System) as early as 2019 and, thanks to its combination of lidar and high-precision maps, was the industry's technological standard for a while. Baidu Apollo began developing autonomous driving in 2013 and has invested over 50 billion yuan so far. However, the emergence of the VLA strategy allowed “latecomers” like Li Auto and XPeng to catch up rapidly and fundamentally change the industry's competitive landscape.

As Li Auto owners use their cars in daily life, the vehicles continuously collect driving data, including road information, traffic situations, and driving behaviors. This data is not only large in volume but also covers a wide variety of scenarios, providing rich material for training the VLA model. By analyzing and mining it, Li Auto's development team can better understand users' needs and driving habits and make targeted improvements to the VLA model's accuracy and adaptability.

XPeng, on the other hand, has invested in computing power and built a powerful cloud training cluster to support the efficient training of the VLA model.

The development team can run multiple model training jobs simultaneously on the cluster, which significantly increases training efficiency. The cluster is also scalable: computing power and storage can be expanded as needed to keep up with the continuous iteration and optimization of the VLA model.

Not all players are as large as Li Auto or XPeng. DeepRoute.ai, founded in 2019, delivered only 34,000 vehicles in 2024 but still decided to “go all-in on VLA”. CEO Zhou Guang once ran the numbers: with 100,000 vehicles, each driving 50 kilometers per day, and a data transfer rate of 20%, you can collect 1.8 billion kilometers of data within a year, which is just enough to overcome the “cold-start death valley”.
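Zhou Guang's back-of-the-envelope figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the reading of the “data transfer rate” as the share of driven kilometers that is uploaded are assumptions, not something the article spells out. Under those assumptions, the quoted 1.8 billion kilometers matches the fleet's raw annual mileage (100,000 × 50 × 365 ≈ 1.83 billion km), while applying the 20% rate on top would leave roughly 365 million km.

```python
def annual_fleet_km(vehicles: int, km_per_day: float,
                    upload_rate: float = 1.0, days: int = 365) -> float:
    """Kilometers of driving data a fleet yields per year.

    upload_rate is the assumed fraction of driven kilometers
    that is actually transferred back for training.
    """
    if not 0.0 <= upload_rate <= 1.0:
        raise ValueError("upload_rate must be between 0 and 1")
    return vehicles * km_per_day * days * upload_rate


if __name__ == "__main__":
    raw = annual_fleet_km(100_000, 50)
    uploaded = annual_fleet_km(100_000, 50, upload_rate=0.20)
    print(f"Raw fleet mileage:      {raw / 1e9:.3f} billion km/year")
    print(f"With 20% transfer rate: {uploaded / 1e9:.3f} billion km/year")
```

Either way, the order of magnitude depends almost entirely on fleet size, which is why a supplier with low volumes has to pool vehicles across several automakers.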

To buy time, DeepRoute.ai has opened its DeepRoute IO 2.0 platform to five automakers, sharing data and computing power in order to increase the “number of vehicles with the technology”. On August 26 this year, DeepRoute.ai released the mass-production version of its VLA and claimed that the “Orin-X + Journey 5” dual-chip system supports a planning frequency of 20 Hz and cuts the BOM cost to 5,500 yuan, which is 32% less than the Huawei MDC 810. For companies with low annual sales and limited financial resources, this is almost the only available “ticket”. Zhou Guang put it bluntly: “VLA gives small and medium-sized automakers their first chance to replicate the experience of leading companies at low cost. The time window is 18 months. If you miss it, it's over.”
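The 32% claim implies a cost for the comparison system that the article does not state. A minimal sketch of that back-calculation follows; the function name is hypothetical, and the result is only what the two quoted numbers imply, not a published price for the Huawei MDC 810.

```python
def implied_baseline_cost(reduced_cost: float, reduction: float) -> float:
    """Baseline cost implied by a cost that is `reduction` lower.

    reduced_cost = baseline * (1 - reduction)
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return reduced_cost / (1.0 - reduction)


if __name__ == "__main__":
    baseline = implied_baseline_cost(5_500, 0.32)
    print(f"Implied baseline BOM cost: about {baseline:,.0f} yuan")
    print(f"Implied saving per vehicle: about {baseline - 5_500:,.0f} yuan")
```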

Is WA the Ultimate Solution?

In contrast to the hype around VLA, Huawei and NIO chose a “more aggressive” technological strategy: WA (World Action), the world-model route.

The core concept of the WA strategy is that the autonomous driving system builds a “digital twin world” from cloud simulation data in order to gain a deep understanding of the real world. In contrast to VLA's “from data to decision”, WA tries to make the system “understand the world first, then make decisions”. Many experts regard this strategy as the “ultimate solution” for autonomous driving.
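“Understand the world first, then decide” can be pictured as rolling candidate actions forward in an internal model and picking the one with the best imagined outcome. The sketch below is a toy illustration under strong assumptions: a real world model is learned and far richer, whereas here a hand-written one-dimensional kinematic model stands in for it, and the action set, horizon, and function names are all hypothetical.

```python
HORIZON_S = 5.0   # how far ahead the car "imagines", in seconds
DT = 0.1          # simulation step, in seconds

# Candidate actions as constant accelerations in m/s^2 (illustrative values).
ACTIONS = {"brake": -4.0, "keep": 0.0, "accelerate": 2.0}


def imagine(speed: float, gap: float, obstacle_speed: float, accel: float):
    """Roll the toy model forward; return (collided, distance_travelled).

    gap is the distance to an obstacle ahead; obstacle_speed is its speed
    along our direction of travel (negative if it is oncoming).
    """
    travelled = 0.0
    for _ in range(round(HORIZON_S / DT)):
        speed = max(0.0, speed + accel * DT)
        travelled += speed * DT
        gap += (obstacle_speed - speed) * DT
        if gap <= 0.0:
            return True, travelled
    return False, travelled


def choose_action(speed: float, gap: float, obstacle_speed: float) -> str:
    """Pick the collision-free action that makes the most progress."""
    safe = {}
    for name, accel in ACTIONS.items():
        collided, travelled = imagine(speed, gap, obstacle_speed, accel)
        if not collided:
            safe[name] = travelled
    if not safe:
        return "brake"  # no imagined future is safe: shed as much speed as possible
    return max(safe, key=safe.get)


if __name__ == "__main__":
    print(choose_action(speed=15.0, gap=30.0, obstacle_speed=0.0))
    print(choose_action(speed=15.0, gap=500.0, obstacle_speed=0.0))
```

Because the decision comes from simulated futures rather than from matching past examples, the same loop handles a scene it has never seen, provided the underlying model of the world is accurate.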

Wang Jun, the leader of Huawei's ADS development, once explained the advantages of WA with a vivid metaphor: “If we regard the autonomous driving system as a student, VLA is like someone who studies for an exam with a huge number of practice questions. When he encounters a question he has never seen before, he doesn't know what to do. WA, on the other hand, first understands the teaching content and can always find a solution by applying the rules.” Li Bin of NIO wrote in an internal email: “WA gives the car ‘imagination’ instead of ‘memory’.”

Theoretically, the WA system can fundamentally eliminate the VLA system's dependence on data. It generalizes and adapts better, especially in rare scenarios.

But these advantages are currently still theoretical. To achieve commercial implementation, the WA strategy still has to overcome three challenges: financing, data, and the balance between simulation and reality. For now, that makes it feasible only for “giant companies”.

Building a digital twin world involves multiple areas, including hardware, software development, and scenario modeling. Huawei has so far invested over 2 billion yuan in the WA strategy. The server cluster for the digital twin platform alone cost 500 million yuan, and the annual power and maintenance costs are 80 million yuan. NIO specifically established a “World Model Laboratory” for the development of the WA system. By 2024, it had invested over 1.5 billion yuan in total, which accounts for 40% of its total R&D expenditure.

▲ Image/Huawei

Such R&D investment excludes most small and medium-sized automakers. The founder of a new electric vehicle company once openly said: “We want to develop WA, but we can't afford it. Just building a basic digital twin scenario would cost at least 500 million yuan, which is equivalent to our three-year R&D budget. We simply can't bear it.” In comparison, the R&D investment for the VLA strategy is only one-tenth of that for WA, which makes it more suitable for companies with limited financial resources.

VLA enables cars to “speak”, WA enables them to “imagine”. One might be the...

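The views expressed in this article are solely those of the author; the 36Kr platform only provides information storage services.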