From Technology Routes to Personnel Changes: Why Has Intelligent Driving Started "Coining New Terms" Again?
High-quality intelligent driver assistance, mapless NOA, end-to-end solutions, VLA, the WEWA architecture, NWM... Every few months, new terms emerge in the automotive and intelligent driver assistance industry, a sign of how quickly technology in this field is developing.
However, this rapid technological development has a flip side. Vehicles that customers bought new only a year ago often cannot be upgraded to the new technologies, and customers' understanding can hardly keep up with the new terminology. There are also changes inside the automakers: from rule-based systems to end-to-end solutions and on to world models and physical-AI architectures, the intelligent driver assistance departments of the new manufacturers constantly have to cope with personnel changes and the departure of managers.
The intelligent driver assistance industry agrees that the period from the last quarter of this year to the first half of next year is crucial for bringing these technologies onto the road. With the upgrades to world models and VLA, the balance of advantage between automakers that rely on in-house development and solution specialists such as Momenta, Yuanrong Qixing, and WeRide can shift at any time.
From Rule-Based Systems to End-to-End Solutions and on to World Models
Rule-based driver assistance systems are based on core modules such as perception, prediction, planning, and control, which the industry refers to as a "modular solution." The advantage is that these systems can be easily mass-produced. However, the disadvantages are also obvious: the four independent modules work in series, resulting in long latencies and high information loss. As a result, vehicles often encounter difficulties on the road because their ability to interact with other road users is limited.
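To make the "modules in series" point concrete, here is a minimal, purely illustrative Python sketch (not any manufacturer's actual code; every function, value, and threshold is invented): each stage only sees the output of the previous one, so every handoff adds latency and discards whatever the next interface cannot carry.

```python
# Hypothetical modular stack: perception -> prediction -> planning -> control.
def perceive(sensor_frame):
    # Placeholder detector: pretend we found one lead vehicle 30 m ahead.
    return [{"position": (30.0, 0.0), "velocity": (-2.0, 0.0)}]

def predict(obstacles, horizon_s=2.0):
    # Constant-velocity extrapolation, a typical hand-written rule.
    return [{"future_x": o["position"][0] + o["velocity"][0] * horizon_s}
            for o in obstacles]

def plan(predictions, ego_speed=15.0):
    # Rule: halve the target speed if anything ends up closer than 20 m.
    too_close = any(p["future_x"] < 20.0 for p in predictions)
    return {"target_speed": ego_speed * 0.5 if too_close else ego_speed}

def control(plan_output, current_speed=15.0):
    # Proportional controller on the speed error.
    error = plan_output["target_speed"] - current_speed
    return {"throttle": max(0.0, 0.1 * error), "brake": max(0.0, -0.1 * error)}

# Strictly serial: the controller never sees the raw sensor data,
# only the abstractions the upstream modules chose to pass on.
commands = control(plan(predict(perceive(sensor_frame=None))))
```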
In August 2023, Tesla introduced the test version of FSD V12 based on an end-to-end solution. Since then, "end-to-end" has become a hot topic in the Chinese intelligent driver assistance industry. Huawei, XPeng, NIO, and Li Auto have successively developed similar solutions, and solution specialists such as Momenta have also presented end-to-end solutions.
Both rule-based systems and end-to-end solutions essentially aim to imitate and learn human driving behavior by processing huge amounts of human driving data. This requires steps such as data collection, annotation, and cleaning, so that the model can understand the data and the efficiency and accuracy of learning improve.
Basically, this learning process is similar to the human learning process when driving. However, the difference is that the system learns and corrects passively, while humans can actively learn and adjust their driving style.
Take, for example, an intersection with two left-turn lanes. Human drivers tend to choose the lane with less traffic and keep varying distances from the vehicles in front, whereas rule-based and end-to-end driver assistance systems usually choose the innermost lane. Or when a vehicle enters an on-ramp and hits a traffic jam: human drivers can choose when to merge, while the system may simply leave the vehicle standing still.
Another problem is that systems without active learning and correction capabilities cannot handle all possible situations. As Liu Xianming, the head of XPeng's Autonomous Driving Center, said: "Even if we can solve 99% of the edge cases (extreme situations) every day, we can never cover all cases. Only when we can consider all possible situations can we reach L4. This is an insoluble problem."
Li Xiang, the founder of Li Auto, explains that end-to-end systems cannot understand the real physical world. They only take three-dimensional images from the vision system and indicate a driving lane based on the vehicle speed. End-to-end systems are sufficient for most general scenarios, but problems can occur in particularly complex situations that they have not yet learned. This also explains why all mass-produced end-to-end systems have been two-stage rather than one-stage since last year.
To address the weaknesses of end-to-end systems, Li Auto has integrated a VLM (Vision-Language Model). However, since these models are open-source, their capability in traffic scenarios is limited, and they can only play a supporting role, such as reading the countdown timer at a red light and issuing driving commands in combination with the navigation map.
Imitating human driving behavior has proven insufficient to bring driver assistance systems to the L3 level, because imitation itself has inherent weaknesses: it needs an explicit reference model, and every possible driving maneuver has to be covered, like nesting dolls, where there is always another, smaller one inside.
"If the imitation route doesn't work, we should go back to the starting point," said Liu Xianming. "Autonomous driving is not just an imitation of learning but a new interpretation of the world and driving like a human."
Li Xiang shares similar views: The third stage, namely VLA (Vision-Language-Action Model), uses 3D vision and 2D information to understand the real physical world and can even read and understand navigation software and how it works. In contrast, a VLM only sees an image. A VLA also has its own brain system that can understand the physical world and perform driving maneuvers like a human based on its own language, way of thinking, and logical abilities.
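As a rough illustration of the vision → language → action chain just described (a hypothetical sketch only; the function names and scene descriptions are invented and do not reflect Li Auto's implementation), the key point is the intermediate language stage that the action stage reasons over:

```python
# Hypothetical V-L-A chain: images are first summarised as language tokens,
# and the action stage then reasons over that textual description.
def vision_to_language(camera_frames):
    # Stand-in for a vision-language encoder that describes the scene in words.
    return ["two left-turn lanes ahead", "traffic light red, countdown 5 s"]

def language_to_action(scene_tokens, navigation_hint):
    # Stand-in for the language/reasoning stage that produces a driving decision.
    if any("red" in token for token in scene_tokens):
        return {"maneuver": "stop", "target_speed": 0.0}
    return {"maneuver": navigation_hint, "target_speed": 8.0}

action = language_to_action(vision_to_language(camera_frames=None),
                            navigation_hint="turn_left")
```

That intermediate text is exactly the step the language-free approaches discussed below try to remove.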
Li Auto's VLA is also called the "VLA Driver Large Model." Its principle is to convert visual images into language and then perform actions. XPeng is even more radical: At the XPeng Tech Day 2025 on November 5th, He Xiaopeng announced that XPeng will directly skip the language conversion step in its new VLA model and instead convert multimodal physical signals from the cameras directly into continuous control commands.
XPeng's second VLA also has a reasoning process, but it is hidden inside the model and not carried out by an explicit language model. Removing the "L" (language) has two advantages. First, it makes the model simpler and more efficient and reduces information loss during signal transmission: video and IMU (Inertial Measurement Unit) signals can be converted directly into continuous control commands without a language intermediary. Second, it enables large-scale "self-supervised learning": the videos the vehicles collect from the physical world can be used directly as training data, giving the system strong generalization ability.
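The following generic sketch (assuming PyTorch; the dimensions, layer sizes, and class name are invented, not XPeng's design) shows the basic idea of regressing continuous control commands directly from fused camera and IMU features, with no intermediate text tokens:

```python
import torch
import torch.nn as nn

class LanguageFreePolicy(nn.Module):
    """Fuses video features and IMU signals and outputs continuous controls."""
    def __init__(self, video_dim=512, imu_dim=6, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(video_dim + imu_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Two continuous outputs: steering angle and longitudinal acceleration.
        self.control_head = nn.Linear(hidden, 2)

    def forward(self, video_features, imu_signals):
        fused = self.fuse(torch.cat([video_features, imu_signals], dim=-1))
        return self.control_head(fused)

policy = LanguageFreePolicy()
controls = policy(torch.randn(1, 512), torch.randn(1, 6))  # [steering, accel]
```

Because the training targets in such a setup can be the vehicle states the fleet actually recorded, raw driving video can in principle serve directly as training data, which is what makes the self-supervised, large-scale training described above possible.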
Liu Xianming said: "In every foreign market, XPeng doesn't have to create new maps or annotate data. As long as there are XPeng vehicles, we can train the model and quickly deploy and implement it."
Each company chooses a different route for the third stage of autonomous driving. Lang Xianpeng, vice president of autonomous driving R&D at Li Auto, believes that Huawei was strong in the era of rule-based systems and cannot be beaten with rule-based approaches. End-to-end was originally a new technological direction, but by now it is an old battlefield. "If Li Auto wants to produce real autonomous vehicles, it has to go to a different battlefield, namely VLA."
In March this year, Li Auto introduced its VLA technology package. Since then, the debate about whether VLA can work has intensified. Jin Yuzhi, the CEO of Huawei's Intelligent Automotive Solution Business Unit, says Huawei will not follow the VLA direction because converting videos into language tokens and then controlling the vehicle from them is "artificial." Huawei's WEWA architecture is similar to XPeng's second VLA: it also skips the language step and controls the vehicle directly through multimodal information such as vision, hearing, and touch.
Wu Yongqiao, the president of Bosch's Intelligent Driving and Control Systems business unit in China, named four difficulties in implementing VLA: aligning multimodal features is hard; extracting multimodal training data is hard; large language models inevitably "hallucinate"; and the memory bandwidth of today's driver assistance chips is not designed for large models and cannot move and process the required amounts of data.
Similar to Huawei, NIO has chosen the route of world models. Ren Shaoqing, chief expert and vice president of autonomous driving R&D at NIO, shares this assessment. He believes that VLA combines language and action but still relies on language, and the bandwidth of language models is not enough to handle the complexity and continuity of the real world.
In Liu Xianming's view, aligning multimodal features in VLA models can cause information loss. A video clip of barely six seconds from XPeng contains a wealth of visual information, road conditions, and vehicle movements. In a VLA model, this information must first be converted into text, and alignment means maximizing the accuracy of that conversion. "In fact, a description of thousands of words still loses information compared to the video itself. That is the problem of not getting what you see. We believe that removing the language step is the simplest and most efficient solution," said Liu Xianming.
All routes, however, ultimately lead to "high computing power, large amounts of data, and large models." XPeng's second VLA, intended for the Ultra versions of its vehicles, is supported by up to 2,250 TOPS of computing power, provided by three self-developed Turing AI chips. NIO has also developed its own chips. For NIO's world model, Ren Shaoqing advocates integrating reinforcement learning, because he sees it as the key to extending short-term imitation learning to long time horizons.
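A schematic sketch of that idea (generic return-weighted fine-tuning, not NIO's actual method; network sizes and data are placeholders): the policy is first trained by imitating expert steps, then fine-tuned so that long-horizon outcomes, not just per-step imitation, shape its behavior.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def imitation_step(expert_obs, expert_actions):
    # Behaviour cloning: copy the expert's action at each individual step.
    loss = ((policy(expert_obs) - expert_actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def reinforcement_step(rollout_obs, rollout_actions, returns):
    # Return-weighted regression: steps of a long rollout that led to a high
    # cumulative reward get more weight, so the policy is shaped by outcomes
    # over the whole horizon rather than by step-by-step imitation alone.
    weights = torch.softmax(returns, dim=0).unsqueeze(-1)
    loss = (weights * (policy(rollout_obs) - rollout_actions) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

imitation_step(torch.randn(16, 32), torch.randn(16, 2))
reinforcement_step(torch.randn(64, 32), torch.randn(64, 2), torch.randn(64))
```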
Before moving to NIO, Ren Shaoqing had founded the autonomous driving company Momenta with Cao Xudong in 2016. His move to NIO in 2020, when the company was going through a difficult phase, to lead autonomous driving development was sensational news at the time. He had recognized how important large amounts of high-quality data are for the AI-driven technological shift, which was the main reason for his move from Momenta to NIO.
Fluctuating Internal Organization
The shift in technological direction in autonomous driving began at the end of 2023. In 2024, XPeng postponed its Tech Day, originally planned for October 24, to November 7. Before the Tech Day, the XPeng P7+ was launched, and Li Liyun, who had just been appointed vice president and head of the Autonomous Driving Center, gave a presentation on end-to-end technology.
At that point, however, XPeng was already pursuing two research directions internally: the traditional VLA and an innovative VLA, the precursor of the second VLA, which skips the language step. Because there had been no progress for a long time, "even the team leaders were embarrassed to attend the monthly and weekly meetings." He Xiaopeng even considered concentrating on the traditional VLA first.
The light at the end of the tunnel for the innovative VLA appeared in the second quarter. At the request of the autonomous driving team, He Xiaopeng tested the model as it stood at the time, and the experience exceeded his expectations. He then decided to abandon the traditional VLA and focus on the second VLA instead.
Liu Xianming leads the innovative VLA project. He joined XPeng in March 2024 and had previously worked on machine learning and computer vision at Meta and Cruise. He Xiaopeng trusts him greatly and often talks with him for hours.
In October 2025, shortly after the National Day holidays, XPeng announced internally that Liu Xianming would replace Li Liyun as head of the Autonomous Driving Center. The two represent different technological directions in autonomous driving. Li Liyun focuses on driver assistance products and the roll-out of functions and was instrumental in making XPeng's NGP available in hundreds of cities. Liu Xianming, by contrast, focuses on building a world model that can simulate the physical world and make XPeng's autonomous driving technology more generalizable and deployable worldwide. The adjustment means that XPeng's technological direction in autonomous driving has shifted completely from function delivery to a foundation model.
Geely, NIO, and Li Auto had already adjusted their organizations before that...