A Handbook for Understanding the Jargon of Autonomous Driving: New EV Startups Are Creating New Terms Again
"End-to-end" has not yet been fully realized, "VLA" has quietly emerged, and the "world model" is becoming the new technological totem... The autonomous driving industry keeps producing more jargon, and it is becoming increasingly difficult to understand.
Both Li Auto and Xpeng have bet on VLA (Vision-Language-Action model) as the next-generation technological architecture, claiming that it can endow vehicles with the ability to "think." However, Huawei says that VLA is a "sneaky" technology and instead promotes its self-developed WA (World Behavior Architecture), directly deploying "world model" technology on the vehicle end. Ren Shaoqing, the vice president of NIO, emphasized in a recent interview that NIO was "the first to propose the concept of the world model in China."
Behind the endless stream of jargon is a battle to control the narrative around next-generation autonomous driving technology. As hardware and configurations become homogeneous, intelligent driving capability has become the most important identity tag for EV startups. Defining the future first means seizing the high ground of user perception and technological branding. The obscure jargon is not only a declaration of a technological route but also careful packaging of a technology brand.
Behind the noisy concepts, however, lie a gap in real-world experience and pressure on R&D teams. As Wu Xinzhou, then Xpeng's vice president of autonomous driving, said two years ago, "Autonomous driving is not advertising." Yet "futures-style" technology releases, in which features are announced long before they ship, remain common. A senior executive at a car company once confided his distress to "Yunjian Insight": to release new technologies ahead of competitors, the team is often pushed into the spotlight before the technology is mature. As the delivery deadline approaches, the team panics, because any delay or mistake after the release is unacceptable.
Technology is supposed to drive progress. But when words are coined faster than technology advances, users may get not a "disruptive experience" but a beta version that still needs continuous optimization. This article traces the evolution of autonomous driving technology behind the terms and attaches a "jargon usage manual" for users.
The Origin of the Jargon
Before 2022, the technological evolution path of the autonomous driving industry was relatively clear, mainly defined by Tesla and Waymo. Most of the technical terms were objective descriptions of specific functions.
The early assisted driving systems were based on rules written by engineers and were divided into three major modules: perception, planning, and control. Since 2016, Tesla has led the industry from the rule era to the AI (Artificial Intelligence) era through self-developed software algorithms and FSD chips.
In 2021 and 2022, Tesla's two consecutive AI Day events had a profound impact on the industry. At the first AI Day, Tesla announced the BEV + Transformer architecture. This solution projects the 2D images captured by multiple cameras onto a unified top-down coordinate system to form a 360° bird's-eye view (BEV) around the vehicle, effectively addressing the problems of occlusion and perspective. At the same time, Tesla also proposed an early concept of the Occupancy Network, which directly converts 2D images into a 3D vector space.
Before that, the traditional method was to use convolutional neural networks (CNNs) to process each camera's 2D images separately and then fuse the results into a 3D environment. BEV + Transformer instead fuses features across cameras early ("pre-fusion"), greatly improving perception.
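The idea of a "unified top-down coordinate system" can be illustrated with a small geometric sketch. It is not Tesla's method: BEV + Transformer learns the mapping from image features, whereas this sketch assumes each camera has already produced ground-plane points and that its mounting pose is known. All function names, poses, and grid sizes here are hypothetical.

```python
import math

def camera_to_vehicle(point_cam, cam_pose):
    """Move a ground-plane point from a camera frame into the vehicle frame.

    point_cam: (forward, left) in metres, relative to the camera.
    cam_pose:  (x, y, yaw) of the camera in the vehicle frame, yaw in radians.
    """
    fwd, left = point_cam
    x0, y0, yaw = cam_pose
    x = x0 + fwd * math.cos(yaw) - left * math.sin(yaw)
    y = y0 + fwd * math.sin(yaw) + left * math.cos(yaw)
    return x, y

def to_bev_cell(point_vehicle, grid_range=50.0, cell_size=0.5):
    """Map a vehicle-frame point to a (row, col) cell of a square top-down grid.

    Returns None when the point lies outside the grid.
    """
    x, y = point_vehicle
    if abs(x) >= grid_range or abs(y) >= grid_range:
        return None
    row = int((grid_range - x) / cell_size)  # row 0 is the far front edge
    col = int((grid_range - y) / cell_size)  # col 0 is the far left edge
    return row, col

def fuse_to_bev(detections_by_camera, cam_poses, grid_range=50.0, cell_size=0.5):
    """Merge points seen by several cameras into one set of occupied BEV cells."""
    occupied = set()
    for cam, points in detections_by_camera.items():
        for p in points:
            cell = to_bev_cell(camera_to_vehicle(p, cam_poses[cam]),
                               grid_range, cell_size)
            if cell is not None:
                occupied.add(cell)
    return occupied
```

Once every camera writes into the same grid, downstream logic no longer needs to know which camera saw an object, which is the practical benefit of fusing before, rather than after, per-camera processing.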
The BEV + Transformer approach also allowed Tesla to drop its dependence on high-precision maps and generalize to a wider range of scenes using only the sensors on the vehicle. Later, in FSD Beta V11, Tesla extended the NOA (Navigation on Autopilot) function from highways to urban roads.
China's EV startups and assisted driving suppliers quickly followed this technology. However, because their neural network algorithms lagged behind Tesla's and they had doubts about the pure vision route, they generally also integrated data from lidar or 4D millimeter-wave radar in the early stage.
From 2022 to 2023, Xpeng's XNGP, NIO's NOP+, Li Auto's AD Max 3.0, and Huawei's ADS 2.0 successively brought their self-developed BEV + Transformer solutions into mass production. Armed with these, they launched a race to "open cities" for mapless NOA.
Paradigm Shift: End-to-End
If the AI Day in 2021 triggered a revolution in perception technology, then the AI Day in 2022 completely broke the boundary between perception and planning, promoting a paradigm shift centered on "end-to-end."
At the event, Tesla previewed the architecture of FSD Beta V12: a single large neural network handles perception and planning together, replacing the 300,000 lines of code written by engineers. The upgraded Occupancy Network identifies unknown obstacles by dividing 3D space into tiny voxels, a step change in perception ability.
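The voxel representation mentioned above can be sketched in a few lines. A real occupancy network predicts which voxels are occupied directly from camera features; this sketch skips that step and starts from 3D points, purely to show why a voxel grid can flag an obstacle without knowing what the obstacle is. The function names, voxel size, and corridor bounds are assumptions for illustration.

```python
import math

def voxelize(points, voxel_size=0.2):
    """Return the set of integer voxel indices (i, j, k) containing at least one point."""
    occupied = set()
    for x, y, z in points:
        occupied.add((math.floor(x / voxel_size),
                      math.floor(y / voxel_size),
                      math.floor(z / voxel_size)))
    return occupied

def corridor_blocked(occupied, x_range, y_range, z_range, voxel_size=0.2):
    """Check whether any occupied voxel centre lies inside an axis-aligned corridor.

    Each range is a (min, max) pair in metres in the vehicle frame.
    """
    for i, j, k in occupied:
        cx = (i + 0.5) * voxel_size
        cy = (j + 0.5) * voxel_size
        cz = (k + 0.5) * voxel_size
        if (x_range[0] <= cx <= x_range[1]
                and y_range[0] <= cy <= y_range[1]
                and z_range[0] <= cz <= z_range[1]):
            return True
    return False
```

Nothing in the check depends on an object class, so an overturned truck or a fallen box is treated the same way: as occupied space in the driving corridor.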
China's EV startups once again "crossed the river by feeling Tesla's stones" and collectively turned to the end-to-end architecture. Among them, Xpeng went so far as to abandon lidar and switch fully to the pure vision route.
However, for the sake of system safety and maturity, Xpeng and Huawei both initially adopted a relatively conservative "multi-stage" end-to-end approach, replacing the perception and planning modules with separate models rather than fully integrating them. In the XBrain architecture released by Xpeng, perception is handled by the XNet network under the BEV + Transformer architecture, and the XPlanner model is responsible for planning. It was not until mid-2024 that Xpeng announced the rollout of a "single-stage" end-to-end system to all models based on the Fuyao architecture.
Huawei's ADS 2.0 also used a two-stage end-to-end approach (BEV perception + PDP prediction and planning). In 2024, Huawei announced that ADS 3.0 would upgrade to an "end-to-end" architecture, removing the BEV network and using the GOD network for perception and the PDP network for prediction, decision-making, and planning. However, a senior intelligent driving executive told "Yunjian Insight" at the end of that year that Huawei's technical solution at the time was still essentially multi-stage.
A technical staff member in the autonomous driving industry pointed out the challenges: in the early stage, China's EV startups had a limited understanding of models, and a multi-stage design made safety easier to ensure. When problems occurred in a traditional system, engineers could fix them by modifying the code. The end-to-end model, however, is a black box, with a higher ceiling but also a lower floor. "If something goes wrong, you don't even know how to fix it."
NIO's transition to end-to-end technology was accompanied by organizational restructuring. In June 2024, NIO announced that it would merge its perception and planning departments into a large-model team to fully promote end-to-end R&D. Half a year later, Ren Shaoqing took over this department. In January 2025, the Banyan 3.1.0 intelligent system, based on the end-to-end architecture, was officially pushed to vehicles.
Li Auto released a dual-system solution of "end-to-end + VLM" (Vision-Language Model) in 2024. The end-to-end model is responsible for "fast thinking" and handles most routine scenarios, while the VLM is responsible for "deep thinking" and deals with the few complex situations.
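The division of labor in such a dual system can be sketched as a simple dispatcher. How Li Auto actually arbitrates between the two models is not described here, so everything below is an assumption: the `Scene` fields, the stand-in functions, the speed values, and the rule that the slow system may only lower the fast system's target.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    lead_gap_m: float  # distance to the vehicle ahead, in metres
    unusual: bool      # hypothetical flag: construction zone, odd signage, etc.

def fast_system(scene):
    """Stand-in for the end-to-end model: proposes a target speed (km/h) every frame."""
    return 60.0 if scene.lead_gap_m > 30.0 else 30.0

def slow_system(scene):
    """Stand-in for the VLM: returns an advisory speed cap (km/h) for complex scenes."""
    return 20.0

def drive(scene):
    """Run the fast path always; consult the slow path only when the scene is unusual."""
    target = fast_system(scene)
    if scene.unusual:
        # Assumption: the slow system can only make the decision more conservative.
        target = min(target, slow_system(scene))
    return target
```

The point of the split is latency: the fast path must answer on every frame, while the slower, more capable model is invoked only for the minority of scenes that need it.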
Horizon, a supplier of intelligent driving chips and solutions, proposed a similar architecture earlier. In April this year, it released the HSD solution based on Journey 6P, which uses a single-stage end-to-end + VLM architecture. This solution is scheduled for mass production on the Chery Xingjiyuan ET5 in November this year.
At a media communication meeting in September, Lü Peng, Horizon's vice president and head of strategy, intelligent driving product planning, and marketing, divided the evolution of the end-to-end system into three generations:
First generation: Two-stage end-to-end. The perception and planning modules process the vehicle's lateral and longitudinal information separately and then splice the tasks together. The overall experience feels fragmented.
Second generation: Single-stage end-to-end + heavy post-processing. The trajectory output directly by the end-to-end system has many defects, so rules are applied afterwards to correct the lateral and longitudinal information and then combine them.
Third generation: A more thorough end-to-end. Perception information goes in, and the driving trajectory comes out. Compared with the first two generations, it responds faster, loses less information, coordinates lateral and longitudinal motion better, and ultimately delivers a more human-like driving experience.
At the media communication meeting in April, Yu Kai, the CEO of Horizon, admitted that although all companies were vigorously promoting their solutions as leading, there was no real single-stage end-to-end in China at that time.
Autonomous Vehicles Are "Wheeled" Robots
Before the end-to-end era, the autonomous driving industry mainly "copied homework" from Tesla. However, now that Tesla no longer discloses technical details, China's EV startups have to catch up and explore at the same time. The booming generative AI and humanoid robot industries have become their new teachers.
In 2023, the success of ChatGPT verified the ability of a single large neural network to handle complex multi-modal tasks. The shift in training method from imitation learning to reinforcement learning has also extended to the autonomous driving industry. Research from the robotics field, such as VLA (Vision-Language-Action) models and the world model, has also been introduced into autonomous driving.
VLA was initially used to enable robots to understand human language instructions and perform actions. In 2023, Google DeepMind's RT-2 (Robotic Transformer 2) model was jointly trained on a large amount of image, text, and robot action data to form a VLA model. Subsequently, the open-source model OpenVLA emerged, greatly lowering the barrier to VLA research.
Autonomous vehicles are often regarded as "wheeled robots" that perform fixed tasks, controlling the steering wheel, accelerator, and brakes by understanding maps, navigation, and human voice instructions. Tesla's end-to-end system architecture is considered to apply the concept of VLA.
DeepRoute.ai, a Chinese intelligent driving supplier, was the first company to publicly claim to apply VLA technology to autonomous driving. As early as September 2023, DeepRoute.ai proposed to develop an end-to-end model with "one-step perception and decision-making" and officially named it VLA in April 2024, planning for mass production within this year.
However, the market changed rapidly. In March this year, Li Auto suddenly announced the switch of its dual-system solution to a VLA solution and, ahead of its competitors, achieved mass production on the Li i8 in August this year.
Xpeng plans to push its VLA solution in the third quarter of this year, a few months later than its competitors. However, it has packed 2,200 TOPS of computing power into the vehicle (Ultra version), with about 1,200 TOPS used for assisted driving. In the same period, Li Auto's AD Max offers 700 TOPS, and NIO's Shenji NX9031 chip offers 1,000 TOPS. He Xiaopeng, the CEO of Xpeng Motors, predicts that Tesla's next-generation hardware platform, AI 5, will deliver between 2,000 and 4,000 TOPS.
This computing power competition has also spread to the cloud. Tesla announced the construction of the Dojo computing cluster in 2019. After switching to purchasing chips externally in 2024, it has stocked a large number of NVIDIA and Samsung chips. In 2025, it is expected to hold a cumulative 85,000 NVIDIA H100 chips. Xpeng and Li Auto have also increased their investment in cloud computing power. Xpeng says its cloud computing power is 10 EFLOPS, and Li Auto says its own exceeds 13 EFLOPS.
Both companies are using cloud computing power to develop base models with larger parameter counts. The success of DeepSeek has shown car companies that self-developing base models at a controllable cost is possible. Li Auto's base model was initially used in the intelligent cockpit and mobile app, led by Chen Wei, the head of Li Auto's intelligent space AI, and was later extended to autonomous driving.
Liu Xianming, the current head of Xpeng's autonomous driving center, is responsible for the R&D of Xpeng's base model. At a communication meeting in May this year, Liu Xianming said that Xpeng's base model has 72 billion parameters, 35 times that of the mainstream VLA model. A smaller model (XVLA) will then be produced through post-training, reinforcement learning, model distillation, and similar steps, and deployed on the vehicle end. The VLM based on the same base model will also be deployed in the intelligent cockpit of the Ultra version this year.
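A quick back-of-the-envelope calculation shows why a smaller distilled model is needed for the vehicle. The 72 billion figure and the 35x ratio come from the statement above; the assumption that weights are stored at 2 bytes per parameter (16-bit precision) is ours and is only meant to give a sense of scale.

```python
BASE_PARAMS = 72e9   # parameters in the base model, as stated
RATIO = 35           # stated ratio to a "mainstream" VLA model
BYTES_PER_PARAM = 2  # assumption: 16-bit weights

def weight_gigabytes(params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

# Implied size of the "mainstream" VLA model the base model is compared against.
mainstream_params = BASE_PARAMS / RATIO

print(f"mainstream VLA: about {mainstream_params / 1e9:.2f} billion parameters")
print(f"base model weights: about {weight_gigabytes(BASE_PARAMS):.0f} GB")
print(f"mainstream weights: about {weight_gigabytes(mainstream_params):.1f} GB")
```

Under that assumption the implied "mainstream" model has roughly 2 billion parameters, and the base model's weights alone would occupy on the order of 144 GB, which helps explain why it stays in the cloud while a distilled model runs in the car.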
World Model: From Simulation to Vehicle Control
Besides VLA, NIO and Huawei have chosen another path: directly applying the world model to real-time vehicle control on the vehicle end. Previously, the world model was mainly used for data generation and simulation testing.
Research on the world model in the AI industry began with the 2018 paper "World Models" by David Ha and Jürgen Schmidhuber. This model enables AI agents to plan and learn through "imagination" and then transfer what they have learned to the real environment.
Robot simulation platforms were the first to use the world model or a similar framework to enable virtual robots to learn to manipulate objects, navigate, and perform simple grasping tasks. Because much of the training happens through "imagination" inside the model, fewer real-world interactions are needed.
Since 2022, Tesla's approach of constructing a 3D space through the Occupancy Network has applied the idea of the world model. Li Auto and Xpeng have also subsequently used the world model for simulation testing and cloud training.
Lang Xianpeng, Li Auto's vice president of intelligent driving, said in an interview with "Yunjian Insight" last year that Li Auto was using the world model to build an "examination system" to test R&D results in a simulation environment. Xpeng used the world model to train its base model with 72 billion parameters, simulating how the environment changes as the vehicle's position and perspective change.
NIO and Huawei's application of the world model is more radical. NIO directly deployed the world model on the vehicle end and named it NWM. Vice President Ren Shaoqing explained that NWM can generate 216 possible trajectories every 0.1 seconds during driving and select the optimal solution through evaluation.
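The "generate many candidates, evaluate, pick the best" loop described above can be sketched as follows. This is not NIO's NWM, which is a generative model; it only illustrates the selection step. The 12 x 18 grid is chosen purely so that the candidate count equals 216, and the cost terms, horizon, and names are all hypothetical.

```python
from itertools import product

# Illustrative grid only: 12 lateral offsets x 18 speeds = 216 candidates.
LATERAL_OFFSETS_M = [-2.75 + 0.5 * i for i in range(12)]  # -2.75 ... 2.75
SPEEDS_MPS = [float(v) for v in range(18)]                # 0 ... 17
HORIZON_S = 3.0                                           # look-ahead time

def candidates():
    """Enumerate every (lateral offset, speed) pair as one candidate trajectory."""
    return list(product(LATERAL_OFFSETS_M, SPEEDS_MPS))

def cost(candidate, obstacle, desired_speed=12.0):
    """Score a candidate; lower is better.

    obstacle: (distance ahead, lateral offset) in metres.
    """
    offset, speed = candidate
    obs_ahead_m, obs_offset_m = obstacle
    reaches = speed * HORIZON_S >= obs_ahead_m   # gets as far as the obstacle
    overlaps = abs(offset - obs_offset_m) < 1.0  # passes within 1 m laterally
    collision = 1000.0 if (reaches and overlaps) else 0.0
    # Prefer staying near the lane centre and near the desired speed.
    return collision + abs(offset) + abs(speed - desired_speed)

def select_best(obstacle):
    """Return the lowest-cost candidate for the current obstacle."""
    return min(candidates(), key=lambda c: cost(c, obstacle))
```

In a running system a function like `select_best` would be called again on every planning cycle (every 0.1 seconds in the description above), so the choice is continually revised as the scene changes.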
The WEWA architecture released by Huawei in April this year also uses the world model for real-time vehicle control. Jin Yuzhi, the CEO of Huawei's Intelligent Automotive Solution Business Unit, regards it as "the ultimate solution to autonomous driving."
However, these cutting-edge technologies still need to be tested. An industry insider said that Li Auto is also discussing applying the world model to the vehicle end, but since the technology is not mature, it is still in the research stage. Another person close to NIO said that NIO's NWM model has not fully achieved the prediction capability it claims, and there is still a long way to go in R&D.
Conclusion
Terms are originally precise definitions of technology. Looking back at the evolution of autonomous driving technology, the rise of each term represents an exploration of the industry.
Tesla's early jargon was accepted by the industry because of its pioneering practices, and its user experience has always been leading. The current explosion of terms, however, often amounts to spending a future vision in advance.
What's more, some companies deliberately use vague jargon to obscure the substance of the technology and mask the gap with competitors.
When terms change from their original definitions to marketing buzzwords, users need to distinguish not only the technological differences among different companies but also the gap between the packaging of words and the real experience.
In this dual competition of technology and words, the ultimate winner may not be the company that first proposed a new concept but the one that can turn technological promises into user experience.
A Jargon Manual for Autonomous Driving
Rules/Models
Early assisted driving systems relied on rules, that is, instruction code written by engineers, and were divided into three modules: perception, planning, and control. The perception module collects information about the vehicle's surroundings through sensors such as cameras and lidar; the planning module formulates driving strategies based on the perception data, avoids obstacles, and continuously optimizes the driving trajectory; the control module executes the planning instructions and operates the steering, accelerator, and brakes through the drive-by-wire system.
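A toy version of the three-module, rule-based pipeline makes the structure concrete. The thresholds, field names, and command values below are invented for illustration and are far simpler than any production system.

```python
def perceive(sensor_frame):
    """Perception: reduce raw readings to the distance of the nearest obstacle ahead."""
    return min(sensor_frame["obstacle_distances_m"], default=float("inf"))

def plan(nearest_m, speed_limit_kmh):
    """Planning: hand-written rules choose a target speed in km/h."""
    if nearest_m < 10.0:
        return 0.0                    # too close: stop
    if nearest_m < 40.0:
        return speed_limit_kmh * 0.5  # something ahead: slow down
    return speed_limit_kmh            # clear road: drive at the limit

def control(current_kmh, target_kmh):
    """Control: turn the target speed into a throttle or brake command."""
    if target_kmh > current_kmh:
        return {"throttle": 0.3, "brake": 0.0}
    if target_kmh < current_kmh:
        return {"throttle": 0.0, "brake": 0.5}
    return {"throttle": 0.0, "brake": 0.0}

def step(sensor_frame, current_kmh, speed_limit_kmh=60.0):
    """One cycle of the pipeline: perception -> planning -> control."""
    nearest = perceive(sensor_frame)
    target = plan(nearest, speed_limit_kmh)
    return control(current_kmh, target)
```

Every behavior here is an explicit `if` that an engineer wrote and can edit, which is both the appeal of the rule-based approach and the reason it struggles with situations nobody anticipated.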
The model (Transformer) is a neural network trained on large amounts of data. By analyzing driving scene data, it learns complex patterns on its own and derives coping strategies for various traffic conditions. When faced with unseen scenarios, the model can generalize and make human-like decisions. The shift from relying on rules to being driven by models is an important milestone in the evolution of autonomous driving systems.
BEV + Transformer
A visual perception technology proposed by Tesla.