
The history of faction disputes in autonomous driving

脑洞汽车 · 2025-09-28 10:48
These unresolved disputes are leading autonomous driving towards an uncertain future.

The commercial implementation of autonomous driving is accelerating globally.

As of May 2025, Waymo had 1,500 autonomous taxis operating in San Francisco, Los Angeles, Phoenix, and Austin in the United States, completing more than 250,000 paid trips per week. Baidu Apollo had deployed over 1,000 driverless cars worldwide, providing more than 11 million rides in total and logging over 170 million kilometers of safe driving.

Large-scale implementation may seem to imply that the technology is already mature, but that's not the case. There are still many divergent schools of thought on autonomous driving that have yet to reach a consensus.

For example, in terms of sensor solutions, how should one choose between a pure vision solution and a multi-sensor fusion solution? In the system architecture, should one adopt a modular design or embrace the emerging end-to-end architecture? Furthermore, regarding how to understand the world, which is better, VLA or VLM?

These unresolved controversies are leading autonomous driving towards an uncertain future. Understanding these different technical routes means understanding where autonomous driving comes from, where it's going, and how to achieve the technology's self-evolution.

The Battle of the Eyes

Pure Vision vs. Multi-Sensor Fusion

Everything starts with "seeing". How a car perceives the world is the cornerstone of autonomous driving. On this issue, there have been two long-standing opposing camps, and the debate continues.

The story can be traced back to a challenge in the Mojave Desert in the United States in 2004.

At that time, the U.S. Defense Advanced Research Projects Agency (DARPA) put up a $1 million prize, drawing dozens of top universities and research institutions to attempt an answer to one question: how can a vehicle perceive its surrounding environment?

The lidar-centered approach championed by the Carnegie Mellon and Stanford teams emerged as the winning formula. This technology, which generates precise 3D point-cloud maps, laid the foundation for autonomous driving's early development route and was carried forward by Google's self-driving project, which later became Waymo.

However, this school has a fatal weakness: cost. An early lidar unit cost as much as $75,000, more than the car itself, which confined the approach to small, elite fleets and made large-scale commercialization difficult.

Ten years later, the vision school represented by Tesla took a different path.

They advocate simplicity: "Humans can drive with just a pair of eyes and a brain. Why can't machines?"

In 2014, Tesla launched the Autopilot system, adopting Mobileye's vision solution and centering the system on cameras. In 2016, Elon Musk publicly dismissed lidar as pointless, formally establishing the pure-vision technical route.

Tesla approximates the human field of view with eight surround cameras and relies on deep-learning algorithms to reconstruct the 3D environment from 2D images. The pure vision approach is extremely cheap, so it can be commercialized at scale: selling more cars means collecting more real-world data, forming a "data flywheel" that feeds back into algorithm iteration and makes the system stronger the more it is used.

However, cameras are "passive" sensors and rely heavily on ambient light. In situations such as backlighting, glare, at night, in heavy rain, or in thick fog, their performance will significantly decline, far inferior to lidar.

The multi-sensor fusion camp, centered on lidar, holds that machine intelligence will not fully match human common sense and experience-based intuition in the foreseeable future, so in bad weather hardware redundancy such as lidar must compensate for the shortcomings of the software.

It can be said that the pure vision solution concentrates all the pressure on the algorithm, betting on the future of intelligence. The multi-sensor fusion solution focuses more on engineering implementation and chooses a proven real-world solution.

Currently, most mainstream players (such as Waymo, XPeng, and NIO) side with multi-sensor fusion. They regard safety as a red line that autonomous driving must not cross, and redundancy as the only way to guarantee it.

It's worth noting that the two routes are not completely distinct but are learning from and integrating with each other. The pure vision solution is also introducing more sensors, and in the multi-sensor fusion solution, the role of vision algorithms is becoming increasingly important and is becoming the key to understanding scene semantics.

The Battle of Touch

Lidar vs. 4D Millimeter-Wave Radar

Even within the multi-sensor fusion camp, there's a choice to be made:

The cost of a millimeter-wave radar is only a few hundred dollars, while early lidar cost tens of thousands of dollars. Why spend so much on lidar?

Lidar constructs extremely detailed 3D point-cloud images of the surrounding environment by emitting laser beams and measuring how long they take to return, solving the fatal corner cases (extreme situations) that other sensors could not handle at the time.

It has extremely high angular resolution and can clearly distinguish the posture of pedestrians, the outline of vehicles, and even small obstacles on the road surface. In the field of L4/L5 commercial autonomous driving, no other sensor can meet the two requirements of "high precision" and "detecting static objects" simultaneously. To achieve the most basic autonomous driving functions and safety redundancy, the cost of lidar is a price that automakers have to pay.
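
As a back-of-the-envelope illustration of the time-of-flight principle described above (a toy calculation in Python, not any vendor's firmware), each returned laser pulse yields a range from half its round-trip travel time:

C = 299_792_458  # speed of light in m/s

def tof_range_m(round_trip_s: float) -> float:
    # Distance to the reflecting surface: the pulse travels out and back,
    # so the one-way range is half of (speed of light x elapsed time).
    return C * round_trip_s / 2

# A pulse that returns after roughly 667 nanoseconds came from a target ~100 m away.
print(f"{tof_range_m(667e-9):.1f} m")  # prints 100.0 m

Real sensors repeat this measurement at very high rates across many beam angles, which is what builds up the dense point cloud.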

If lidar is already so powerful, why develop other sensors?

Lidar delivers extremely high performance, but it has its limits. Its lasers sit in the near-infrared band with very short wavelengths, and particles such as raindrops, fog droplets, snowflakes, and dust scatter and absorb the beams, generating large numbers of noisy points in the cloud.

The 4D millimeter-wave radar, by contrast, works around the clock. In bad weather, its strong penetration lets it detect obstacles ahead first and supply distance and velocity data. However, its echoes are very sparse, forming only a handful of points; it cannot outline the shape and contour of objects the way lidar can, and electronic interference can produce "ghost" detections. Its low resolution means it cannot serve as the primary sensor and is used only as an auxiliary one.

Lidar and millimeter-wave radar therefore each have their strengths and weaknesses. They are not substitutes but complements, following the logic of "millimeter-wave radar to control costs in ordinary scenarios, lidar to guarantee safety in complex ones", with configurations varying across vehicle models.

L4 robotaxis and luxury models usually take a "lidar first, millimeter-wave radar as supplement" strategy, stacking sensors regardless of cost to chase the ceiling of safety and performance. Mass-produced L2+ and L3 economy cars rely mainly on cameras plus millimeter-wave radar, adding one or two lidars at key positions such as the roof to form a cost-effective package.

The debate among automakers about sensor selection is essentially a technical exploration and business game about "how to achieve the highest safety at the lowest cost". In the future, various sensors will be further integrated to form diverse matching solutions.

The Battle of the Brain

End-to-End vs. Modular

If sensors are the eyes, then algorithms are the brain.

For a long time, autonomous driving systems have adopted a modular design. The entire driving task is broken down into independent subtasks such as perception, prediction, planning, and control. Each module has its own responsibilities, with independent algorithms and optimization goals, like a well-defined assembly line.

The advantages of the modular design are strong interpretability, parallel development, and easy debugging. However, local optimization does not add up to global optimization, and this divide-and-conquer model has a fatal flaw: each module simplifies and abstracts the information it passes on, so the original rich signal is gradually lost as it travels layer by layer through the pipeline, making globally optimal performance hard to achieve.
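
A minimal sketch of that assembly line in Python (the module names, interfaces, and thresholds here are illustrative, not any production stack) shows how each stage only ever sees the previous stage's simplified summary:

from dataclasses import dataclass

@dataclass
class Obstacle:
    x: float   # position in the ego frame, metres
    y: float
    vx: float  # velocity, m/s
    vy: float

def perceive(sensor_frame: dict) -> list[Obstacle]:
    # Perception: raw sensor data -> tracked obstacles (most raw detail is discarded here).
    return [Obstacle(*det) for det in sensor_frame.get("detections", [])]

def predict(obstacles: list[Obstacle], horizon_s: float = 2.0) -> list[tuple]:
    # Prediction: constant-velocity extrapolation of each obstacle's future position.
    return [(o.x + o.vx * horizon_s, o.y + o.vy * horizon_s) for o in obstacles]

def plan(predicted: list[tuple]) -> dict:
    # Planning: stop if anything is predicted to end up close and directly ahead.
    danger = any(0.0 < x < 10.0 and abs(y) < 1.5 for x, y in predicted)
    return {"target_speed": 0.0 if danger else 15.0}

def control(plan_out: dict, current_speed: float) -> dict:
    # Control: a crude proportional controller toward the planned speed.
    error = plan_out["target_speed"] - current_speed
    return {"throttle": min(max(error, 0.0) * 0.1, 1.0),
            "brake": min(max(-error, 0.0) * 0.1, 1.0)}

# One tick of the pipeline: each hand-off is easy to inspect, but also lossy.
frame = {"detections": [(8.0, 0.5, -1.0, 0.0)]}
print(control(plan(predict(perceive(frame))), current_speed=12.0))

Each stage can be tested and debugged in isolation, which is exactly the interpretability advantage described above; the price is that only the summarized outputs survive each hand-off.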

From 2022 to 2023, the "end-to-end" approach, exemplified by Tesla's FSD V12, emerged and upended the traditional paradigm. Its inspiration comes from the way humans learn: novice drivers do not first study optics and then memorize traffic rules; they learn to drive by watching the instructor drive.

The end-to-end model draws no artificial module boundaries. Instead, by learning from vast amounts of human driving data, it builds one large neural network that maps the raw sensor input directly to final control commands such as steering angle, accelerator, and brake.
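
A toy sketch of that direct mapping in Python with PyTorch (an illustrative network, not Tesla's FSD architecture): camera pixels go in, steering, throttle, and brake come out, with no hand-built perception or planning stage in between.

import torch
import torch.nn as nn

class ToyEndToEndDriver(nn.Module):
    """Toy end-to-end policy: images from 8 cameras in, 3 control values out."""

    def __init__(self, num_cameras: int = 8):
        super().__init__()
        # A small shared encoder applied to every camera image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse all camera features and regress [steering, throttle, brake].
        self.head = nn.Sequential(
            nn.Linear(32 * num_cameras, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, cameras, channels, height, width)
        b, n, c, h, w = images.shape
        features = self.encoder(images.view(b * n, c, h, w)).view(b, -1)
        return self.head(features)

# Training is pure imitation: minimise the gap between the network's outputs and the
# steering/throttle/brake a human driver actually applied on the same frames.
model = ToyEndToEndDriver()
controls = model(torch.randn(1, 8, 3, 128, 256))
print(controls.shape)  # torch.Size([1, 3])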

Unlike the modular approach, the end-to-end model loses no information along the way, has a higher performance ceiling, and further simplifies the development process. But it brings a black-box problem: the source of an error is hard to trace. Once an accident occurs, it is difficult to determine which step went wrong or how to optimize it afterwards.

The emergence of the end-to-end model has shifted autonomous driving from rule-driven to data-driven. However, its black-box nature has deterred many automakers that put safety first, and only companies with large fleets can supply the massive amounts of training data required.

Therefore, a compromise "explicit end-to-end" solution has emerged in the industry, which retains intermediate outputs such as drivable areas and target trajectories in the end-to-end model, attempting to find a balance between performance and interpretability.
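
One way to picture this "explicit" compromise (again a toy sketch, not any vendor's actual design) is a network that still ends in control commands but is forced to expose supervised intermediate outputs, such as a drivable-area grid and a planned trajectory, that engineers can inspect:

import torch
import torch.nn as nn

class ToyExplicitEndToEnd(nn.Module):
    """Toy 'explicit end-to-end' network: control output plus inspectable intermediates."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU())
        self.drivable_area = nn.Linear(128, 64 * 64)  # coarse bird's-eye drivable grid
        self.trajectory = nn.Linear(128, 10 * 2)      # 10 future (x, y) waypoints
        self.controls = nn.Linear(128, 3)             # steering, throttle, brake

    def forward(self, fused_features: torch.Tensor) -> dict:
        h = self.trunk(fused_features)
        return {
            "drivable_area": self.drivable_area(h).view(-1, 64, 64),
            "trajectory": self.trajectory(h).view(-1, 10, 2),
            "controls": self.controls(h),
        }

# Each intermediate head gets its own labelled loss during training, so the model stays
# end-to-end differentiable while leaving a human-readable trail for engineers to audit.
out = ToyExplicitEndToEnd()(torch.randn(1, 256))
print(out["trajectory"].shape, out["controls"].shape)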

The Battle of the "Soul"

VLM vs. VLA

With the development of AI, a new battlefield has opened up around large models, and it concerns the very soul of autonomous driving: should that soul be a thinker that advises the driving system (VLM), or a doer that acts on its own (VLA)?

The VLM (vision-language model) camp believes in collaboration and pursues controllable processes; it is also known as the enhancement school. This route holds that, powerful as large AI models are, hallucinations are fatal in a safety-critical field. The models should do what they are best at (understanding, explaining, reasoning), while the final decision-making power stays with the traditional autonomous driving modules that have been verified over decades and remain predictable and tunable.

The VLA (vision-language-action model) camp believes in emergence and pursues the optimal result; it is billed as the ultimate form of end-to-end. This school argues that with a large enough model and enough data, AI can learn every detail and rule of driving from scratch, and its driving ability will ultimately surpass both humans and rule-based systems.

The debate around VLM and VLA is like a continuation of the debate between the modular and end-to-end solutions.

VLA suffers from the black-box dilemma of being hard to trace. If a VLA-driven car suddenly brakes hard, engineers can barely reconstruct the cause: did it mistake a shadow for a pothole, or did it pick up a bad habit from a human driver? It cannot easily be debugged or verified, which fundamentally conflicts with the automotive industry's strict functional-safety standards.

A VLM-based system, by contrast, can be decomposed, analyzed, and optimized at every step. If something goes wrong, engineers can see exactly that the traditional perception module detected an object, the VLM identified it as "a plastic bag blown by the wind", and the planning module decided "no emergency braking needed, just slow down slightly". In the event of an accident, responsibility can be clearly assigned.
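
A rough sketch of that division of labour in Python (the function names, scores, and thresholds are invented for illustration): the language model only supplies a semantic label and an explanation, while a rule-based planner that can be audited line by line keeps the final say.

def vlm_describe(image_crop) -> dict:
    # Stand-in for a vision-language model call. Here we simply hard-code the
    # plastic-bag example from the text; a real system would query a model.
    return {"label": "plastic bag blown by the wind",
            "hazard_score": 0.05,
            "explanation": "light, deformable object drifting across the lane"}

def rule_based_planner(detection: dict, semantics: dict) -> str:
    # The deterministic module keeps the final decision, so every outcome can be
    # traced back to an explicit, reviewable rule.
    if semantics["hazard_score"] > 0.5:
        return "emergency brake"
    if detection["distance_m"] < 20.0:
        return "slow down slightly"
    return "maintain speed"

detection = {"object_id": 7, "distance_m": 15.0}   # from the traditional perception stack
semantics = vlm_describe(image_crop=None)
print(rule_based_planner(detection, semantics), "-", semantics["explanation"])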

Beyond this gulf in interpretability, training cost is another reason automakers hesitate.

VLA requires vast amounts of paired "video plus control signal" data: eight-camera video as input, with the synchronized steering, accelerator, and brake signals as output. Such data is extremely scarce and expensive to produce.

VLM, by contrast, is essentially a multi-modal large model. It can be pre-trained on the abundant image-text pairs available at internet scale and then fine-tuned with driving-related data, so its data sources are broader and its cost relatively lower.
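
To make the data gap concrete (the field names below are invented for illustration), a single VLA training sample must pair synchronized multi-camera video with the control signals recorded at the same instants, whereas VLM pre-training can start from ordinary web-scale image captions and add driving data only at the fine-tuning stage.

# One illustrative VLA sample: scarce and expensive, because video and controls
# must be captured together and time-aligned frame by frame.
vla_sample = {
    "video": "8 synchronized camera streams, e.g. 2 s at 30 fps each",
    "controls": [
        {"t": 0.00, "steering": -0.02, "throttle": 0.15, "brake": 0.0},
        {"t": 0.03, "steering": -0.03, "throttle": 0.15, "brake": 0.0},
        # ... one record per frame
    ],
}

# One illustrative VLM pre-training sample: plain image-text pairs exist in
# vast quantities on the web; driving-specific data is only needed for fine-tuning.
vlm_sample = {
    "image": "a single photo",
    "text": "a cyclist waiting at a crosswalk in the rain",
}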

At present, VLM technology is relatively mature and easier to deploy, and most mainstream automakers and autonomous driving companies (including Waymo, Cruise, Huawei, and XPeng) are on the VLM route. The explorers of the VLA route are represented by Tesla, Geely, and Li Auto; Geely's Qianli Technology is reported to use a VLA large model in its Qianli Haohan H9 solution, which offers stronger reasoning and decision-making abilities and supports L3-level intelligent driving.

Looking back at these debates among the different schools of autonomous driving, none of them has ended with one side winning outright. Instead, the schools are absorbing one another through confrontation and converging toward a higher-level synthesis: lidar and vision are merging into multi-modal perception systems, the modular architecture is starting to absorb the strengths of the end-to-end model, and large models are injecting cognitive intelligence into every part of the system.