Panicking yet? Human drivers will soon have to face the "VLA challenge".
Recently, two blockbuster new cars have drawn a lot of attention in the auto market: the Li Auto i8 and the XPeng P7. Though positioned differently, both are stars in their respective segments.
Beneath their long feature lists, both cars highlight one core technology: VLA, the underlying logic of their intelligent assisted driving.
Li Auto says its assisted-driving R&D has moved from the manual era into the AI era: from 2021 to 2024 it relied on rule-based algorithms, and since 2024 its assisted driving has entered the AI era.
Now, through VLA, owners can even control the vehicle with voice commands, and the system learns and adapts to the user's driving style to deliver a human-like driving experience.
XPeng Motors also recently revealed that R&D on its in-vehicle VLA large model is progressing smoothly and is expected to be pushed to all models in August, ahead of schedule. The new XPeng P7 will no doubt ship with the VLA large model and become a safer "driver".
01 What is VLA?
VLA stands for "Vision-Language-Action". At its core, the model integrates visual perception, language understanding, and action decision-making into a single system.
Visual perception identifies the information collected by hardware such as cameras and lidar: road conditions, traffic signs, and the positions of other vehicles and pedestrians.
These data are fed into the visual processing module, where deep learning algorithms extract features from the images and transform them into a "language" the computer can understand.
In this way, it can recognize traffic lights, judge the speed and direction of nearby vehicles, and detect pedestrians on the roadside.
Language understanding, built on large-model training, interprets instructions, traffic rules, or high-level strategies. An intermediate fusion step then combines the visual and language models into a unified understanding of the environment.
If a passenger says "turn right at the next intersection", the VLA model first interprets this language information and then fuses it with the current visual perception.
Finally, action decision-making generates concrete control instructions, such as accelerating, decelerating, or steering, which are sent to the vehicle's actuation system. These instructions precisely control components such as the throttle, brakes, and steering wheel to drive the vehicle intelligently.
VLA thus closes the loop from image input to instruction output. Compared with the traditional division of labor among separate perception, planning, and control modules, all three are completed in one system, which improves adaptability across scenarios.
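The perceive-understand-act loop described above can be sketched in a few lines of Python. This is a toy illustration only: the class, method names, and the hard-coded scene values are all hypothetical stand-ins for what, in a real VLA system, would be a single neural network.

```python
from dataclasses import dataclass

@dataclass
class Action:
    throttle: float   # 0.0 to 1.0
    brake: float      # 0.0 to 1.0
    steering: float   # -1.0 (full left) to 1.0 (full right)

class VLAPipeline:
    """Toy sketch of the closed loop: sensor frame plus an optional voice
    command in, a control action out. A real VLA model fuses all three
    stages inside one network rather than calling separate functions."""

    def perceive(self, camera_frame, lidar_points):
        # Stand-in for the vision encoder: detect lights, vehicles, pedestrians.
        return {"traffic_light": "red", "pedestrians": 1, "lead_gap_m": 12.0}

    def understand(self, scene, command=None):
        # Stand-in for the language model: merge scene info with instructions.
        return {"scene": scene, "intent": command or "follow_route"}

    def decide(self, context):
        # Stand-in for the action head: map the fused context to controls.
        scene = context["scene"]
        if scene["traffic_light"] == "red" or scene["pedestrians"] > 0:
            return Action(throttle=0.0, brake=0.6, steering=0.0)
        return Action(throttle=0.3, brake=0.0, steering=0.0)

    def step(self, camera_frame, lidar_points, command=None):
        scene = self.perceive(camera_frame, lidar_points)
        context = self.understand(scene, command)
        return self.decide(context)

action = VLAPipeline().step(camera_frame=None, lidar_points=None,
                            command="turn right at the next intersection")
```

The point of the sketch is the data flow, not the logic: pixels become a scene description, the scene is fused with language, and the fused context becomes a control command.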
Before VLA, most systems adopted an "end-to-end + VLM" architecture, where VLM stands for Vision-Language Model. In intelligent driving, the VLM tries to understand traffic scenarios and parse semantics: it can identify complex contexts such as "tidal lanes" and "detours due to construction" and analyze "unprotected left turns", taking the system from "seeing clearly" to "understanding". The end-to-end network handles perception, decision-making, and execution, and the two components run relatively independently of each other.
VLA, by contrast, deeply combines the understanding from the VLM or other perception modules with the vehicle's steering, acceleration, and braking commands, completing the whole process from input to output directly.
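The architectural difference can be shown as two dataflows. Every function name below is hypothetical and exists only to illustrate the split: in "end-to-end + VLM" the language model produces a description that a separate module consumes, while in VLA one model maps input straight to controls.

```python
# Illustrative only: hypothetical names sketching the two dataflows.

def vlm_describe(frame):
    """In "end-to-end + VLM", the VLM only produces a semantic description."""
    return "construction ahead; detour via the tidal lane"

def planner_control(frame, description):
    """A separate end-to-end module turns perception (plus hints) into controls."""
    return {"steering": 0.15, "throttle": 0.2, "brake": 0.0}

def vla_control(frame, command):
    """VLA: a single model maps sensor input (and language) straight to controls."""
    return {"steering": 0.15, "throttle": 0.2, "brake": 0.0}

# Two loosely coupled stages versus one unified stage:
old_style = planner_control(frame=None, description=vlm_describe(frame=None))
new_style = vla_control(frame=None, command="take the detour")
```

Both paths end in the same kind of control output; what changes is whether understanding and acting happen in one model or two.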
02 What are the specific scenarios?
In complex road conditions, for example, the vehicle encounters many traffic participants at once: motor vehicles, pedestrians, and bicycles, along with constantly changing traffic lights and complex traffic signs.
After the VLA model "reads" this information through its cameras and radars, it quickly analyzes the scene. Suppose someone is crossing the road and the traffic light will turn red in 10 seconds: the system understands the situation, judges it, and decides accordingly. It may decelerate and stop at once, wait for the pedestrian to pass, and give up this chance to proceed; or it may steer around the pedestrian and pass through while the light is still green.
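The trade-off in that example can be written as a small decision rule. This is a hypothetical rule of thumb invented to mirror the scenario above, not how any production VLA model actually decides: always yield to a pedestrian, and otherwise proceed only if the remaining green time leaves a safety margin over the time needed to clear the intersection.

```python
def decide_at_crossing(pedestrian_crossing: bool,
                       seconds_until_red: float,
                       seconds_to_clear: float) -> str:
    """Hypothetical rule mirroring the article's example: yield to any
    pedestrian; otherwise proceed only with a time buffer to spare."""
    if pedestrian_crossing:
        return "stop_and_yield"
    if seconds_until_red > seconds_to_clear + 2.0:  # keep a 2-second buffer
        return "proceed"
    return "stop_and_yield"

print(decide_at_crossing(True, 10.0, 4.0))    # pedestrian present -> stop_and_yield
print(decide_at_crossing(False, 10.0, 4.0))   # clear and enough time -> proceed
```

A learned model replaces these hand-written thresholds with behavior inferred from data, which is exactly what distinguishes VLA from the rule-based era described earlier.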
This anthropomorphic thinking logic is precisely the greatest advantage of the VLA model. Its generalization ability in scenarios and context reasoning ability are stronger. In addition, after incorporating language understanding, VLA can flexibly adjust the driving strategy according to instructions to achieve a human-machine collaborative experience.
In summary, deploying VLA in vehicles brings several tangible improvements:
- Defensive driving: the vehicle automatically analyzes potential road risks to avoid accidents.
- Smooth driving: no obvious jerks during acceleration, deceleration, or overtaking.
- Three-point turns: in a narrow space, the vehicle completes a 180-degree turn through three direction changes (forward, backward, forward again), a maneuver end-to-end systems cannot achieve but VLA can.
- Continuous tasks: multiple driving instructions can be given to VLA in sequence, and the vehicle executes them one by one.
- Basement driving: when parking in the basement of a residential compound or shopping mall, the vehicle automatically recognizes the parking-lot signs and drives according to them.
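The "continuous tasks" behavior, several spoken instructions executed one by one, is essentially a command queue. The class and method names below are illustrative, not a real vehicle API; the sketch only shows the queue-and-execute-in-order idea.

```python
from collections import deque

class InstructionQueue:
    """Toy model of 'continuous tasks': spoken instructions are queued
    and the vehicle works through them one by one, in the order given."""

    def __init__(self):
        self._pending = deque()

    def add(self, instruction: str) -> None:
        self._pending.append(instruction)

    def run_all(self) -> list:
        executed = []
        while self._pending:
            # In a real vehicle this step would dispatch to the driving stack.
            executed.append(self._pending.popleft())
        return executed

q = InstructionQueue()
q.add("turn right at the next intersection")
q.add("find the entrance to the underground garage")
q.add("park in any empty spot")
order = q.run_all()
```

First-in, first-out ordering is what lets the user speak a whole plan up front and have the vehicle carry it out step by step.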
03 The psychological basis of VLA
Intelligent driving perceives through radar, lidar, and cameras, then performs image and semantic analysis to plan behavior, and finally issues instructions to the steering wheel, throttle, and so on.
This process seems complicated, but from a psychological perspective it is well-organized and closely mirrors how human beings understand the world and govern their behavior.
The most fundamental part of ordinary human psychology is information processing, which divides into perception, consciousness, thinking, and language; language itself is a behavior controlled by consciousness.
When humans understand the world, the first step is sensory input: what we see and hear is mapped to the brain. At this point you do not yet know what you are seeing; there is only an image on the retina. The perceptual system, drawing on knowledge and experience, then transforms it into distinct things such as "apples, bananas, pears".
Next comes the deeper processing of consciousness and thinking, which guides different decisions. Finally, the decision is expressed through physical actions by organs such as the hands and feet.
What links this entire functional system together is the neural network, especially the brain's. It transmits electrical signals almost instantaneously, so we cannot detect the individual steps and may even feel the whole thing happens in a single step.
This is what makes the human brain so powerful. When we see an intersection and a bustling crowd, we can judge almost subconsciously how to pass through; when we glimpse half a foot sticking out, we predict that someone may dash out.
VLA aims to achieve this same effect. Its underlying logic resembles how humans understand the world, and it offers a useful reference point for the development of intelligent driving technology.
04 Conclusion
The emergence of VLA marks intelligent driving's shift from stacking functions to integrating cognition. In a sense, it can "understand" the act of driving: it tries to reproduce how humans perceive the world inside a cold machine, integrating the "seeing" of vision, the "understanding" of language, and the "execution" of action into an organic whole.
This is not merely an efficiency gain but a shift of the intelligent driving experience toward the human-like. The boundary of human-machine collaboration will be redrawn; it is a leap from mechanical execution to cognitive intelligence.
Of course, VLA is not yet perfect. Chip computing power is currently one of its most critical bottlenecks: today's mainstream high-compute chips were not designed to run such large AI models. But this also means the breakthrough points are clear, and there is ample room for improvement as the technology advances.
This article is from the WeChat public account "Youjia", author: Ren Hongbin. Republished by 36Kr with permission.