He Xiaopeng: Fully autonomous driving will truly arrive in the next 1-3 years | Frontline Report
He Xiaopeng, Chairman of XPeng Motors
Text | Xiao Man
Editor | Li Qin
"Through our internal comparative evaluation, I believe we are nearly five times ahead of the industry's top players," said He Xiaopeng, Chairman and CEO of XPeng Motors, during the communication after the launch of the second-generation VLA.
Intelligent driving is evolving from "software-defined vehicles" to "AI-defined super intelligent agents." Amid this new wave, XPeng Motors has presented its radical plan for the future: skip the L3 stage, which forces compromises in hardware, software, and regulation, and use L2 and L4 directly as the core anchors for the evolution of intelligent driving.
In He Xiaopeng's view, the second-generation VLA has made it possible for XPeng to move directly from L2 to L4.
Like Tesla, XPeng is no longer making incremental fixes within its original intelligent driving framework. Instead, it treats autonomous driving entirely as a problem of implementing artificial general intelligence (AGI) in the physical world. Ahead of this strategic shift, XPeng had merged its intelligent cockpit center and autonomous driving center, concentrating AI resources into a unified platform to improve development efficiency.
XPeng's current idea is to build a world model that deeply integrates the intelligent cockpit and intelligent driving, so that the two are no longer isolated but merged into a "powerful super intelligent agent." In the next 1-3 years, it aims to achieve a leap from passive tool to active service.
The foundation for realizing this idea is to have the best base model and to solve the data problem. Liu Xianming, head of XPeng Motors' General Intelligence Center, believes that "building a good base model is a required course for any company pursuing L4. If you don't do this, you may fall behind in this technological transformation or be unable to complete it."
XPeng's technological transformation is decisive, and intelligent software upgrades have become the core focus of XPeng Motors' products. However, as 36Kr Auto has previously reported, XPeng is still an automaker whose core revenue comes from car sales. In a Chinese automobile market where competition is fierce and still intensifying, all companies, including XPeng, need to transform in response to changes in both the market environment and technology.
The following is a lightly edited transcript of the conversation between 36Kr Auto and He Xiaopeng, Chairman and CEO of XPeng Motors, and Liu Xianming, head of XPeng Motors' General Intelligence Center:
Question: Why does XPeng propose skipping L3, and why did it submit this proposal to the Two Sessions? Is the aim to have more advanced technology?
He Xiaopeng: I think that starting from L4, there will be a new responsible entity. In today's global technological development, basically the next step from L2 is L4. Adding an L3 stage in between actually poses challenges to hardware, software, and laws and regulations. So from my perspective, I think China should have L2 and L4.
Question: How many vehicles will the second-generation VLA be installed in? Can you give a rough estimated figure?
He Xiaopeng: All our Ultra and UltraSE models will be equipped with the second-generation VLA. You can understand that in the future, XPeng's models in the global market will offer two options: basic intelligent assisted driving and top-level intelligent assisted driving.
Question: To what extent can the second-generation VLA perform? Has it fully reached L4, or at what stage is it?
Liu Xianming: Currently, it's hard to say exactly what level it has reached. We are not claiming to have fully achieved 100% L4. However, VLA 2.0 as a whole is built on a very general and efficient architecture. As a result, a new version comes out essentially every day, constantly iterating to solve new problems, and the pace of progress is beyond what we imagined. Therefore, we are confident that we can achieve a relatively complete L4-level system in the future.
It's difficult to give a definite timeline. The "big brother" (He Xiaopeng) has given an estimate of 1-3 years. Our view is that if each day's iteration is faster than the previous day's, if the training speed and data scale curves keep accelerating upward, and if we can maintain this state, I believe it will be achieved soon.
Question: Why did you adjust the organizational structure by merging the intelligent cockpit and intelligent driving teams? This change also seems to be a trend among car companies. How is XPeng Motors' adjustment different from that of other car companies?
He Xiaopeng: The automotive industry is entering a new stage of cross-domain integration: autonomous driving is the vehicle's movement, the intelligent cockpit is the vehicle's brain, and combined with power and chassis, we believe that these four domains are all in the process of cross-domain integration.
In the future, for L4 or Robotaxi models, many manufacturers will shift from the original single-domain integration (such as integrating multiple suppliers in one domain or developing a single domain independently) to cross-domain integration. This can make the whole vehicle faster, safer, and more responsive, and multiply its capabilities, shifting it from passive use to active service. The General Intelligence Center that Xianming leads is therefore part of this cross-domain integration process.
This is also why I firmly believe that fully autonomous driving will be realized in 1-3 years, and all cars will become powerful super intelligent agents in 3-5 years.
Question: The second-generation VLA will achieve an end-to-end intelligent revolution and will be fully rolled out by the end of this month. How will this product strategy, with unified underlying technology and dual power options, define XPeng's approach in the high-end market over the next three years?
He Xiaopeng: In the next 1-3 years, the automotive industry will move from the software era into the AI era, from the independent development of software and hardware to cross-domain integration, and from the original simple intelligent new energy vehicles to high-level intelligent agents capable of active service. Since XPeng is conducting R&D in multiple fields simultaneously, you will see many cross-domain integration effects in the next 1-3 years.
This is also why I'm very excited, and why I think it's becoming increasingly difficult for traditional car manufacturers, including those from the fuel-vehicle era, to come up with good solutions. Cars will definitely change from the original passive production tools into products that can actively generate productivity. I think it's an epoch-making product, and it will be realized in about 3-5 years.
Question: You mentioned just now that the base model is the foundation for doing a good job in L4. From an industry perspective, currently, many Robotaxi players don't mention the base model much or choose other technical routes. Will the base model become the standard for Robotaxi companies to do their business well in the future?
Liu Xianming: There have been significant changes in the technological paradigm of L4, or autonomous driving in general. In the past, we saw Waymo and many other L4 companies. Their upper limits were very low, and all they could do was keep competing. This leads to another problem: the operational design domain (ODD) concept in L4. Where the vehicle can operate depends only on how many cars are deployed, how much data is collected, and how many maps are built. So if you really want to solve the problem in a general way, the technological paradigm must change. That is inevitable.
As we also mentioned at the press conference today, building a good base model is a required course for any company pursuing L4. If you don't do this, you may fall behind in this technological transformation and be unable to complete it.
Question: Regarding the overseas expansion of the second-generation VLA, you mentioned that currently, in the case in Sweden, cloud models are used for simulation training. When Tesla was developing FSD in China, it also promoted the process through methods such as online videos and simulation training. How can we avoid problems similar to Tesla's "acclimatization"?
Liu Xianming: First, even without any adaptation or training on overseas data, the second-generation VLA model already has strong capabilities, as can be seen from the video released by the "big brother" (He Xiaopeng) today. Second, XPeng is a global company. Subject to compliance, we will own and use local data wherever there are XPeng vehicles in the world. Third, for more general scenarios, the world model's generative approach also lets us quickly reach a baseline level of capability.
Therefore, XPeng's global autonomous driving strategy must combine these points. The model itself must have strong generalization ability; it cannot rely only on Chinese data and operate only in China, which is not feasible. That ability must then be combined with XPeng's global layout and our technological breakthroughs.
Question: If the world base model empowers diverse intelligent agents simultaneously, will there be bottlenecks in technology reuse in aspects such as multi-modal interaction and spatial perception? Can different forms of intelligent agents feed back to the base model and deepen the model's optimization?
Liu Xianming: The underlying reuse ability should be quite strong. The design of the entire VLA or the base model is natively multi-modal and is not specifically targeted at autonomous driving, so it can be reused. We are still continuously exploring the specific reuse situation, and currently, we can't give a very clear conclusion. At present, the primary task is to complete the whole process on the vehicle first, and then promote the linkage between the cockpit and driving.
Question: Since autonomous driving moved from end-to-end into the model paradigm, everyone has been using human data for imitation learning. Today, Xianming also shared a case of large-scale reinforcement learning with the world model in simulation. But since last year, many people have been saying that human data is of little value. What's your view?
He Xiaopeng: I think the amount of data in the physical world and the human world is currently infinite.
Previously, I thought that having 100,000 or 1 million cars running a certain number of kilometers would be enough, but now I think it's far from enough. Many people say that they have a fleet or a company, and selling more cars means having a lot of data. These are all wrong. I think it's very difficult to collect high-quality, valuable, and ultra-large-scale data. Whether it's for cars or robots, there's still a long way to go in this regard. This is my view.
Question: Is reinforcement learning (RL) really a panacea that can solve all problems? Are there things it's not good at?
Liu Xianming: Reinforcement learning is not a panacea. Currently, both academia and industry are saying that reinforcement learning is very powerful, but it definitely requires a very strong base model, one that can at least sample a feasible solution to the problem. If the model doesn't even have this ability, reinforcement learning can't keep improving it.
However, reinforcement learning is a highly efficient learning method that can solve problems in a targeted way and continuously explore long-tail problems. So I think we shouldn't regard reinforcement learning as a universal solution to everything, but rather as a very efficient learning method.
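The point about needing a base model that "can sample a feasible solution" can be illustrated with a toy policy-gradient calculation. This is a minimal sketch under stated assumptions, not XPeng's training setup: the three-action softmax policy, the sparse 0/1 reward, and the names `softmax` and `reinforce_gradient` are all hypothetical. It shows that when none of the sampled actions earns a reward, the REINFORCE gradient is zero, so the policy has nothing to learn from.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, actions, rewards):
    """Average REINFORCE gradient with respect to the logits.

    For a softmax policy, the gradient of log pi(a) with respect to
    the logits is onehot(a) - probs, weighted here by the reward.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, r in zip(actions, rewards):
        for i in range(len(logits)):
            indicator = 1.0 if i == a else 0.0
            grad[i] += r * (indicator - probs[i])
    return [g / len(actions) for g in grad]


# Hypothetical toy task: only action 2 is a "feasible solution" (reward 1).
logits = [0.0, 0.0, 0.0]

# The policy never sampled action 2, so every reward is 0.
no_success = reinforce_gradient(logits, actions=[0, 1, 0, 1], rewards=[0, 0, 0, 0])

# One sample happened to hit action 2, so there is a learning signal.
one_success = reinforce_gradient(logits, actions=[0, 1, 2, 1], rewards=[0, 0, 1, 0])

print(no_success)
print(one_success)
```

In the second batch the gradient raises the logit of the rewarded action and lowers the others, which is the sense in which reinforcement learning can only amplify behavior the base model is already able to produce at least occasionally.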
Question: The publicity around the computing power arms race is intensifying. Competitors are frantically adding computing power, but many users find in practice that, although the computing power has increased significantly, the experience has not improved as much as the numbers suggest. Where does the problem probably lie?
Liu Xianming: Computing power is not just about having a good-looking nominal number. More importantly, it's about using the computing power well, which is the core problem. This is also why we are transitioning from general-purpose processors to dedicated processors (ASICs). In fact, if you look at NVIDIA, it did the same during the GPU and CUDA era: using computing power well is more valuable than simply saying that it has increased several times. So the computing power needs not only to be large but, above all, to be used well.
In addition, large computing power definitely requires input with higher information density and a larger model to match; otherwise, the computing power will sit idle. Taken together, these factors mean that if you only engage in a computing power arms race and simply push up the number, consumers won't feel a significant improvement in the experience.
Question: In industry practice over the past 2-3 years, there have been two main ways for the model to decide how to generate trajectories. In the first, the large model directly outputs the final trajectory; in the second, it outputs several different trajectories and the system then chooses one. Does XPeng's second-generation VLA use the former or the latter? In your opinion, what are the advantages and disadvantages of the two approaches, and which is more in line with future trends?
Liu Xianming: The core of the first question is whether you are doing autonomous driving or AI. Once you answer that, the answer is actually clear. What we are building is an AI, not something specific to autonomous driving. So we build it the way an AI model is built.
Since we have made such a big change, we won't carry over much of the previous logic, meaning the heuristics (methods based on empirical rules), that is, the many rules or methods used to patch current problems. This is also the most important key to making the data and the model scale continuously (continuously improving the model's ability by increasing the data volume, model parameter count, and computing power), and we should add as few other things as possible.
This may sound a bit too simple and direct, but the development of AI as a whole over the past few years has taught us one thing: how to scale and how to iterate quickly in order to solve problems quickly. The core question is whether you are doing intelligent driving or AI.
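For readers unfamiliar with the two approaches the question contrasts, the difference can be sketched in a few lines. This is only an illustration of the general idea, not XPeng's system: `toy_model`, `toy_proposer`, `toy_cost`, the scene dictionary, and the cost weights are all hypothetical stand-ins. The hand-written cost function in the second approach is an example of the kind of heuristic that the answer above says should be kept to a minimum.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]
Scene = Dict[str, Point]


def direct_trajectory(model: Callable[[Scene], Trajectory], scene: Scene) -> Trajectory:
    """Approach 1: the model outputs the final trajectory directly."""
    return model(scene)


def propose_and_select(
    proposer: Callable[[Scene], List[Trajectory]],
    cost: Callable[[Trajectory, Scene], float],
    scene: Scene,
) -> Trajectory:
    """Approach 2: generate several candidates, then pick the cheapest."""
    return min(proposer(scene), key=lambda traj: cost(traj, scene))


def toy_model(scene: Scene) -> Trajectory:
    """Hypothetical stand-in for a learned model: drive straight ahead."""
    return [(float(i), 0.0) for i in range(1, 4)]


def toy_proposer(scene: Scene) -> List[Trajectory]:
    """Hypothetical candidates: veer right, go straight, veer left."""
    return [
        [(float(i), offset * i / 3) for i in range(1, 4)]
        for offset in (-1.0, 0.0, 1.0)
    ]


def toy_cost(traj: Trajectory, scene: Scene) -> float:
    """Hand-written rule: stay far from the obstacle, avoid lateral drift."""
    ox, oy = scene["obstacle"]
    min_dist = min(math.hypot(x - ox, y - oy) for x, y in traj)
    drift = abs(traj[-1][1])
    return 1.0 / (min_dist + 1e-6) + 0.1 * drift


scene = {"obstacle": (3.0, 0.5)}
direct = direct_trajectory(toy_model, scene)
selected = propose_and_select(toy_proposer, toy_cost, scene)
print(direct)
print(selected)
```

In this toy scene the selector picks the candidate that swerves away from the obstacle, while the direct path depends entirely on what the model itself produces. The trade-off is that the second approach adds a hand-designed scoring stage that has to be maintained separately from the learned model.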
Question: During model training, are there trade-offs among the three keywords "safety, scenario, and efficiency"? Is there a clear priority order? And by your current observation, are there only two companies globally that have switched to a native multi-modal large model of the physical world?
Liu Xianming: If you have done machine learning or AI, you know the precision-recall (PR) curve. When the curve is relatively flat, all you can do is trade off among safety, efficiency, and scenario. In essence, there is no real choice.
In autonomous driving, we often ask: What kind of autonomous driving do we really want?
The core goal of autonomous driving is safety. But that doesn't mean safety can override everything else. No one wants a vehicle that is slow, inefficient, or won't move just for the sake of safety. To resolve this contradiction, the key is to improve the underlying capability: only then can we achieve a higher level of safety without sacrificing other dimensions.
What we call the "generational gap" is not just the gap in a single indicator. More importantly, it's about whether the entire way of doing things has been switched and whether there has been a qualitative change in the iteration speed. What we are currently pursuing is not only to run fast but also to have a continuously increasing acceleration because we are building an underlying general ability system. This is the real generational gap, rather than just leading in a single indicator.
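The PR-curve argument in the answer above can be made concrete with a small calculation. This is a minimal sketch with made-up data: the scores, labels, thresholds, and the function name `precision_recall` are hypothetical, and reading safety and efficiency into precision and recall is a loose analogy for illustration only. With a fixed model, moving the decision threshold only trades one metric for the other; a more capable model improves both at once.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up ground truth: three positives followed by three negatives.
labels = [1, 1, 1, 0, 0, 0]

# A weaker model ranks some negatives above some positives.
weak_scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.2]

# A stronger model separates the two classes cleanly.
strong_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

weak_strict = precision_recall(weak_scores, labels, threshold=0.8)
weak_loose = precision_recall(weak_scores, labels, threshold=0.3)
strong = precision_recall(strong_scores, labels, threshold=0.5)

print(weak_strict)  # high precision, low recall
print(weak_loose)   # lower precision, full recall
print(strong)       # both high at the same time
```

For the weaker model, no threshold achieves full precision and full recall together, which mirrors the "no real choice" remark; the stronger model reaches both, which mirrors the claim that improving the underlying capability removes the need for the trade-off.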