Liu Fang, former Xiaomi Intelligent Driving expert: If VLA is successfully implemented, autonomous driving will become a sub-problem of embodied intelligence | Exclusive interview by 36Kr
"VLA is a large driver model that works like a human driver." On the evening of May 7th, Li Xiang, the CEO of Li Auto, said in an AI Talk.
This is the latest technological direction to emerge in the intelligent driving industry after the "end-to-end" approach.
The VLA (Vision-Language-Action) model was first introduced by Google DeepMind, initially for robotics. It has since become the mainstream technical paradigm and framework in embodied intelligence, with companies such as OpenAI and ByteDance following the same path.
Unlike vision-language models (VLMs) and generative models such as ChatGPT and Sora, which deal in text, images, and video, VLA adds on top of them the ability to interact with the physical world through "actions."
In other words, VLA not only understands its surroundings but can also directly output control instructions, such as robot movements or vehicle driving decisions. This has pushed the two hot fields of intelligent driving and embodied intelligence into a deeper intersection.
However, the technical implementation and engineering deployment of VLA are still at an early stage, and the fog around this innovation must be cleared through practice. For that reason, 36Kr Auto recently spoke with Amio Robotics, an embodied-robotics startup founded on an intelligent-driving technology pedigree, hoping to offer the industry more points of reference.
Amio Robotics was founded in September 2024 by Liu Fang, formerly head of intelligent driving technology and product at Xiaomi Auto. In March this year, Amio closed a seed round whose investors include Anker Innovations, Zhipu AI, and Matrix Partners China.
Liu Fang was there for the entire arc of Xiaomi's autonomous driving effort, from building the team through R&D to mass production. Before that, he worked in Google China's search division.
Liu Fang told 36Kr Auto that the emergence of generative large models like GPT in 2023 had a profound impact on practitioners. First, more data can let a larger model give rise to intelligence. Second, a large model has already absorbed a great deal of human knowledge, so when it learns a new skill it need not rely on imitation alone; it can build an understanding of the task from the knowledge behind the imitation data.
He offered an analogy: previously, every specific AI scenario was its own mountain to climb, with difficulties at every turn. After GPT, there is no mountain left to climb. It is like a boat on the sea: as the sea level rises, the mountain is submerged.
VLA is a large model that can give physical hardware real intelligence. Liu Fang believes that if the VLA path succeeds, autonomous driving will become a low-dimensional matter, a sub-problem within the larger problem of embodied intelligence.
In recent years, Liu Fang said, intelligent driving has essentially been imitation learning: instead of hand-written rules, the system learns driving patterns directly from vast amounts of data. But that brings its own challenge, because imitation learning cannot handle cases outside the existing data.
The implementation of new technologies such as VLA and reinforcement learning is bringing new ideas.
For example, the VLM (vision-language model) inside VLA already has the ability to recognize the world. "The VLM's performance determines more than half of the VLA's performance. Most of the work on VLA is really enhancement on top of the VLM," Liu Fang said.
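To make this division of labor concrete, the sketch below treats a VLA as a VLM backbone with an action head attached. It is purely illustrative: every module, name, and dimension is invented, and simple linear layers stand in for a real pretrained VLM.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative VLA: a stand-in VLM backbone plus a small action head.

    The encoders below are placeholders for a real pretrained VLM; the
    action head is the part VLA adds, mapping fused vision-language
    features to control outputs (e.g. steering and acceleration for
    driving, or joint targets for a robot arm).
    """

    def __init__(self, embed_dim: int = 512, action_dim: int = 2):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 224 * 224, embed_dim)  # toy stand-in
        self.text_encoder = nn.Embedding(10000, embed_dim)         # toy stand-in
        self.action_head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        vis = self.vision_encoder(image.flatten(1))   # (B, D)
        txt = self.text_encoder(tokens).mean(dim=1)   # (B, D)
        fused = vis + txt                             # trivial fusion for illustration
        return self.action_head(fused)                # (B, action_dim)

model = ToyVLA()
actions = model(torch.randn(1, 3, 224, 224), torch.randint(0, 10000, (1, 8)))
print(actions.shape)  # torch.Size([1, 2])
```

In this framing, "most of the work is enhancement on top of the VLM" means the backbone carries the intelligence and the added head is comparatively small.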
Beyond describing images and perceiving distances, the crucial step for VLA is the final action stage. "It's like buying furniture and assembling it yourself: first you read the instructions and study the examples, but whether you can do it well still depends on actually doing it."
That final stage, Liu Fang said, is essentially a process of trial and error, and robots likewise do reinforcement learning during this final hands-on stage.
Reinforcement learning is a training strategy built on a reward mechanism: if the intelligent driving system makes the right decision it gets a "reward"; if it performs badly it gets a "punishment."
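In code, that reward-and-punishment idea reduces to a scoring function over outcomes. Below is a toy example for a driving decision; every weight and input is invented for illustration, not a production reward design.

```python
def driving_reward(collision: bool, off_route_m: float, jerk: float) -> float:
    """Toy reward: crashes are punished hard, then route-keeping and
    comfort are scored. All weights here are made up."""
    if collision:
        return -100.0                  # severe punishment for a crash
    reward = 1.0                       # base reward for safe progress
    reward -= 0.5 * off_route_m        # penalize drifting off the route
    reward -= 0.1 * jerk               # penalize uncomfortable control
    return reward

print(f"{driving_reward(False, 0.4, 1.2):.2f}")  # 0.68
```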
However, Liu Fang said the biggest difference between reinforcement learning in autonomous driving and in robotics is competitive, game-like interaction. "A robot doesn't have to fight the robot next to it for a cup, but reinforcement learning for autonomous driving also has to model how opponents react."
Ultimately this may call for a good world-model simulator. But in practice, a perfect world model will not simply appear to run the simulation. "All we can do is let the system do reinforcement learning and improve within the part of the world the model simulates well, then find the simulated reactions that don't match reality, add data to make the world model better, and iterate step by step. I believe the world model and the driving model must be iterated together."
Liu Fang also said that VLA is still at a stage of exploratory confusion: companies' implementation paths differ, and the field has yet to converge.
Building on the VLA model, Amio Robotics is charting its own path. Liu Fang told 36Kr Auto that the company currently focuses on robots for flexible manufacturing in the 3C consumer electronics field.
He told 36Kr Auto that many electronic products have short life cycles and small production volumes, while automated production lines are costly to deploy and take at least two months to set up. "But a production line may only need to run for three to four months to meet demand. On cost alone, automating 3C consumer electronics lines doesn't pay off."
Built on the VLA model, Liu Fang said, previously dedicated robots can be turned into general-purpose ones whose learning ability and adaptability quickly approach human levels.
For example, a set of robot hardware and software can be installed at a fixed factory workstation to replace human labor across three shifts. Even when a 3C production line is flexibly reconfigured, the general-purpose robot can switch seamlessly between similar general tasks.
Amio Robotics has set up a joint laboratory with Peking University, where the two sides collaborate on the VLA base model. For model training, Amio can also pre-train with support from its investor Zhipu AI, and it has already begun collecting data on factory floors.
On business progress, Liu Fang said a full general-purpose robot production line will go live in the third and fourth quarters of this year. Beyond consumer electronics, Amio Robotics plans to expand into service scenarios such as household cleaning and tidying.
What follows is 36Kr Auto's conversation with Liu Fang, founder of Amio Robotics, edited for clarity:
36Kr Auto: Have you considered building robot production lines for the automotive sector?
Liu Fang: The labor intensity and headcount demands of the automotive industry are indeed greater, but at bottom there is no demand for generality. It is a good scenario for equipment intelligence, not for embodied intelligence.
An automotive production line runs for seven to nine years, five at minimum. If dedicated equipment solves the problem better, why not use it? Dedicated equipment is cheaper than general-purpose equipment, so there is no need for the latter.
36Kr Auto: Do you build the robotic arms yourselves, or source them from external suppliers?
Liu Fang: Right now grippers can handle more than 80% of tasks. Many industries don't need dexterous hands. First, the cost of dexterous hands is prohibitive. Second, on life cycle, customers require three-shift operation for a year, which implies a lifespan of at least 7,000 hours. Our current lifespan requirement is 8,000-10,000 hours, and grippers can meet it.
36Kr Auto: Margins in robot contract manufacturing are not high. How do you work out the business model?
Liu Fang: First, contract manufacturing replaces human labor, so we calculate how much money it saves the customer. Second, the machines' throughput has to keep up. On labor cost, one worker costs 100,000 yuan a year.
A robot's cost has two parts: the fixed asset of the physical robot, and the algorithm model. The upfront investment in the model is large, but it amortizes as the robot operates. A robot can work three shifts, so one workstation replaces the cost of three workers.
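The arithmetic behind that claim, using the figures cited in the interview; the robot's cost is not given, so the figure below is a hypothetical placeholder purely to show how a payback period would be computed.

```python
# Figures from the interview: one worker costs 100,000 yuan/year, and one
# robot workstation runs three shifts, replacing three workers.
worker_cost_per_year = 100_000        # yuan, cited in the interview
workers_replaced = 3                  # three-shift operation
annual_saving = workers_replaced * worker_cost_per_year

# The robot's all-in cost is NOT given in the interview; this number is a
# placeholder purely to illustrate the payback calculation.
assumed_robot_cost = 450_000          # yuan, hypothetical

payback_years = assumed_robot_cost / annual_saving
print(f"Annual saving: {annual_saving} yuan")        # 300000 yuan
print(f"Payback period: {payback_years:.1f} years")  # 1.5 years
```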
36Kr Auto: Why did you choose to start a business in the field of consumer electronics production line robots instead of autonomous driving?
Liu Fang: My first job was at Google, where I worked on language models. The release of GPT-3.5 in early 2023 had a huge impact on me.
First, a large amount of data can let a larger model give rise to intelligence. Pour enough data into what is essentially an imitation-learning framework and it can produce intelligence even for cases it has never seen.
Second, the large model has already absorbed a great deal of human knowledge. When learning a new skill, it doesn't have to rely on imitation alone; it can build an understanding from the knowledge behind the imitation data. That is closer to AGI, and it is exactly what embodied-intelligence VLA is doing. If this path succeeds, autonomous driving becomes a low-dimensional matter, just a sub-problem of a larger one.
36Kr Auto: So, is the implementation of VLA in autonomous driving a fairly certain thing?
Liu Fang: Landing VLA in robotics is relatively certain. Lei Jun (CEO of Xiaomi) has always said to tackle low-dimensional problems with high-dimensional methods. Seen from that higher dimension, if robots are done well they can drive too, and autonomous driving becomes a natural outcome.
36Kr Auto: What problems in the autonomous driving industry can VLA solve?
Liu Fang: Two problems. First, the volume of data is enormous; it was never possible to cover every case by writing rules. So the industry turned to imitation learning, dropping rules and learning directly from data, which is far more efficient. That is what Tesla talked about last year. But a problem remains: imitation learning cannot handle cases outside the data, and that is exactly where VLA helps most.
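For reference, the imitation learning described here is, at its core, behavior cloning: supervised regression from logged observations to the actions human drivers took. A minimal sketch on synthetic stand-in data (all shapes invented):

```python
import torch
import torch.nn as nn

# Behavior cloning: fit a policy to (observation, action) pairs by
# supervised regression, with no hand-written rules anywhere.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs = torch.randn(256, 64)            # stand-in for logged sensor features
expert_actions = torch.randn(256, 2)  # stand-in for logged human controls

for step in range(100):
    loss = nn.functional.mse_loss(policy(obs), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The limitation Liu Fang points to: the policy only ever sees this data
# distribution, so it has no basis for handling cases outside it.
print(f"final imitation loss: {loss.item():.4f}")
```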
36Kr Auto: Can reinforcement learning solve the problem?
Liu Fang: Our approach is to run reinforcement learning on top of VLA. Reinforcement learning resembles how humans learn. People first acquire basic abilities, such as describing a picture and judging distances. Then they have to know how to act, and that takes hands-on practice. Assembling furniture you've bought, for instance: you read the instructions and study the examples first, but doing it well still depends on actually doing it. That step is essentially trial and error. We apply reinforcement learning only at this final hands-on stage.
Robots have no large-scale simulation environment for interaction, so they can only experiment in the real world, and the time and number of trials a robot can run directly are limited; a robot therefore cannot start reinforcement learning from scratch. VLA's general learning logic and direction are correct, and for whatever still goes wrong in between, we rely on reinforcement learning at the end. We call this residual reinforcement learning: it learns the deviation between the VLA model and the actual environment rather than applying reinforcement learning to everything.
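Structurally, residual reinforcement learning can be pictured as a frozen base policy plus a small trainable correction, so that RL only has to learn the deviation. The sketch below is schematic: an ordinary linear layer stands in for the VLA base, and every name, shape, and scale factor is invented.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Final action = frozen base action + small learned correction.

    Only the residual head would be trained (by RL); the base supplies a
    sensible starting behavior, so RL never starts from scratch.
    """

    def __init__(self, base_policy: nn.Module, obs_dim: int, action_dim: int):
        super().__init__()
        self.base = base_policy
        for p in self.base.parameters():
            p.requires_grad = False               # the VLA base stays frozen
        self.residual = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, action_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            base_action = self.base(obs)
        # The residual only learns the deviation between the base model's
        # behavior and what the real environment demands. The 0.1 scale
        # (invented) keeps the correction small relative to the base.
        return base_action + 0.1 * self.residual(obs)

base = nn.Linear(16, 4)                           # stand-in for a VLA policy
policy = ResidualPolicy(base, obs_dim=16, action_dim=4)
print(policy(torch.randn(2, 16)).shape)           # torch.Size([2, 4])
```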
36Kr Auto: Is it difficult to use reinforcement learning in autonomous driving?
Liu Fang: It's actually harder. The biggest difference between autonomous driving and robotics is competitive, game-like interaction. A robot doesn't have to fight the robot next to it for a cup, but reinforcement learning for autonomous driving also has to model how opponents react.
You either collect data in the real environment, where some data is very hard to gather, or you generate opponents' reactions in simulation, where the generated data may not cover the distribution training requires. When the exploration space is not large enough, reinforcement learning produces no real results.
36Kr Auto: So how do you solve it? Is a world model useful?
Liu Fang: If the simulation is strong enough and the world model imitates the reactions of different objects well, then there is no out-of-distribution problem (OOD: a model trained on one data distribution may degrade when it meets data drawn from a different one).
That is a logical paradox I haven't fully resolved; it is probably a step-by-step iterative process. A perfect world model will not simply appear to run the simulation. All we can do is let the system do reinforcement learning and improve within the part of the world the model simulates well, then find the simulated reactions that don't match reality, add data to make the world model better, and iterate step by step. I believe the world model and the driving model must be iterated together.
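Written as a loop, the iteration he describes has three steps per round. The sketch below is schematic: the two stub classes stand in for a real world model and driving model, purely to show the cycle, and every name is invented.

```python
class StubWorldModel:
    """Placeholder world model with a crude simulator and predictor."""
    def simulate(self, driver):                # produce fake rollouts
        return [("state", driver.act("state"))]
    def predict(self, log):                    # crude outcome prediction
        return log["outcome"] if log.get("in_dist") else "wrong"
    def train_on(self, mismatches):            # absorb the correction data
        for log in mismatches:
            log["in_dist"] = True

class StubDriver:
    def act(self, state):
        return "steer"
    def reinforce(self, rollouts):             # stand-in for an RL update
        pass

def co_iterate(world_model, driver, real_logs, rounds=3):
    for _ in range(rounds):
        # 1. RL inside the (imperfect) world model's simulation.
        driver.reinforce(world_model.simulate(driver))
        # 2. Find simulated reactions that don't match reality.
        mismatches = [log for log in real_logs
                      if world_model.predict(log) != log["outcome"]]
        # 3. Feed those cases back so the world model itself improves,
        #    then repeat: the two models are iterated together.
        world_model.train_on(mismatches)
    return world_model, driver

logs = [{"outcome": "brake", "in_dist": False}]
co_iterate(StubWorldModel(), StubDriver(), logs)
print(logs)  # the mismatch has been folded back in: in_dist is now True
```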
36Kr Auto: Recently, Li Auto said that VLA has entered uncharted territory. Do you agree?
Liu Fang: Innovation is never easy. VLA is indeed still in a state of confusion; different people take different paths and implementation approaches. Pi Robotics' solution, for example, differs from Facebook's, Google's, ByteDance's, and ours.
They are all VLA, but they differ in many details, in algorithm design, and in how they use data. VLA for robots has not converged because no one has yet built a 100% reliable product.
That is unlike autonomous driving, where Tesla set a benchmark and commercialized it. Nothing of the sort has happened yet in robotics, and that is precisely the opportunity for entrepreneurs.
36Kr Auto: Is this related to robots' wide variety of deployment scenarios?
Liu Fang: Since VLA has not converged, there is no one-size-fits-all experience. Ours is that the VLM's performance determines more than half of the VLA's; most of the work on VLA is enhancement on top of the VLM.
At the same time, VLMs' spatial ability and spatial semantic understanding are very poor: a VLM doesn't know where the things in an image sit in 3D space, nor the 3D relationship between two objects. We hope to strengthen the VLM's perception through 3D enhancement.
Then we need to add back the ability to act, which we solve with a generative model. When we built language models in the past, many intermediate steps were required; GPT-3.5 showed you can simply generate directly, without those steps. This echoes the physicist Feynman's view: "What I cannot create, I do not understand."
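As a toy rendering of that recipe (augment the VLM's features with explicit 3D cues, then decode actions directly with a generative head), consider the sketch below. Every module name and dimension, and the GRU decoder choice, are invented assumptions, not Amio's actual design.

```python
import torch
import torch.nn as nn

class ThreeDEnhancedVLA(nn.Module):
    """Illustrative pipeline: VLM features + 3D cues -> generative action head.

    Mirrors the recipe from the interview at toy scale: patch the VLM's
    weak spatial understanding with explicit 3D features, then generate
    an action sequence directly, with no hand-built intermediate steps.
    """

    def __init__(self, dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.vlm_features = nn.Linear(512, dim)    # stand-in for a VLM embedding
        self.depth_features = nn.Linear(128, dim)  # stand-in for 3D/depth cues
        self.fuse = nn.Linear(2 * dim, dim)
        # "Generate directly": one recurrent decoder from fused features
        # straight to an action sequence.
        self.action_decoder = nn.GRU(dim, action_dim, batch_first=True)

    def forward(self, vlm_emb, depth_emb, horizon: int = 8):
        fused = torch.tanh(self.fuse(torch.cat(
            [self.vlm_features(vlm_emb), self.depth_features(depth_emb)], dim=-1)))
        # Unroll the fused context into a short action sequence.
        steps = fused.unsqueeze(1).repeat(1, horizon, 1)
        actions, _ = self.action_decoder(steps)
        return actions                              # (B, horizon, action_dim)

model = ThreeDEnhancedVLA()
out = model(torch.randn(2, 512), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 8, 7])
```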
36Kr Auto: Has the underlying technology of VLA changed at all? It still sits within the Transformer paradigm.
Liu Fang: Not obviously in the short term. But autoregressive learning and generative models have been advancing quickly lately, which could substantially improve model performance.
36Kr Auto: What do you think the terminal in the AGI era will be?
Liu Fang: I think functional products will be more direct and intuitive. Robots that can get tasks done are what I want to build. I don't really understand emotional-companionship, game, or toy products, and I can only do things I understand.