Zhuoyu's Yu Beibei: Transitioning to Physical AI Is an Inevitable Choice for Survival

Algorithm vendors are entering a knockout stage in a brand-new dimension.

Text by | Xiao Man

Edited by | Li Qin

In today's intelligent vehicle industry, "physical AI" has become a buzzword. The vast majority of intelligent driving algorithm vendors are pivoting toward it.

Zhuoyu also released a native multi-modal foundation model for mobile physical AI at the Beijing Auto Show. In the view of Yu Beibei, vice president of Zhuoyu Technology, the shift of algorithm vendors to physical AI is not a story invented to please the capital market but a matter of survival for these companies.

"If you don't adopt this technological route, it's very likely that you won't succeed in the future," Yu Beibei said.

In this new competitive dimension, algorithm vendors no longer face only their former peers; they also face giants crossing over from digital AI, as well as embodied intelligence companies.

This brand-new competition has pushed algorithm vendors into an elimination round in a new dimension. For the players that truly succeed, the business opportunity will also expand.

Building on its mobile foundation model, Zhuoyu has begun trying to move beyond the traditional Tier 1 supplier logic of "selling hardware and charging development fees". For its second growth curve, it is extending passenger car technology to L4 fields such as Robotaxi and RoboVan and exploring new business models based on subscriptions, profit-sharing, and "Action Tokens".

Recently, 36Kr Auto spoke with Yu Beibei, vice president of Zhuoyu Technology, about the underlying logic of physical AI, its commercial prospects, and how Zhuoyu plans to build a moat in the coming elimination round.

The following is an edited transcript of the conversation between 36Kr Auto and Yu Beibei:

36Kr: Could you introduce the native multi-modal foundation model in detail?

Yu Beibei: The concept of native multi-modality can be traced back to when we started working on VLA 1.0 last year. At that time, our approach was closer to a model that aligned vision and action, with a large language model attached afterward. That caused many problems, such as limited language and semantic understanding and response latency.

We believe that translating all information into a language space and then trying to understand the physical world through the results of that translation runs counter to common sense.

The truly reasonable path is to treat vision, audio, and action each as a modality, and rules or inference as a modality as well. All of these should be incorporated during pre-training, so that the model natively understands the physical world in a shared multi-modal space.

36Kr: Have you removed the language modality now?

Yu Beibei: Currently, our in-vehicle model has not yet opened a language input channel. This is similar to the VLA 2.0 released by XPeng. We are working in a similar direction, and both of us are switching to this paradigm. The underlying backbone network has changed.

36Kr: Has Zhuoyu also entered the VLA 2.0 stage?

Yu Beibei: Yes. The industry is at the turning point of a paradigm shift. The choice before us is: should we continue along the previous paradigm of building small expert models, or should we decisively switch to the large-model paradigm?

We are quite optimistic about the large-model paradigm. In the context of mobile physical AI, if we want mobile capabilities to work across many kinds of vehicles, we are essentially talking about large-scale application.

The history of large language models is instructive: when vision-language models were being built, some people built expert models, while others built general models, the so-called foundation models.

Looking back now, it was the group that built foundation models that ultimately succeeded. The expert models that focused on specific tasks did not really succeed. We believe physical AI will evolve the same way, so we will firmly follow the foundation model paradigm.

36Kr: Many manufacturers are doing this, but so far no one has truly trained a single model that can be used across different hardware carriers. Essentially, everyone is still solving vehicle problems.

Yu Beibei: This is a phased process. By 2025, most players had switched to data-driven methods, which means that the basic capabilities of the model have reached about 70 points. Raising that to 90 points still requires post-training, data collection, and generalization to close the remaining 20-point gap. But the gap has narrowed: previously the climb was from 40 points to 80, and now it is from 70 to 90.

As the basic capabilities of the model improve further, our goal is definitely zero-shot generalization, the so-called "out-of-the-box" usage.

If the model can reach 95 points out of the box, then subsequent post-training, generalization, and city-expansion work can almost be ignored. We have not reached 95 points out of the box yet, but we have reached 70.

36Kr: At this stage, has Zhuoyu integrated various scenarios into a single model and run it in practice? Do you think it can be mass-produced and generalized across fields, or is it still at a relatively early stage?

Yu Beibei: At this point, it is far from being ready "out of the box". There is currently no industry consensus on what the ultimate paradigm of physical AI is or what kind of architecture can truly understand the physical world.

36Kr: What do you think of the phenomenon that most solution providers are transitioning to the physical AI direction? Is this just a more imaginative story for the capital market?

Yu Beibei: We believe that this is no longer just a business or strategic choice. Ultimately, it will rise to the level of a survival rule. If you don't adopt this technological route, it's very likely that you won't succeed in the future.

This is similar to the eve of the large language model boom. There used to be many expert models for specific tasks, but once general large models emerged, they replaced them all, and those earlier models did not succeed.

36Kr: When building a general model under this paradigm, is there still a shortage of pre-training data from other scenarios or conditions?

Yu Beibei: When training our own foundation model, 30% of the data comes from real - world data collected by vehicles, 30% from robots, and the other 40% from the Internet.

For data on mobile capabilities, all we need from the Internet is first-person-perspective video recorded while moving. It does not have to come from passenger cars or commercial vehicles; it can also be video taken while a person is walking. This type of data exists at large scale and is relatively easy to obtain.
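The 30/30/40 split described above can be pictured as a weighted sampler over data sources. The following is a minimal Python sketch, not Zhuoyu's pipeline: the source labels, the `sample_sources` and `empirical_mix` helpers, and the assumption that the shares are measured by sample count are all illustrative.

```python
import random
from collections import Counter

# Mixture shares quoted in the interview. The labels are our own, and we
# assume (the interview does not say) that shares are by sample count.
DATA_MIX = {"vehicle": 0.30, "robot": 0.30, "internet": 0.40}


def sample_sources(n, mix=DATA_MIX, seed=0):
    """Draw n source labels at random according to the mixture weights."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def empirical_mix(labels):
    """Return the observed share of each source in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    labels = sample_sources(10_000)
    print(empirical_mix(labels))
```

With a large enough draw, the observed shares should land close to the configured 30/30/40 mix.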

Many companies claim to be working on mobile physical AI. While model capabilities are one aspect, more importantly, embodied intelligence must be deployed on a specific piece of hardware, and its distribution process is difficult. It's not like digital AI, which can spread virally from one user to hundreds of millions of users through mobile phones, with extremely fast dissemination.

Therefore, building a distribution platform and distribution network is also a very crucial part, which concerns how to specifically deploy this capability on mobile vehicles and physical entities.

36Kr: How does Zhuoyu handle distribution?

Yu Beibei: We have our own methods. For example, we cooperate with partners to define hardware standards. After defining these hardware standards, we authorize and distribute the hardware through partners. This is the hardware distribution part.

On the software side, for example, we can encapsulate model capabilities in a mobile capabilities SDK and provide it to partners who do not have the ability to post-train models. We can also package it as "Mobile AI": once the model is good enough, we open-source it so that other parties can post-train on top of it. This is another distribution method.

We can also directly create "Mobile Agents". In the future, for applications with low safety and real-time requirements, such as cleaning robots or lawn mowers, the device only needs to send its video stream to the cloud. The cloud computes the result and sends a trajectory straight back to the device. This may be another distribution method.

36Kr: Do these distribution methods correspond to Zhuoyu's commercial charging models?

Yu Beibei: Yes, and they are also aimed at different business scenarios.

The traditional approach, as in the passenger car or commercial vehicle business, is to sell hardware, sell software licenses, and charge development fees and non-recurring engineering fees. Internally, we call this the first growth curve.

The second growth curve is to extend the technologies that have been verified in passenger cars to fields such as Robotaxi and RoboVan. Although we also sell hardware and may charge development fees, we generally do not charge software license fees.

The software part generates revenue through profit-sharing. For example, in L4 business, as a service provider we need to keep participating in software iteration and even get involved in operations. That requires a continuous source of income, which is why it has evolved into a subscription and profit-sharing model.

36Kr: It sounds like the second growth curve is more profitable.

Yu Beibei: Compared with the revenue from the first growth curve, its profit structure is better.

We may also have different ways of distributing algorithms. Take the "Mobile Agent" as an example: this method is a bit like distributing so-called "Action Tokens".

It amounts to a consumer-grade electronic device sending a video stream to a cloud-based inference model, which then sends back a trajectory. The charging model may be an "Action Token"-style fee based on the device's number of uses and mileage driven. This is another form of subscription.
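As a rough illustration of usage-based billing of this kind, the sketch below computes a fee from a count of uses and the distance driven. The `ActionTokenRate` type, the `action_token_fee` function, and the rates are hypothetical; the interview does not specify any pricing formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionTokenRate:
    """Hypothetical price list; the interview gives no actual rates."""
    per_use: float  # assumed fee for each cloud-planned session
    per_km: float   # assumed fee for each kilometer driven


def action_token_fee(uses: int, km: float, rate: ActionTokenRate) -> float:
    """Return the fee for a billing period from use count and mileage."""
    if uses < 0 or km < 0:
        raise ValueError("uses and km must be non-negative")
    return uses * rate.per_use + km * rate.per_km
```

For example, with a hypothetical rate of 0.5 per use and 0.25 per kilometer, 4 uses covering 8 km in total would cost 4.0 units.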

36Kr: Will Zhuoyu handle all aspects of subsequent operations and maintenance?

Yu Beibei: L2 systems do not need operations and maintenance. Only L4 systems do, and that requires a remote monitoring system that constantly monitors the vehicle's operation and takes over remotely when necessary.

This is a bit like the OnStar service in the past. You need to pay a fee to use this service. Once a vehicle enables L4 functions, whether it's for trunk logistics or passenger cars, as long as L4 is enabled, an additional fee needs to be paid.

In the future, when the sensor and computing power configurations of passenger cars can support L4 functions, owners may still use the L2+ system most of the time. When they want to enable the L4 function, they will need to pay an additional fee for each kilometer driven in L4 mode, because a system will always be monitoring it.

36Kr: Do you think the business models of L2 and L4 will be completely different?

Yu Beibei: Yes, the business models of L2 and L4 are completely different. From our perspective, we believe that L4 should be first implemented in urban areas and then extended to highway scenarios.

From an engineering safety perspective, for an accident of the same nature, the degree of harm on the highway is much more serious than in urban areas.

36Kr: Is the fact that industry players are moving towards physical AI the start of a new round of elimination?

Yu Beibei: A new round of industry reshuffle may be about to begin. All companies engaged in autonomous driving will, in the near future, transform into mobile physical AI companies.

Competing in mobile physical AI is itself a cross-industry competition. It may no longer even be a contest among the industry's existing players. We also have to compete with players who originally worked in digital AI and now want to move into embodied intelligence and physical AI.

36Kr: What exactly is Zhuoyu's moat?

Yu Beibei: We believe there are two. The first is model capability. There is currently no consensus on the iteration paradigm or on the final model architecture. New architectures such as 3D DiT or V-JEPA may emerge in the future, but these are all unknowns.

Second, distribution capabilities are actually a very high threshold. How to build a distribution platform and distribution network, create an ecosystem, and collaborate with different partners for distribution is definitely a very high threshold.

This article was originally produced by Xiao Man. For reprints or content cooperation, please see the Reprint Instructions; unauthorized reprinting will be held accountable.
