
NVIDIA open-sources its autonomous driving VLA for the first time, and Robotaxi enters the "Android era". The model comes from an all-Chinese team led by Wu Xinzhou.

智能车参考 (Intelligent Car Reference) · 2025-12-02 15:52
Mr. Huang can no longer hide his ambition.

What's it like to make a Robotaxi as easily as making a mobile phone?

NVIDIA has just unveiled Alpamayo-R1 at the top AI conference NeurIPS -

NVIDIA's first autonomous driving VLA, with performance over 30% better than traditional end-to-end systems, and it's open-source right upon release.

With multi-modal large models now dominating autonomous driving, the barrier to entry for Robotaxi may genuinely drop: buy NVIDIA chips directly, adapt NVIDIA's open-source VLA for your own algorithms, then customize for style and scenarios... exactly the playbook of smartphone manufacturers.

Jensen Huang's ambition in autonomous driving is no longer hidden: NVIDIA aims to become the "Android" in the autonomous driving race.

What problems does Alpamayo-R1 solve?

Interestingly, NVIDIA is the "pioneer" of end-to-end autonomous driving, yet the main problem the latest VLA research aims to solve is the flaws of end-to-end systems.

End-to-end systems go from perception to control in a single pass. Trained on data from experienced human drivers, they can in theory approximate human behavior arbitrarily closely and handle all kinds of corner cases.

In practice, however, mass-produced systems still fail frequently in scenarios such as oncoming vehicles making illegal left turns, pedestrians suddenly stepping into the road, temporary construction zones, and obscured traffic signs.

NVIDIA attributes the root cause of end-to-end systems failing in corner cases to the limitations of end-to-end systems - they can see but not understand, which is the so-called "black-box" characteristic.

NVIDIA's approach to solving the problem is the vision-language-action model, the popular VLA.

Let's first look at the results:

Both the Baseline model for comparison and Alpamayo-R1 are trained on the CoC dataset built by NVIDIA, which is also an important part of this research.

CoC stands for Chain of Causation, which is the key basis for the model's interpretability.

The Baseline model in the comparison experiment is a pure trajectory-output model trained on the CoC dataset, with no reasoning capability.

The performance improvements in the experiment are as follows:

Planning accuracy improves by 12%, the boundary violation rate drops by 35%, the near-collision rate drops by 25%, reasoning-action consistency improves by 37%, and end-to-end latency falls to 99 ms.

Therefore, the improvements of Alpamayo-R1 are mainly reflected in scenarios where errors were most likely to occur before - that is, it's closer to a "driver who can truly make judgments".

Previously, there was no way to tell whether an end-to-end system truly understood a scene. So how does NVIDIA confirm that this model can "understand"?

How is it solved?

The important work of Alpamayo-R1 includes three points. First is the Chain of Causation (CoC) dataset mentioned earlier.

This is a brand-new data annotation system. Each piece of driving data not only shows "what was done" but also "why it was done". For example, "Slow down and change lanes to the left because there's a moped waiting at the red light ahead and the left lane is empty":

CoC is an evolution and extension of CoT (chain of thought), focused specifically on causation. It largely avoids the problems of typical CoT-style driving data: vague behavior descriptions, muddled causal relationships, and a disconnect between the behavioral reasoning and its causes.

Of course, CoC annotation and calibration still rely on human annotators.
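To make the idea concrete, a single CoC annotation of the kind in the lane-change example above could be sketched as an action paired with its ordered causes. The class and field names below are my own illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class CoCRecord:
    """One hypothetical Chain-of-Causation annotation:
    an action ("what was done") plus the causal chain ("why")."""
    action: str
    causes: list = field(default_factory=list)  # ordered causal factors

    def explain(self) -> str:
        # Render the annotation as a human-readable causal sentence.
        return f"{self.action}, because " + " and ".join(self.causes)


rec = CoCRecord(
    action="slow down and change lanes to the left",
    causes=[
        "a moped is waiting at the red light ahead",
        "the left lane is empty",
    ],
)
print(rec.explain())
```

The point of the structure is that the cause list is explicit and ordered, so an auditor (or a training loss) can check whether the stated reasons actually support the action.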

AR1 itself is built on NVIDIA's Cosmos Reason, a reasoning vision-language model designed specifically for Physical AI:

The most prominent feature of the overall architecture is that it relies on causally structured reasoning rather than free narrative. The model must justify the safety and compliance of its actions from historically observable evidence -

This is the second important innovation: a multi-stage training strategy.

First, perform modality injection on large-scale driving data to learn the basic mapping from vision to action.

In the second stage, conduct supervised fine-tuning on the CoC causation chain data to teach the model to "think clearly before driving".

Finally, reinforcement learning (RL) further optimizes reasoning quality, reasoning-action consistency, and trajectory safety.

This phased and goal-oriented training process makes the model more robust in open scenarios and long-tail dangerous scenarios.
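The three stages above can be read as a staged training schedule. The sketch below is purely illustrative - the stage names, data labels, and `run_schedule` helper are my assumptions, not NVIDIA's actual pipeline:

```python
# Illustrative staged-training schedule (all names are assumptions).
STAGES = [
    {"name": "modality_injection",   # stage 1: vision -> action mapping
     "data": "large_scale_driving_logs", "objective": "imitation"},
    {"name": "coc_sft",              # stage 2: supervised fine-tuning on CoC
     "data": "chain_of_causation", "objective": "reason_then_act"},
    {"name": "rl_refinement",        # stage 3: RL on reasoning + trajectory quality
     "data": "policy_rollouts", "objective": "reward_optimization"},
]


def run_schedule(stages):
    """Execute stages strictly in order; return the completed stage names."""
    completed = []
    for stage in stages:
        # A real trainer would construct the dataset and loss from the
        # stage config here; we only record the ordering.
        completed.append(stage["name"])
    return completed


print(run_schedule(STAGES))
```

The design point is that each stage has a narrower, harder objective than the last, which is what the article means by "phased and goal-oriented".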

At the trajectory-output stage, AR1 introduces a trajectory decoder based on a diffusion model, which can generate continuous, dynamically feasible driving trajectories under real-time constraints. This module combines the language-reasoning output with physical constraints to achieve a seamless handoff from reasoning to control:

The basic principle of the diffusion model is to gradually add noise to the data through a forward process until the data becomes completely random noise, and then gradually remove the noise through a backward process to generate new data samples.

This generation method allows the model to capture the complex distribution of the data and generate diverse samples by controlling the process of adding and removing noise.

Let's summarize AR1's process and principle. Like other autonomous driving systems, the input consists of multi-camera, multi-timestep observation frames; high-level language input (such as navigation instructions or driving goals) can optionally be provided.

All inputs (including the vehicle's own motion history) are uniformly encoded into a multi-modal token sequence, ordered chronologically and by sensor, and then fed into the backbone model Cosmos Reason for reasoning and prediction.

Each camera view first goes through a lightweight CNN and a temporal attention module for feature compression and temporal modeling, and then is fused into a BEV (bird's-eye view) representation. After that, all modalities (images, navigation text, and vehicle state) are tokenized and uniformly input into the Transformer.

The model's output includes three types of tokens: reasoning traces, meta-actions, and future trajectory predictions.

The biggest innovation is that this multi-modal autonomous driving model has interpretable semantic understanding and couples it with motion-state perception, producing input-output behavior with an explicit causal relationship.
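The input layout described above - chronological first, then per-sensor, with text and ego-state tokens appended - can be sketched as a toy sequence builder. The token tuples and ordering rules here are assumptions for illustration, not the real tokenizer:

```python
# Toy sketch of the multimodal input token layout (names/format are assumptions).
def build_input_sequence(frames, nav_text, ego_history):
    """frames: list of {camera_name: feature} dicts, one per timestep."""
    tokens = []
    for t, frame in enumerate(frames):            # chronological order first...
        for cam, feat in sorted(frame.items()):   # ...then per-sensor within a step
            tokens.append(("img", t, cam, feat))
    tokens += [("text", tok) for tok in nav_text.split()]   # optional navigation text
    tokens += [("state", s) for s in ego_history]           # ego motion history
    return tokens


frames = [{"front": 0.1, "left": 0.2}, {"front": 0.3, "left": 0.4}]
seq = build_input_sequence(frames, "turn left", ego_history=[(0.0, 0.0), (1.0, 0.2)])

# The backbone would then autoregressively emit three output token groups:
# reasoning trace -> meta-action -> future trajectory prediction.
```

The ordering matters because the Transformer's attention can then attend across time, across cameras, and between language and state tokens in one sequence.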

Where does Alpamayo-R1 come from?

Alpamayo-R1 can be regarded as a VLA model, but it's fundamentally different from the common "end-to-end + large language model add-on" VLA in the industry.

Alpamayo-R1 is a completely native multi-modal model, based on Cosmos Reason in the Cosmos basic world model released by NVIDIA at CES at the beginning of the year.

Cosmos is, for NVIDIA, the "intermediate layer" linking AI and the physical world, providing a general-purpose Physical AI "Android" template for all industries - a "generalist" world model.

The base model is trained with two approaches: diffusion and autoregression. For the diffusion-based WFM, pre-training covers text-to-world generation and video-to-world generation; for the autoregressive WFM, pre-training covers next-token generation and text-conditioned video-to-world generation.

For Alpamayo-R1, the driving-specific training built on top of this base is the CoC-dataset training process described above.

The base model of Alpamayo-R1 is Cosmos Reason, the reasoning extension of Cosmos; its main ability is understanding video data through chain-of-thought reasoning.

This Alpamayo-R1 actually confirms the new layout that Jensen Huang has planned for NVIDIA in the AI wave - beyond computing infrastructure, NVIDIA also wants to be the underlying "Android" for physical AI in areas such as robotics and autonomous driving.

With Alpamayo-R1, Jensen Huang is promoting the architectural paradigm and training method of this VLA rather than the raw capability of any one base model - Alpamayo-R1 is flexible, open, and compatible with various large base models.

The real value of this research lies in the brand-new annotation system of the CoC dataset and the large model paradigm that can use chain-of-thought reasoning to infer scenario causal relationships.

Jensen Huang has repeatedly stated that Physical AI is the next "hotspot" in artificial intelligence recognized by NVIDIA. The most crucial part is to build the "intermediate layer" that links the physical world and AI. Companies and individuals in all industries, even those without strong AI algorithm capabilities, can use powerful base models and process tools to create their own products.

Recently, NVIDIA officially announced its Robotaxi strategy, with vehicles and solutions, and has signed Uber as a partner.

The real goal is to break the current "closed" model of Robotaxi.

At the underlying hardware layer, the driver interfaces for chips and sensors are unified. Whether a car company uses lidar from Hesai or another supplier, it can be directly adapted to NVIDIA's algorithms, avoiding R&D inefficiencies caused by hardware incompatibility.

At the core algorithm level sits the just-open-sourced Alpamayo-R1, which provides the basic capabilities for L4-level autonomous driving and supports customized optimization through APIs. For example, pedestrian recognition can be enhanced in campus scenarios, and lane-changing logic can be optimized on highways.

As for the upper-layer functional interfaces - ride-hailing, dispatching, billing, and maintenance - capable ride-hailing platforms can plug in their own apps, and NVIDIA can also open these directly at the base layer. A player can launch a Robotaxi service simply by connecting to the interfaces.

If Elon Musk's multi-modal large model route poses an unprecedented technological challenge to traditional L4, then NVIDIA's open-sourcing of Alpamayo-R1 strikes at the entire Robotaxi business model -

The window period for L4 players to build their own fleets and platforms is getting shorter and shorter.

With NVIDIA's full-stack hardware and software solutions, many traditional ride-hailing platforms and taxi companies that previously lacked the strength to enter the Robotaxi field can now use it right out of the box.

So, will the autonomous driving race eventually form a situation where "Android" and "Apple" confront each other?

Will it be NVIDIA and Tesla corresponding to Android and Apple?

One more thing

The "pioneering work" of end-to-end systems is actually NVIDIA's 2016 paper End to End Learning for Self-Driving Cars, although the architecture at that time was still based on traditional convolutional neural networks.

After the power of the Transformer became apparent, the end-to-end concept was first applied and mass-produced by Tesla, and it remains the most important path for the transformation of the automotive industry today.

However, for leading players and for "pioneer" NVIDIA itself, end-to-end has now become an old technology paradigm in need of "subversion". Leading this cutting-edge exploration at NVIDIA is our old acquaintance - Wu Xinzhou: