
Li Auto Invests 15 Billion Yuan in R&D, Publishes 12 Top-Conference Papers at Once, and Fully Unveils Its Autonomous Driving "Family Assets"

智东西 (Zhidongxi), 2026-06-09 08:26
Full updates to the world model, end-to-end approach, and VLA architecture.

12 Papers from Li Auto Selected for Top Computer Vision Conference CVPR!

According to a Che Dongxi report on June 8, CVPR 2026, the top conference in computer vision and pattern recognition, was held recently. Twelve papers from Li Auto were accepted, and several were also presented and discussed on site.

CVPR is one of the three top conferences in computer vision, along with ICCV and ECCV, so the significance of having 12 papers accepted at once is self-evident.

The 12 accepted papers cover key areas such as world models, end-to-end planning, multi-modal perception, reinforcement learning, cognitive models, and language and visual intelligence.

It can be said that Li Auto's competition in intelligence is shifting from product features to underlying models, simulation, safety, and reasoning capabilities.

Behind this is Li Auto's sustained investment in R&D in recent years.

According to Li Auto, as of the end of the first quarter of 2026, it had kept R&D investment at about 3 billion yuan per quarter for five consecutive quarters. This means Li Auto invested about 15 billion yuan in R&D over those five quarters. Its annual R&D expenditure in 2025 reached 11.3 billion yuan.

In the past five years, Li Auto has published nearly a hundred papers in top conferences and journals such as CVPR, ICCV, ECCV, NeurIPS, SIGGRAPH, IROS, and ICRA.

However, rather than simply looking at "how many papers were published", it is more important to focus on what problems these 12 papers solve.

Che Dongxi dissected these 12 papers to summarize four main lines of Li Auto's underlying technologies in autonomous driving.

01.

Achieved Four Breakthroughs in World Models

Upgraded the Foundation of Simulation and Safety

In the field of autonomous driving, the world model aims to solve the problem of whether a vehicle can understand and simulate the world before taking action.

Four papers from Li Auto in the field of world models were selected for CVPR 2026. They cover four aspects: depth estimation, 3D reconstruction, traffic rule cognitive evaluation, and safety risk prediction, forming a technical chain from "restoring the real world" to "understanding traffic rules" and then to "predicting dangerous consequences".

How will the road structure change? How might other traffic participants move? Will a certain trajectory bring risks? How should we choose between complex traffic rules?

For autonomous driving on real roads, the world model is not only the basis for simulation but also an important foundation for improving safety and handling long-tail scenarios.

▲ Schematic diagram of the InfiniDepth high-precision continuous depth estimation method

In terms of geometric understanding, InfiniDepth (a high-precision continuous depth estimation method) focuses on the most basic and crucial issue a vehicle faces in understanding the 3D world: depth.

Traditional depth estimation methods usually predict results on fixed-resolution image grids, so they are limited by resolution and capture fine structures and geometric boundaries poorly.

InfiniDepth represents depth as a continuous neural implicit field, allowing the model to query depth at any 2D coordinate. This supports higher-resolution, finer-grained depth estimation and shows advantages in fine-detail regions and novel view synthesis tasks.
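
The idea of querying depth at arbitrary coordinates, rather than reading it off a fixed pixel grid, can be illustrated with a toy sketch. This is not InfiniDepth's implementation: the learned implicit network is replaced here by plain bilinear interpolation over a coarse grid, and the class and method names (`ContinuousDepthField`, `query`, `render`) are hypothetical.

```python
from typing import List


class ContinuousDepthField:
    """Toy stand-in for a continuous depth field.

    Assumption: a real implicit field would be a learned network; here
    bilinear interpolation over a coarse depth grid plays that role.
    """

    def __init__(self, coarse_depth: List[List[float]]):
        self.grid = coarse_depth
        self.rows = len(coarse_depth)
        self.cols = len(coarse_depth[0])

    def query(self, u: float, v: float) -> float:
        """Return depth at normalized coordinates (u, v), each in [0, 1]."""
        x = min(max(u, 0.0), 1.0) * (self.cols - 1)
        y = min(max(v, 0.0), 1.0) * (self.rows - 1)
        x0, y0 = int(x), int(y)
        x1 = min(x0 + 1, self.cols - 1)
        y1 = min(y0 + 1, self.rows - 1)
        fx, fy = x - x0, y - y0
        top = self.grid[y0][x0] * (1 - fx) + self.grid[y0][x1] * fx
        bottom = self.grid[y1][x0] * (1 - fx) + self.grid[y1][x1] * fx
        return top * (1 - fy) + bottom * fy

    def render(self, height: int, width: int) -> List[List[float]]:
        """Sample the field on any output grid (height and width >= 2)."""
        return [
            [self.query(c / (width - 1), r / (height - 1)) for c in range(width)]
            for r in range(height)
        ]


# A 2x2 grid of depths in meters, queried between the grid points.
field = ContinuousDepthField([[2.0, 4.0], [6.0, 8.0]])
center_depth = field.query(0.5, 0.5)  # a location no grid cell covers
dense = field.render(5, 5)            # output resolution chosen at query time
```

The point of the sketch is that the output resolution is a choice made at query time, not a property of the stored representation.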

For vehicle scenarios, this ability helps to more accurately restore the 3D structures of roads, vehicles, obstacles, etc., providing a more reliable geometric basis for subsequent simulation and environmental modeling.

In this way, the vehicle can more precisely determine the distance of each object in the picture, laying the foundation for 3D environment restoration and simulation modeling.

▲ Unposed-to-3D generates 3D vehicles from real driving images

In terms of simulation asset construction, Unposed-to-3D (a method for generating 3D vehicles from real driving images) addresses another practical problem: where to obtain high-quality 3D vehicle assets.

The paper points out that existing 3D vehicle generation methods often rely on synthetic data for training, which has a domain gap with real-world road images. The generated results may also suffer from inconsistent poses and inaccurate scales, making them difficult to use directly in driving simulation environments.

Unposed-to-3D learns 3D vehicle reconstruction from real driving images through a two-stage framework and introduces scale-awareness and appearance-coordination modules, so that the generated vehicles better match real-world driving simulation in size, pose, and lighting appearance.

This means that in the future, building large-scale, diverse simulated traffic environments can rely less on manual modeling and draw usable assets from the real world more efficiently.

▲ DriveCombo evaluation framework for complex traffic rule reasoning

The world model needs not only to "see accurately" and "build realistically" but also to understand the rules of the traffic world. DriveCombo, released by Li Auto, is a benchmark for complex traffic rule reasoning.

The paper points out that existing traffic rule evaluations often focus on single-rule scenarios, such as traffic sign recognition or simple right-of-way judgment. In real-world driving, however, multiple rules often apply simultaneously and may even conflict.

DriveCombo constructs a combined traffic rule reasoning benchmark spanning text and vision and proposes a five-level cognitive ladder that progresses from single-rule understanding to multi-rule integration and conflict resolution.

Evaluations of 14 mainstream multi-modal large models show that model performance decreases systematically as task complexity increases, especially in rule-conflict scenarios.

In simple terms, DriveCombo is not a driving model but a set of "exam questions" used to test whether multi-modal large models can understand complex traffic rules, especially how to make judgments when multiple rules conflict.

▲ Overall framework of the AD-R1 fair world model for safety prediction

In addition, safety prediction is a key step for the world model to move towards closed-loop training. AD-R1 focuses on a core problem in end-to-end driving reinforcement learning: if the world model is trained only on safe expert data, it may form an "optimistic bias". When facing dangerous trajectories, it still tends to predict a seemingly safe future, for example by ignoring collision or road boundary risks.

AD-R1 proposes the concept of a "fair world model". By generating risk scenarios such as collisions and road departures through counterfactual synthesis, the model learns to realistically predict dangerous consequences and serves as an internal critic in closed-loop reinforcement learning, providing safety feedback for candidate actions.

In other words, the model learns not only "how a good driver drives" but also "what consequences wrong actions will lead to". This has direct significance for improving the system's reliability in long-tail risk scenarios.
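
The role of an internal critic can be sketched with a toy example that contrasts an optimistic scorer, which sees only progress, with one that also predicts dangerous outcomes. This is an illustrative sketch rather than AD-R1's method: the geometric risk check stands in for a learned world model, and all names, trajectories, and thresholds are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitudinal x, lateral y) in meters


def predicted_risk(trajectory: List[Point], obstacle: Point,
                   road_half_width: float, safe_radius: float = 1.0) -> float:
    """Toy critic: 1.0 if a collision or road departure is predicted, else 0.0."""
    for x, y in trajectory:
        off_road = abs(y) > road_half_width
        distance = ((x - obstacle[0]) ** 2 + (y - obstacle[1]) ** 2) ** 0.5
        if off_road or distance < safe_radius:
            return 1.0
    return 0.0


def fair_score(trajectory: List[Point], obstacle: Point,
               road_half_width: float, risk_weight: float = 100.0) -> float:
    """Forward progress minus a heavy penalty for predicted danger."""
    progress = trajectory[-1][0]
    return progress - risk_weight * predicted_risk(trajectory, obstacle, road_half_width)


candidates: Dict[str, List[Point]] = {
    "straight": [(5.0, 0.0), (10.0, 0.0), (15.0, 0.0)],
    "swerve_off_road": [(5.0, 2.0), (10.0, 5.0), (15.0, 5.0)],
    "nudge_left": [(5.0, 1.0), (10.0, 2.0), (14.0, 1.0)],
}
obstacle = (10.0, 0.0)

# Optimistic scorer: ignores risk, so it picks the path through the obstacle.
optimistic_best = max(candidates, key=lambda name: candidates[name][-1][0])

# Fair scorer: penalizes predicted collisions and road departures.
fair_best = max(candidates, key=lambda name: fair_score(candidates[name], obstacle, 3.0))
```

In this toy setup the optimistic scorer selects `straight`, which drives into the obstacle, while the fair scorer selects `nudge_left`.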

In this way, the world model is no longer just about generating realistic pictures or scenarios but is evolving into a more complete intelligent system that is "inferable, evaluable, and trainable".

These four studies together form Li Auto's systematic layout in the field of world models and provide more solid technical support for intelligent driving to move from "seeing the world" to "understanding the world, simulating the world, and avoiding risks".

02.

Cognitive Alignment and Language and Visual Intelligence

Make Model Reasoning More Accurate and Faster

The world model is crucial on the training side, while cognitive alignment, language, and visual intelligence are equally important on the inference side.

For a vehicle to move from "seeing the road" to "understanding the road", the model needs not only recognition ability but also continuous cognition, language understanding, action generation, and efficient deployment capabilities.

The key is how to make the model not only "recognize accurately" but also understand continuously, align accurately, reason efficiently, and finally execute reliably.

Li Auto has presented five key studies to address these issues. CogDriver improves the temporal stability of driving decisions, LinkVLA bridges language understanding and action generation, FastMMoE reduces the inference cost of multi-modal large models, CoV-Align improves the efficiency of fine-grained alignment between vision and language, and Switch-KD makes it easier to transfer the capabilities of large models to lightweight models.

Together, they form Li Auto's technical accumulation in the fields of cognitive models, language intelligence, and visual intelligence, and enable the vehicle to move from "seeing and judging" to "understanding, reasoning, and acting".

▲ Schematic diagram of CogDriver's method for improving the temporal stability of driving decisions

In terms of driving cognition, CogDriver targets the weakness of current vision-language models in temporal understanding, helping the system better understand driving scenarios. Many models process driving scenarios as if "describing pictures frame by frame": they lack memory of historical states and continuous intentions, which easily leads to jittery decisions and makes complex continuous maneuvers hard to complete.

CogDriver introduces a "cognitive inertia" mechanism, provides temporal supervision through a large-scale vision-language-action dataset, and adds a sparse temporal memory module to the agent, enabling the model to form a more stable internal state.

Experiments show that CogDriver improves the closed-loop driving score on Bench2Drive by 22% and reduces the average trajectory error on nuScenes by 21%, indicating that temporal consistency has direct value in improving planning stability.

In short, CogDriver adds "memory" and "inertia" to the driving model, so that when making decisions it no longer looks only at the current frame but also draws on preceding states to keep its judgment stable.
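
Why carrying state across frames reduces decision jitter can be shown with a minimal sketch. This is not CogDriver's sparse temporal memory module: an exponential moving average over per-frame decision scores is swapped in as the simplest possible form of "inertia", and the function names, scores, and `alpha` value are all hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, float]  # maneuver name -> model confidence for one frame


def decide_per_frame(frames: List[Scores]) -> List[str]:
    """Pick the top-scoring maneuver independently for every frame."""
    return [max(scores, key=scores.get) for scores in frames]


def decide_with_inertia(frames: List[Scores], alpha: float = 0.3) -> List[str]:
    """Blend each new frame into a running state before deciding.

    alpha is the weight of the newest frame; the remaining weight keeps
    the accumulated history (an exponential moving average).
    """
    state = dict(frames[0])
    decisions = [max(state, key=state.get)]
    for scores in frames[1:]:
        state = {k: (1 - alpha) * state[k] + alpha * scores[k] for k in state}
        decisions.append(max(state, key=state.get))
    return decisions


# Made-up confidences that flip back and forth between two maneuvers.
frames = [
    {"keep": 0.60, "left": 0.40},
    {"keep": 0.45, "left": 0.55},
    {"keep": 0.60, "left": 0.40},
    {"keep": 0.48, "left": 0.52},
    {"keep": 0.62, "left": 0.38},
]
jittery = decide_per_frame(frames)
stable = decide_with_inertia(frames)
```

With these illustrative scores, the frame-by-frame decisions alternate between `keep` and `left`, while the smoothed decisions stay on `keep` throughout.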

If CogDriver solves the problem of "continuous understanding", then LinkVLA further addresses "how to act after understanding".

Vision-language-action models are considered an important direction for end-to-end driving, but existing methods often have two problems: misalignment between language instructions and action outputs, and low inference efficiency caused by step-by-step generation of action sequences.

▲ Overall framework of LinkVLA, which bridges language understanding and action generation

LinkVLA unifies language and actions in a shared discrete codebook to strengthen cross-modal consistency at the structural level. It also introduces an auxiliary action understanding task, so that the model can both go from language to action and infer semantic descriptions from trajectories.

It also replaces traditional step-by-step decoding with a two-step, coarse-to-fine generation method, cutting inference time by 86% while improving instruction following and driving performance on the closed-loop driving benchmark.

In this way, LinkVLA both lowers the system's latency and makes it smarter.
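
What a shared discrete codebook means in practice can be sketched as follows. This is only an illustration of the general idea of placing words and quantized actions in one token space, not LinkVLA's actual codebook: the vocabulary, the 0.1 m bin size, and every function name are hypothetical, and the coarse-to-fine decoding step is not covered.

```python
from typing import List

# Hypothetical text vocabulary; a real model would have far more entries.
TEXT_VOCAB = ["<pad>", "turn", "left", "right", "keep", "lane"]

# Hypothetical action space: lateral offsets from -1.0 m to 1.0 m in 0.1 m bins.
NUM_ACTION_BINS = 21
ACTION_OFFSET = len(TEXT_VOCAB)  # action tokens follow the text tokens
SHARED_VOCAB_SIZE = len(TEXT_VOCAB) + NUM_ACTION_BINS


def action_to_token(offset_m: float) -> int:
    """Quantize a lateral offset into a token id in the shared vocabulary."""
    clipped = min(max(offset_m, -1.0), 1.0)
    bin_index = round((clipped + 1.0) * 10)
    return ACTION_OFFSET + bin_index


def token_to_action(token: int) -> float:
    """Map an action token id back to its lateral offset in meters."""
    return round((token - ACTION_OFFSET) * 0.1 - 1.0, 1)


def encode(words: List[str], offsets_m: List[float]) -> List[int]:
    """Encode an instruction and a trajectory as one token sequence."""
    text_tokens = [TEXT_VOCAB.index(w) for w in words]
    action_tokens = [action_to_token(o) for o in offsets_m]
    return text_tokens + action_tokens


tokens = encode(["turn", "left"], [0.0, 0.3])
decoded_offsets = [token_to_action(t) for t in tokens[2:]]
```

Because both modalities share one id space, a single sequence model can read or emit either kind of token, which is what makes tasks in both directions (language to action, and trajectory to description) expressible in the same format.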

After the model becomes smarter, another practical problem is whether it can run faster and lighter.

▲ Schematic diagram of the FastMMoE training-free acceleration method for multi-modal large models

FastMMoE proposes a training-free acceleration framework for multi-modal large models with the MoE architecture. Starting from routing behavior, it reduces unnecessary expert activation for visual tokens on the one hand, and identifies and prunes redundant visual tokens according to the routing probability distribution on the other.

Compared with simply judging which tokens can be deleted based on attention weights, FastMMoE is closer to the calculation mechanism of the MoE model itself.

Experiments show that on models such as DeepSeek-VL2 and InternVL3.5, FastMMoE can reduce FLOPs by up to 55% while retaining about 95.5% of the original performance.

This method is very helpful for scenarios such as in-vehicle terminals and cockpits that are sensitive to latency and computing power. It "lightens the load" of multi-modal large models, reducing computation without losing much capability and making the model run faster.
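
One way to picture routing-based token pruning is the sketch below. The article does not state FastMMoE's actual redundancy criterion, so this uses an assumed one: a visual token is treated as redundant when its routing distribution over experts is nearly identical to that of a token already kept. The function names, the L1 distance measure, and the threshold are hypothetical.

```python
from typing import List


def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two routing distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def prune_redundant_tokens(routing_probs: List[List[float]],
                           threshold: float = 0.2) -> List[int]:
    """Return indices of tokens to keep.

    Assumed rule: drop a token if its routing distribution lies within
    `threshold` (L1 distance) of any token that has already been kept.
    """
    kept: List[int] = []
    for i, probs in enumerate(routing_probs):
        if all(l1_distance(probs, routing_probs[j]) > threshold for j in kept):
            kept.append(i)
    return kept


# Made-up router outputs: one row per visual token, one column per expert.
routing_probs = [
    [0.70, 0.20, 0.10],  # token 0
    [0.68, 0.22, 0.10],  # token 1: routed almost exactly like token 0
    [0.10, 0.10, 0.80],  # token 2
    [0.12, 0.08, 0.80],  # token 3: routed almost exactly like token 2
    [0.30, 0.60, 0.10],  # token 4
]
kept = prune_redundant_tokens(routing_probs)
```

No retraining is involved: the decision uses only the probabilities the router already produces, which is the sense in which such a method is training-free.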

Meanwhile, in multi-modal understanding, whether language and vision can be accurately aligned also determines whether the model truly "understands".

▲ Schematic diagram of the CoV-Align fine-grained alignment method between image regions and language descriptions

CoV-Align focuses on the fine-grained alignment between image regions and text descriptions. Traditional methods often rely on text guidance to aggregate image regions, which easily leads to redundant patch-word matching and high computational costs.

CoV-Align proposes the idea of "cohesive visual semantics first". It first aggregates visual regions with consistent semantics without relying on text and then performs cross-modal alignment. This not only reduces noise but also improves efficiency.
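
The "aggregate first, then align" order can be sketched in a few lines. This is an illustration of the general idea, not CoV-Align's algorithm: greedy grouping by cosine similarity stands in for whatever aggregation the paper uses, and the embeddings, threshold, and function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_patches(patches: List[Vector], threshold: float = 0.9) -> List[Vector]:
    """Step 1 (no text involved): merge similar patches into region vectors."""
    groups: List[List[Vector]] = []
    for patch in patches:
        for group in groups:
            if cosine(patch, group[0]) >= threshold:
                group.append(patch)
                break
        else:
            groups.append([patch])
    # Each region is the mean of the patch vectors in its group.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


def align(words: Dict[str, Vector], regions: List[Vector]) -> Dict[str, int]:
    """Step 2: match each word to its most similar region."""
    return {
        word: max(range(len(regions)), key=lambda i: cosine(vec, regions[i]))
        for word, vec in words.items()
    }


# Made-up 2D embeddings: two patches of one object, two of another.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
words = {"car": [1.0, 0.1], "road": [0.0, 1.0]}

regions = group_patches(patches)
alignment = align(words, regions)
comparisons_patch_level = len(patches) * len(words)
comparisons_region_level = len(regions) * len(words)
```

In this toy case, grouping halves the number of cross-modal comparisons (4 instead of 8), which hints at where the efficiency gain comes from.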

The paper shows that CoV-Align achieves leading performance on image