FSDrive unifies the VLA model and the world model, pushing autonomous driving toward visual inference.
Multimodal large models for autonomous driving mostly use text or symbols as intermediaries in their chain of reasoning, which easily blurs spatio-temporal relationships and loses fine-grained information. FSDrive (FutureSightDrive) proposes a spatio-temporal visual chain-of-thought (spatio-temporal CoT) that lets the model "think with images": a unified future image frame serves as the intermediate reasoning step, and visual reasoning combines the predicted future scene with perception results. Without modifying the original MLLM architecture, the method activates image generation through vocabulary expansion plus autoregressive visual generation, and injects physical priors through a progressive, easy-to-difficult visual CoT. The model acts both as a "world model" that predicts the future and as an "inverse dynamics model" for trajectory planning.
- Project homepage: https://miv-xjtu.github.io/FSDrive.github.io/
- Paper link: https://arxiv.org/abs/2505.17685
- Code address: https://github.com/MIV-XJTU/FSDrive
Multimodal large language models (MLLMs), with their world knowledge and interpretable reasoning, are accelerating their entry into the end-to-end "vision-language-action" (VLA) paradigm for autonomous driving. However, existing methods mostly rely on discrete textual CoT (e.g., rule descriptions and coordinates), which is essentially a high-level symbolic compression of visual information and suffers from cross-modal semantic gaps and insufficient representation of spatio-temporal relationships.
Core question: for autonomous driving, which interacts deeply with the physical world, should the thinking process be closer to the visual deduction of "simulation and imagination" rather than pure symbolic logic?
FSDrive proposes a spatio-temporal visual CoT that unifies future-scene generation and perception results (lane lines, 3D detection boxes) into a single future image frame used as the intermediate reasoning step. The ordinary future frame carries temporal evolution, while the red lane lines and 3D boxes provide spatial priors for the drivable area and key dynamic objects, so causal inference and decision planning are completed in the visual domain.
Key innovations in this paper:
1) A unified visual intermediary replaces text/table intermediaries, eliminating cross-modal semantic gaps;
2) Image generation is "activated" on an existing MLLM at minimal cost: VQ-style visual tokens are introduced simply by expanding the vocabulary, without major architectural modifications or massive training;
3) Progressive visual CoT: first generate a coarse-grained perception map of lane lines/3D boxes ("physical constraints"), then generate the detailed future frame, explicitly injecting physical plausibility.
Value: the approach keeps a simple end-to-end pipeline with interpretable visual reasoning, while learning the laws of world evolution from large amounts of unlabeled video data.
Method
Overall framework of FSDrive:
- Input: surround-view images and task instructions; output: a unified future frame (with red lane lines/3D boxes overlaid) as the spatio-temporal CoT, plus the final trajectory.
- Dual roles: the model first acts as a "world model" to generate the unified future frame (spatio-temporal CoT), and then acts as an "inverse dynamics model" to plan the trajectory from the current observation and the predicted future; see the sketch after this list.
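The dual-role design can be summarized as a two-pass inference flow. Below is a minimal Python sketch assuming a hypothetical `mllm` wrapper whose `generate` call returns either image tokens or trajectory text depending on the prompt; all names and prompt strings are illustrative, not from the released code.

```python
# Minimal sketch of FSDrive's two-pass inference flow.
# `mllm.generate`, its arguments, and the prompts are hypothetical placeholders.

def plan_with_spatiotemporal_cot(mllm, surround_images, instruction):
    # Pass 1: act as a world model -- autoregressively generate the unified
    # future frame (ordinary future frame + red lane lines / 3D boxes).
    cot_frame = mllm.generate(
        images=surround_images,
        prompt="Predict the unified future frame with lane lines and 3D boxes.",
        output_type="image_tokens",  # decoded to pixels by the detokenizer
    )

    # Pass 2: act as an inverse dynamics model -- condition on the current
    # observation plus the predicted future to produce the trajectory.
    trajectory = mllm.generate(
        images=surround_images + [cot_frame],
        prompt=f"Instruction: {instruction}. Plan the future trajectory.",
        output_type="text",  # waypoints emitted as text tokens
    )
    return cot_frame, trajectory
```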
Unified pre-training paradigm: understanding + generation
- Maintain understanding: follow VQA tasks (in the OmniDrive-nuScenes/DriveLM style) to preserve the semantic understanding ability of the original MLLM.
- Activate generation: without modifying the MLLM structure, incorporate visual tokens from a VQ-VAE/MoVQGAN codebook into the LLM vocabulary, expanding it into a shared vocabulary for images and text; image tokens are then generated directly by autoregressive next-token prediction and decoded back to pixels by the detokenizer (a minimal sketch follows this list).
- Data-efficient: compared with some unified understanding-and-generation methods, it requires roughly 0.3% of the data volume and needs neither training from scratch nor complex decoder fusion.
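A minimal sketch of the vocabulary-expansion idea, using the standard Hugging Face `add_tokens` / `resize_token_embeddings` calls. GPT-2 is used here only for brevity; the same two calls apply to an MLLM's language backbone. The codebook size and token naming are assumptions, not values from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: append one new token per VQ codebook entry so image tokens can be
# predicted autoregressively like text tokens. Codebook size and the
# "<img_i>" naming scheme below are illustrative assumptions.
CODEBOOK_SIZE = 8192

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One placeholder string per codebook index: "<img_0>" ... "<img_8191>".
image_tokens = [f"<img_{i}>" for i in range(CODEBOOK_SIZE)]
num_added = tokenizer.add_tokens(image_tokens)

# Grow the embedding and LM-head matrices so the new ids are trainable.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} image tokens, vocab size = {len(tokenizer)}")
```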
Progressive visual CoT (physical priors → detail completion)
- First infer future lane lines (Q_l): indicate the drivable area and inject static physical constraints;
- Then infer future 3D detections (Q_d): characterize the motion patterns of key dynamic objects and inject dynamic constraints;
- Finally, generate the complete future frame (Q_f) under the above constraints: complete the details and improve realism and consistency.
- This easy-to-difficult order is used during training; at inference, the three are merged into a single unified future frame to improve efficiency (see the sketch below).
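The easy-to-difficult curriculum amounts to ordering the generation targets per training sample. A small sketch of how the three sub-queries could be sequenced; the prompt strings and sample layout are assumptions, not the released training code.

```python
# Sketch of the progressive visual-CoT curriculum at training time:
# lane lines (static constraints) -> 3D boxes (dynamic constraints) -> full frame.
# Prompt wording and the sample structure are illustrative assumptions.
PROGRESSIVE_STEPS = [
    ("Q_l", "Draw the future lane lines."),           # drivable-area prior
    ("Q_d", "Draw the future 3D detection boxes."),   # motion of key agents
    ("Q_f", "Generate the complete future frame."),   # detail completion
]

def build_training_sequence(sample):
    """Turn one raw sample into an ordered list of (prompt, image-token target) pairs."""
    return [(prompt, sample[key]) for key, prompt in PROGRESSIVE_STEPS]
```

At inference time the three targets collapse into the single unified future frame, so only one image is generated per planning step.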
Planning with the spatio-temporal visual CoT
- Combine the ordinary future frame (temporal evolution) with the red lane lines/3D boxes (spatial structure) into a unified image intermediary Q_CoT, which is used directly as the intermediate reasoning step feeding the planning stage. The model completes the causal chain in the visual domain, significantly reducing the semantic loss and ambiguity introduced by symbolization.
- Formulation: the future trajectory W_t is generated autoregressively conditioned on the current observation I_t and Q_CoT, optionally together with navigation instructions and the ego-vehicle state; one way to write this is given below.
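In the notation above, the planning step can be read as an autoregressive factorization over trajectory tokens (a hedged paraphrase of the description, not an equation copied from the paper):

$$ p\left(W_t \mid I_t, Q_{\mathrm{CoT}}\right) = \prod_{i=1}^{N} p\left(w_i \mid w_{<i},\, I_t,\, Q_{\mathrm{CoT}}\right), \qquad W_t = \{w_1, \dots, w_N\}, $$

where the $w_i$ are the tokens encoding the planned waypoints, and optional navigation instructions or ego-vehicle state simply join the conditioning set.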
Training strategy
- Initialization: start from an existing MLLM (e.g., Qwen2-VL-2B or LLaVA-7B); freeze the visual encoder and fine-tune the LLM backbone.
- Phase one (unified pre-training): mixed training on VQA, future-frame generation, and progressive perception generation (lane lines/3D boxes), using large amounts of unlabeled nuScenes video for future-frame prediction.
- Phase two (SFT): jointly optimize scene understanding (DriveLM GVQA) and trajectory planning (nuScenes, with the unified spatio-temporal CoT as an intermediate step); task-specific reasoning is triggered by different prompts.
- Implementation details: the MoVQGAN visual codebook is merged into the vocabulary and pixels are recovered by the detokenizer; pre-training runs for 32 epochs and SFT for 12 epochs; only the LLM is fully fine-tuned (a minimal sketch of the freezing setup follows).
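A minimal sketch of the "freeze the visual encoder, fine-tune the LLM" setup, assuming a Hugging Face Qwen2-VL checkpoint. The substring match on parameter names is a heuristic assumption and may need adjusting for other MLLMs or transformers versions.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Sketch: freeze the vision tower and fully fine-tune the rest of the model,
# mirroring the training recipe described above. The name-based filter is a
# heuristic assumption, not the released training code.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.bfloat16
)

for name, param in model.named_parameters():
    # Parameters belonging to the visual encoder stay frozen; all others train.
    param.requires_grad = "visual" not in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M")
```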
Experiments
End-to-end trajectory planning
Compared with Doe-1 (Lumina-mGPT-7B), which also has visual generation ability, FSDrive achieves lower L2 error and lower collision rates without using the ego-vehicle state:
- Under the ST-P3 evaluation protocol: average L2 0.53 m vs. 0.70 m; collision rate 0.19% vs. 0.21% (based on Qwen2-VL-2B).
- Under the UniAD evaluation protocol: average L2 0.96 m vs. 1.26 m; collision rate 0.40% vs. 0.53%.
Compared with recent methods in the LLaVA-7B series (such as OmniDrive and RDA-Driver), FSDrive is strongly competitive under the same settings, indicating that the framework transfers readily to mainstream MLLMs.
Quality of future frame generation (FID)
At a resolution of 128×192, FSDrive (autoregressive) achieves FID = 10.1, better than most diffusion-based world models (e.g., GEM at 10.5) and significantly better than Doe-1 (15.9), balancing real-time performance and quality.
Scene understanding (DriveLM GVQA)
The final score is 0.57, exceeding OmniDrive (0.56), Cube-LLM, and others; language-generation metrics and multiple-choice accuracy (0.72) are consistently robust, indicating the effectiveness of the unified "understanding + generation" pre-training.
Qualitative analysis
Under wrong navigation instructions, FSDrive can correct the path through visual reasoning over "observation + future prediction" and reduce potential collisions, reflecting its inverse-dynamics ability and interpretability.
Conclusion
This paper proposes FSDrive: a unified spatio-temporal visual CoT serves as the intermediate reasoning step, connecting the visual expression of future-scene prediction and perception results and enabling the VLA model to complete causal reasoning and trajectory planning in the visual domain.
The method requires no modification of the original MLLM structure; image generation is activated through vocabulary expansion and autoregressive training. Combined with the easy-to-difficult progressive visual CoT, it explicitly injects physical constraints and improves the realism and consistency of future predictions.
Systematic evaluation on the three tasks of planning, generation, and understanding shows that FSDrive achieves strong competitiveness, and in some cases SOTA performance, in open-loop settings at a lower data/compute cost, while significantly reducing collision risk, pushing autonomous driving from symbolic reasoning toward visual reasoning.
Limitations and outlook: for real-time reasons, currently mainly front-view future frames are generated; future work can extend to unified surround-view prediction. As the model moves toward deployment, ethical and compliance issues such as safety, privacy, and regulation should be emphasized to ensure the technology is beneficial and reliably deployed.
This article is from the WeChat official account "Machine Intelligence", published by 36Kr with authorization.