Meta introduces VLM³: depth estimation accuracy reaches 0.90, and vision models turn out to be native 3D learners. Qwen3-VL-4B enables unified modeling across multiple tasks

On many tasks it matches or even surpasses expert vision models

Meta and Princeton University have jointly proposed VLM³. Building on a standard vision-language model, it uses unified data organization and training to model four types of tasks in one framework: three-dimensional object understanding, metric depth estimation, pixel matching, and camera pose estimation. The work also systematically evaluates how well a standard VLM performs at fine-grained three-dimensional perception.

Three-dimensional spatial perception is a core capability in fields such as autonomous driving, robotics, and three-dimensional reconstruction. Its goal is to recover the spatial structure, scale, and geometric relationships of the real world from two-dimensional images. Compared with two-dimensional vision tasks such as image classification and object recognition, three-dimensional perception requires not only semantic understanding but also accurate spatial reasoning and geometric modeling. It has therefore long been regarded as one of the most challenging research directions in computer vision.

In recent years, vision-language models (VLMs) have made remarkable progress on two-dimensional tasks such as classification, recognition, and segmentation, thanks to their unified architecture and large-scale pre-training. However, on fine-grained tasks that require accurate spatial reasoning, such as depth estimation, pixel matching, and camera pose estimation, standard VLMs still lag behind specialized three-dimensional models. Three-dimensional vision currently has no general-purpose foundation models of the kind that exist for two-dimensional vision. Common methods still rely on expert models tailored to specific tasks, with their own network structures, loss functions, and training strategies.

Recent studies have shown that a standard VLM already has some pixel-level depth perception ability without any special three-dimensional adaptation. This finding suggests that general vision-language models may have stronger three-dimensional representation ability than expected. It raises a question worth investigating thoroughly: can a standard VLM handle a wider range of fine-grained three-dimensional perception tasks without additional encoders, visual cues, or task-specific modules?

To answer this question, Meta and Princeton University have jointly proposed the VLM³ (VLM Cubed) framework. The study builds on a standard vision-language model and, through unified data organization and training, models four types of tasks in one framework: three-dimensional object understanding, metric depth estimation, pixel matching, and camera pose estimation. It also systematically evaluates how well a standard VLM performs at fine-grained three-dimensional perception.

The results of this study were published on the preprint platform arXiv under the title "VLM3: Vision Language Models Are Native 3D Learners".

Highlights of the study:

* On the SpatialRGPT benchmark, the more compact VLM³-4B outperforms the larger SpatialRGPT-8B, and does so without an additional encoder.

* Compared with the previous best vision-language model, DepthLM-7B, VLM³-4B raises the average δ₁ accuracy from 0.84 to 0.90 and matches the performance of the specialized depth estimation model UnidepthV2.

* VLM³ reduces the endpoint error (EPE) of the baseline vision-language model by an order of magnitude and outperforms classic expert models such as DKM and RoMa.

* VLM³ raises the AUC₃₀° metric from a near-random level (5%) to 94%, outperforming VGGT and reaching a level comparable to DA3-Giant.
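For readers unfamiliar with the metrics in these highlights, here is a minimal sketch of how δ₁ and EPE are commonly computed. It assumes the usual depth-estimation convention that δ₁ is the fraction of points with max(pred/gt, gt/pred) below 1.25, and that EPE is the mean Euclidean distance between predicted and ground-truth match positions. The function names are illustrative, and the paper's exact evaluation protocol (masking, per-dataset averaging) may differ.

```python
import math

def delta1(pred_depths, gt_depths, threshold=1.25):
    """Fraction of valid points whose depth ratio stays below `threshold`.

    Assumption: the common convention max(pred/gt, gt/pred) < 1.25.
    Points without valid ground truth (gt <= 0) are skipped.
    """
    hits = 0
    valid = 0
    for p, g in zip(pred_depths, gt_depths):
        if g <= 0:
            continue  # no ground truth at this point
        valid += 1
        # A non-positive prediction counts as a miss.
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    return hits / valid if valid else float("nan")

def endpoint_error(pred_points, gt_points):
    """Mean Euclidean distance (in pixels) between predicted and true matches."""
    dists = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_points, gt_points)
    ]
    return sum(dists) / len(dists)

if __name__ == "__main__":
    print(delta1([1.0, 2.0, 4.0], [1.1, 2.0, 2.0]))
    print(endpoint_error([(0, 0), (3, 4)], [(0, 0), (0, 0)]))
```

Under this convention, a δ₁ of 0.90 means that 90% of evaluated points have a predicted depth within a factor of 1.25 of the ground truth.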

View the paper: https://hyper.ai/papers/2605.30561

Hybrid dataset for multi-task three-dimensional perception

Three-dimensional perception tasks involve factors such as scene scale, viewpoint change, camera parameters, and geometric relationships, which place high demands on the quality and coverage of training data. To support learning a unified three-dimensional representation, the study builds a hybrid data system that covers single-view and multi-view scenes and spans three groups of tasks: metric depth estimation, three-dimensional object understanding, and pixel matching together with camera pose estimation.

For metric depth estimation, the researchers use a large-scale hybrid dataset drawn from different scenes. The base data comes from DepthLM and includes common three-dimensional scene datasets such as Argoverse2, Waymo, NuScenes, ScanNet++, Taskonomy, HM3D, and Matterport3D. On top of this, 10 million self-collected outdoor street images are added, expanding the training pool from 16 million to 26 million images. In total, about 32 million images and 320 million depth annotation points are used to train the final model, covering indoor spaces, outdoor areas, street views, and complex open environments.

Unlike existing work, VLM³ does not sample uniformly but assigns each dataset a training weight based on its size, learning difficulty, and generalization value. Experiments show that small datasets are more prone to overfitting during hybrid training, and that simply adding data sources does not necessarily improve performance. The research team therefore lowers the training weights of some small datasets to improve overall generalization.
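The weighting idea can be sketched as follows. This is an illustration only: the article does not give the actual weights, so the dataset names, sizes, and the `multipliers` table below are hypothetical, and the sketch simply scales size-proportional sampling by a per-dataset factor so that selected small datasets are drawn less often.

```python
import random

def mixture_weights(dataset_sizes, multipliers=None):
    """Turn dataset sizes into normalized sampling probabilities.

    `multipliers` (hypothetical) scales individual datasets up or down,
    e.g. a value below 1.0 for a small dataset that tends to overfit.
    """
    multipliers = multipliers or {}
    raw = {
        name: size * multipliers.get(name, 1.0)
        for name, size in dataset_sizes.items()
    }
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}

def sample_sources(weights, k, seed=0):
    """Draw k dataset names according to the mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)

if __name__ == "__main__":
    sizes = {"large_outdoor": 900_000, "small_indoor": 100_000}  # made-up sizes
    uniform = mixture_weights(sizes)
    reduced = mixture_weights(sizes, multipliers={"small_indoor": 0.5})
    print(uniform["small_indoor"], reduced["small_indoor"])
    print(sample_sources(reduced, k=5))
```

In practice the multipliers would be tuned from validation results rather than fixed by hand as they are here.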

For three-dimensional object understanding, the study uses the full standard SpatialRGPT dataset, which contains about 1 million training images with associated qualitative and quantitative questions and answers. This dataset has become an important evaluation standard for three-dimensional object understanding. Many of its images lack camera intrinsics, which is closer to real application conditions and therefore reflects the model's spatial reasoning ability more realistically.

For pixel matching and camera pose estimation, the research team builds a unified multi-view training dataset. It integrates 14 common data sources, such as BlendedMVS, DynamicReplica, SailVOS3D, and ScanNet++, and contains about 9.9 million image pairs in total. To ensure training quality, the researchers keep only samples in which the visible overlap between the two images exceeds 25%. In addition, 30 independent scenes from ScanNet++ are held out as a separate test set to avoid leakage between training and test data. Each data source is weighted by its number of original image pairs, which further improves the stability and adaptability of training.
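The filtering and weighting steps described above could look roughly like this. The record layout (`source` and `overlap` keys) is an assumption made for illustration, and how the overlap ratio itself is measured is not specified in the article.

```python
from collections import Counter

MIN_OVERLAP = 0.25  # from the article: keep pairs with more than 25% visible overlap

def source_weights(pairs):
    """Weight each data source by its share of the original (unfiltered) pairs."""
    counts = Counter(pair["source"] for pair in pairs)
    total = sum(counts.values())
    return {source: count / total for source, count in counts.items()}

def filter_pairs(pairs, min_overlap=MIN_OVERLAP):
    """Keep only image pairs whose overlap ratio exceeds the threshold."""
    return [pair for pair in pairs if pair["overlap"] > min_overlap]

if __name__ == "__main__":
    pairs = [  # hypothetical records
        {"source": "BlendedMVS", "overlap": 0.50},
        {"source": "BlendedMVS", "overlap": 0.10},
        {"source": "ScanNet++", "overlap": 0.25},
    ]
    print(source_weights(pairs))
    print(filter_pairs(pairs))
```

Note that the weights are computed before filtering, matching the article's statement that weighting uses the number of original image pairs.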

The VLM³ model: Unified three-dimensional learning under the principle of minimal changes

The goal of VLM³ is not to create a new three-dimensional vision architecture but to find out how far a standard vision-language model can go on fine-grained three-dimensional tasks when its original structure is left unchanged. The entire framework therefore follows a "principle of minimal changes" and introduces no additional encoders, special loss functions, or task-specific modules. Instead, it optimizes three aspects: the input representation, the spatial localization method, and the data organization strategy.

The study uses Qwen3-VL-4B as the base model and trains it entirely with the standard supervised fine-tuning (SFT) paradigm, consistent with the pre-training and fine-tuning process of existing vision-language models. This design keeps the framework directly compatible with common VLM systems, with no need for a special training pipeline.

Overview of VLM³

First, VLM³ proposes a unified image normalization strategy to address inconsistent camera parameters across data sources. The study finds that camera intrinsics often differ significantly between datasets from different sources, and some web images lack camera parameters altogether, which directly impairs the model's ability to learn spatial geometric relationships. The framework therefore projects all input images into a standard focal-length space and estimates missing intrinsics with an existing single-image calibration model, reducing the distribution shift caused by differences in imaging conditions.
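One plausible reading of "projecting into a standard focal-length space" is to rescale each image so that its focal length in pixels equals a fixed canonical value. The sketch below shows the arithmetic under that assumption; the canonical value of 1000 px and the function name are illustrative and not taken from the paper.

```python
def normalize_to_canonical_focal(width, height, fx, fy, cx, cy,
                                 canonical_focal=1000.0):
    """Compute the resize that maps intrinsics (fx, fy) to a canonical focal length.

    Assumption: resizing an image by a factor s scales its focal length and
    principal point by the same factor, so s = canonical_focal / f.
    Returns the target size and the rescaled intrinsics.
    """
    sx = canonical_focal / fx
    sy = canonical_focal / fy
    return {
        "width": round(width * sx),
        "height": round(height * sy),
        "fx": fx * sx,
        "fy": fy * sy,
        "cx": cx * sx,
        "cy": cy * sy,
    }

if __name__ == "__main__":
    # A 1920x1080 image with a 2000 px focal length is halved in size.
    print(normalize_to_canonical_focal(1920, 1080, 2000.0, 2000.0, 960.0, 540.0))
```

After this step, an object of a given size at a given distance covers the same number of pixels regardless of the source camera, which is what makes metric depth comparable across datasets.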

Second, VLM³ uses a unified text-based spatial localization paradigm. Traditional three-dimensional vision models usually rely on additional visual cues, rendered markers, or specially designed position-encoding modules for pixel-level localization. VLM³ instead normalizes image coordinates into a unified coordinate space and expresses positions as text. The model can then use its native language modeling ability to handle pixel localization, region localization, and cross-view correspondence without additional visual modules. A single image can also carry multiple localization questions and answers, which significantly improves training efficiency. In depth estimation, a single sample provides about 10 times more supervision signals than the traditional approach, while the computational cost remains almost the same.
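A minimal sketch of text-based localization is shown below. The 0 to 1000 coordinate grid, the `(x, y)` text format, and the question template are all assumptions for illustration; the article only states that coordinates are normalized into a unified space and written as text, and that one image can carry several questions.

```python
import re

GRID = 1000  # assumed size of the normalized coordinate space

def pixel_to_text(x, y, width, height, grid=GRID):
    """Map a pixel position to a resolution-independent string such as '(500, 250)'."""
    nx = min(grid, max(0, round(x / width * grid)))
    ny = min(grid, max(0, round(y / height * grid)))
    return f"({nx}, {ny})"

def text_to_pixel(text, width, height, grid=GRID):
    """Parse a coordinate string back into pixel units for a given image size."""
    match = re.fullmatch(r"\((\d+), (\d+)\)", text.strip())
    if match is None:
        raise ValueError(f"not a coordinate: {text!r}")
    nx, ny = int(match.group(1)), int(match.group(2))
    return nx / grid * width, ny / grid * height

def build_depth_qa(points, depths_m, width, height):
    """Pack several depth questions for one image into (question, answer) pairs."""
    qa = []
    for (x, y), depth in zip(points, depths_m):
        where = pixel_to_text(x, y, width, height)
        question = f"How far is the point {where} from the camera, in meters?"
        qa.append((question, f"{depth:.2f}"))
    return qa

if __name__ == "__main__":
    for question, answer in build_depth_qa([(320, 120), (10, 400)], [2.5, 7.25], 640, 480):
        print(question, "->", answer)
```

Because every question reuses the same encoded image, adding more points per sample increases supervision without re-running the vision encoder, which is consistent with the efficiency gain described above.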

The third core contribution is a carefully designed data mixing strategy. Whereas many methods rely on complex network structures to improve performance, VLM³ concentrates its optimization on data organization. Through extensive experiments, the research team finds that blindly enlarging the dataset or training with uniform weights often leads to performance saturation or even degradation. A differentiated sampling strategy based on dataset size and task characteristics improves the model's three-dimensional representation ability more effectively. Data mixing is therefore treated as an important part of the framework, not merely an auxiliary factor in training.

Based on these designs, VLM³ models four types of three-dimensional tasks in a unified way. In depth estimation, supervision samples are built through text-based pixel localization. In three-dimensional object understanding, text-based bounding-box coordinates replace the special mask encoder. In pixel matching, cross-view correspondence is turned into a coordinate prediction problem. In camera pose estimation, the complex geometric parameters are broken down into text questions and answers about quantities such as translation distance, translation direction, and rotation angle. Tasks that were originally handled by separate models are thus integrated into the autoregressive generation framework of a standard VLM.
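To make the pose decomposition concrete, the sketch below derives the three quantities named above from a relative pose given as a rotation matrix `R` and translation vector `t`. The formulas are standard geometry (rotation angle from the matrix trace, distance as the vector norm, direction as the unit vector); how VLM³ actually phrases these as questions and answers is not given in the article, so the text template is hypothetical.

```python
import math

def decompose_pose(R, t):
    """Split a relative pose into translation distance, unit direction, and rotation angle.

    R is a 3x3 rotation matrix as nested lists, t a 3-vector.
    The rotation angle follows from trace(R) = 1 + 2*cos(angle).
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp numeric noise
    angle_deg = math.degrees(math.acos(cos_angle))
    distance = math.sqrt(sum(c * c for c in t))
    if distance > 0:
        direction = tuple(c / distance for c in t)
    else:
        direction = (0.0, 0.0, 0.0)  # no translation, direction undefined
    return distance, direction, angle_deg

def pose_to_qa(R, t):
    """Render the decomposed pose as text question-answer pairs (hypothetical template)."""
    distance, direction, angle_deg = decompose_pose(R, t)
    dir_text = ", ".join(f"{c:.2f}" for c in direction)
    return [
        ("How far did the camera move, in meters?", f"{distance:.2f}"),
        ("In which direction did the camera move?", f"({dir_text})"),
        ("By how many degrees did the camera rotate?", f"{angle_deg:.1f}"),
    ]

if __name__ == "__main__":
    R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degrees about z
    for question, answer in pose_to_qa(R_z90, (3.0, 0.0, 4.0)):
        print(question, "->", answer)
```

A rotation angle alone does not fix the rotation axis, so a complete pose would need further questions; the article lists these three quantities only as examples.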

Usage example of VLM³

For the first time, standard vision-language models achieve high-precision three-dimensional perception on multiple fine-grained tasks

To systematically evaluate the effectiveness of VLM³, the research team runs experiments on four types of tasks (metric depth estimation, three-dimensional object understanding, pixel matching, and camera pose estimation) and compares the results with those of general vision-language models and current mainstream expert models.

In metric depth estimation, the study compares VLM³ with general VLMs on 9 public datasets and with the current best expert models on 5 representative benchmarks, using δ₁ as the main evaluation metric. The results are listed in the following table. VLM³-4B outperforms the previous representative method, DepthLM-7B, across the board: the average accuracy rises from 0.84 to 0.90, and new records are set on several datasets. Its overall performance also reaches the level of specialized depth estimation models such as UnidepthV2 and MoGe-2.

Comparison of VLM³ with VLMs

In the task of three-dimensional...
