
A new feed-forward 3D Gaussian Splatting method: a team at Zhejiang University proposes "voxel alignment" to fuse multi-view 2D information directly in 3D space.

Quantum Bit · 2025-09-29 15:21
Overcoming the two biggest bottlenecks of feed-forward reconstruction

As 3D reconstruction increasingly moves toward engineering applications, Feed-Forward 3D Gaussian Splatting (Feed-Forward 3DGS) is rapidly approaching industrial deployment.

However, existing feed-forward 3DGS methods mainly adopt the "pixel-aligned" strategy, which maps each 2D pixel individually to one or more 3D Gaussians.

This approach seems intuitive, but it runs into two hard ceilings: 2D features are difficult to align precisely in 3D space, and the number of Gaussian primitives is tightly bound to the pixel grid, so it cannot be allocated adaptively according to scene complexity.

VolSplat abandons the pixel-aligned paradigm and proposes a "voxel-aligned" feed-forward framework: by fusing view information in 3D space, it attacks both bottlenecks at their root, making high-quality multi-view rendering more robust, efficient, and easier to productionize.

Comparative experiments on public datasets show that VolSplat outperforms multiple pixel-aligned baselines in both visual quality and geometric consistency on RealEstate10K and the indoor ScanNet dataset.

The core idea of VolSplat: Moving "alignment" from 2D to 3D

Existing pixel-aligned feed-forward 3DGS faces two unavoidable pain points.

First, multi-view alignment: matching based on 2D features struggles to guarantee geometric consistency across views. When depth estimation is unstable, occlusions occur, or viewpoints differ substantially, 2D features are hard to align precisely in 3D space, often producing floating artifacts and geometric distortion.

Second, the Gaussian density limitation: Gaussian generation is constrained by the pixel grid and cannot adapt to scene complexity. Complex structures end up under-represented, while much of the representation budget is spent on flat or redundant regions.

Together, these two issues directly hinder the scaling and robust performance of feed-forward 3DGS in dense-view, structurally complex, and large-scale scenes.

To overcome these two challenges, VolSplat's core idea is simple but incisive: instead of making isolated predictions at the 2D pixel level, the 2D features from all views are back-projected and aggregated into a unified 3D voxel grid using each view's predicted depth map. Aggregation, multi-scale feature fusion, and refinement (via a sparse 3D U-Net) are then performed in this shared coordinate system, and finally the Gaussian parameters are regressed only on the occupied voxels.

The effects of this paradigm shift are immediate and far-reaching: inside the 3D grid, inconsistencies between views are naturally reconciled, and Gaussian density is no longer bound to the pixel grid but is allocated dynamically according to voxel occupancy and scene complexity. The direct benefits can be summarized in four points:

(1) Significantly enhanced cross-view consistency: the method no longer relies entirely on error-prone 2D feature matching; information is fused in 3D space, which is more stable.

(2) On-demand allocation of Gaussian density: the number of Gaussians adapts to scene complexity, with high density on complex structures and low density in flat areas, yielding a more refined and resource-efficient representation (see the back-of-the-envelope comparison after this list).

(3) Stronger geometric consistency: Voxel aggregation and multi-scale refinement by the 3D U-Net effectively reduce "floating" and artifacts, making details and boundaries clearer.

(4) Easy to fuse with external 3D signals: 3D signals such as depth maps and point clouds can be naturally integrated into the voxelization process without complex projection operations.
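To make point (2) concrete, here is a back-of-the-envelope comparison of Gaussian budgets; the numbers are illustrative, not taken from the paper:

```python
# Illustrative Gaussian budgets (made-up numbers, not from the paper).
views, H, W = 8, 256, 256

# Pixel-aligned: one Gaussian per pixel per view, regardless of content.
pixel_aligned = views * H * W
print(pixel_aligned)  # 524288 for every scene, simple or complex

# Voxel-aligned: one Gaussian per *occupied* voxel, so the budget tracks
# scene complexity instead of image resolution.
occupied_sparse_room = 60_000      # mostly flat walls
occupied_cluttered_room = 400_000  # dense furniture and clutter
```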

To facilitate engineering implementation and extension, VolSplat breaks the overall pipeline into three clear modules: 2D feature and depth estimation; lifting and aggregation from pixels to voxels; and voxel-level feature refinement with Gaussian regression. Each module has a clean responsibility and interface, which helps step-by-step debugging and makes scaling and optimization in engineering more convenient.

Step 1 - 2D Feature extraction & Depth prediction

For each input image, VolSplat uses a shared image encoder (combining convolutional and Transformer layers) to extract downsampled 2D features. A per-view plane-sweep cost volume then fuses information from neighboring views, and a dense depth map is regressed for each view. This stage provides the geometric priors and feature descriptors needed for the subsequent lifting of pixels to 3D points.
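As a rough illustration of this stage, the PyTorch sketch below builds a plane-sweep cost volume for one reference view against a single neighbor and regresses depth by soft-argmin. It is a minimal sketch under simplifying assumptions (one neighbor, fronto-parallel depth hypotheses at the feature resolution, shared intrinsics); the paper's exact encoder and cost-volume design may differ.

```python
import torch
import torch.nn.functional as F

def plane_sweep_depth(ref_feat, src_feat, K, rel_R, rel_t,
                      d_min=0.5, d_max=10.0, num_planes=64):
    """Regress a dense depth map for the reference view by sweeping
    fronto-parallel planes and correlating warped neighbor features.
    Shapes: ref_feat/src_feat (C, H, W); K (3, 3); rel_R (3, 3) and
    rel_t (3,) map reference-camera coords to source-camera coords."""
    C, H, W = ref_feat.shape
    device = ref_feat.device
    depths = torch.linspace(d_min, d_max, num_planes, device=device)

    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).reshape(3, -1).float()

    K_inv = torch.linalg.inv(K)
    cost = []
    for d in depths:
        # Back-project reference pixels to depth d, move them into the
        # source camera, re-project: x_src ~ K (R (d K^-1 x_ref) + t).
        cam = rel_R @ (K_inv @ pix) * d + rel_t.unsqueeze(1)
        proj = K @ cam
        uv = proj[:2] / proj[2].clamp(min=1e-6)
        # Normalize to [-1, 1] for grid_sample.
        grid = torch.stack([uv[0] / (W - 1) * 2 - 1,
                            uv[1] / (H - 1) * 2 - 1], dim=-1)
        warped = F.grid_sample(src_feat.unsqueeze(0),
                               grid.reshape(1, H, W, 2), align_corners=True)
        # Correlation between reference and warped source features.
        cost.append((ref_feat * warped.squeeze(0)).mean(dim=0))
    cost_volume = torch.stack(cost, dim=0)            # (num_planes, H, W)

    # Soft-argmin over depth hypotheses gives a dense, differentiable depth.
    prob = torch.softmax(cost_volume, dim=0)
    return (prob * depths.view(-1, 1, 1)).sum(dim=0)  # (H, W)
```

The soft-argmin keeps depth regression differentiable, which is what allows the whole pipeline to train end-to-end.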

Step 2 - Lifting pixels to voxels and feature aggregation (Lifting + Voxelization)

Each pixel is back-projected to world coordinates according to its predicted depth to obtain a 3D point cloud with image features. Then, these points are discretized (voxelized) according to the preset voxel size. The features of the points falling into the same voxel are aggregated to obtain the initial voxel feature. This step naturally aligns features from different views in 3D space, facilitating subsequent voxel-level processing.
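A minimal sketch of this lifting-and-voxelization step might look as follows; the voxel size, mean aggregation, and tensor layout are illustrative assumptions, not the authors' exact implementation:

```python
import torch

def lift_and_voxelize(feats, depths, K, cam_to_world, voxel_size=0.05):
    """Back-project per-view features to world space using predicted depth,
    then average the features of points falling into the same voxel.
    Shapes: feats (V, C, H, W); depths (V, H, W); K (V, 3, 3);
    cam_to_world (V, 4, 4)."""
    V, C, H, W = feats.shape
    device = feats.device
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).float().reshape(1, 3, -1)

    # Unproject: X_cam = depth * K^-1 [u, v, 1]^T, then map to world coords.
    rays = torch.linalg.inv(K) @ pix                   # (V, 3, H*W)
    cam_pts = rays * depths.reshape(V, 1, -1)
    cam_h = torch.cat([cam_pts, torch.ones(V, 1, H * W, device=device)], 1)
    world = (cam_to_world @ cam_h)[:, :3]              # (V, 3, H*W)

    points = world.permute(0, 2, 1).reshape(-1, 3)     # (V*H*W, 3)
    point_feats = feats.reshape(V, C, -1).permute(0, 2, 1).reshape(-1, C)

    # Voxelize: quantize coordinates, then scatter-mean features per voxel.
    coords = torch.floor(points / voxel_size).long()
    uniq, inv = torch.unique(coords, dim=0, return_inverse=True)
    voxel_feats = torch.zeros(uniq.shape[0], C, device=device)
    voxel_feats.index_add_(0, inv, point_feats)
    counts = torch.zeros(uniq.shape[0], device=device).index_add_(
        0, inv, torch.ones(points.shape[0], device=device))
    return uniq, voxel_feats / counts.unsqueeze(1)  # occupied coords, features
```

Because every view's points land in the same world-space grid, features from different cameras that describe the same surface patch are merged by construction.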

Step 3 - Sparse 3D refinement + Gaussian prediction

The initial voxel features are input into a sparse 3D U-Net decoder. This network predicts the correction term for each voxel in a residual form, thereby achieving multi-scale fusion of local and global geometric contexts. This residual update helps the network learn only the necessary geometric refinements rather than reconstructing all features, which is both robust and efficient.
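The residual-update pattern can be sketched as below. The paper uses a sparse 3D U-Net; here a per-voxel MLP stands in so the sketch runs without a sparse-convolution library, and the point being illustrated is only the residual form of the update:

```python
import torch
import torch.nn as nn

class ResidualRefiner(nn.Module):
    """Residual refinement of voxel features. A per-voxel MLP is a
    stand-in for the paper's sparse 3D U-Net; the key pattern is that
    the network predicts a correction term, not a full replacement."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))

    def forward(self, voxel_coords, voxel_feats):
        # Condition on voxel position so the stand-in sees some geometry.
        x = torch.cat([voxel_feats, voxel_coords.float()], dim=-1)
        return voxel_feats + self.net(x)  # residual update
```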

Subsequently, the parameters of each Gaussian (position offset, covariance, opacity, and color coefficients) are regressed only on the occupied voxels. Finally, novel views are rendered using Gaussian Splatting and trained end-to-end with pixel-level and perceptual losses.
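A sketch of the per-voxel Gaussian head is below. The parameterization (a bounded offset within the voxel, log-scales, a rotation quaternion, opacity, and RGB color) is a common 3DGS choice and is an assumption here, not quoted from the paper:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Regress per-voxel Gaussian parameters from refined voxel features.
    Output parameterization is an assumed, typical 3DGS layout."""
    def __init__(self, feat_dim=64, voxel_size=0.05):
        super().__init__()
        self.voxel_size = voxel_size
        # 3 offset + 3 scale + 4 quaternion + 1 opacity + 3 color = 14
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 14))

    def forward(self, voxel_coords, voxel_feats):
        out = self.mlp(voxel_feats)
        centers = (voxel_coords.float() + 0.5) * self.voxel_size
        means = centers + torch.tanh(out[:, :3]) * self.voxel_size
        scales = torch.exp(out[:, 3:6].clamp(max=4.0)) * self.voxel_size
        quats = nn.functional.normalize(out[:, 6:10], dim=-1)
        opacity = torch.sigmoid(out[:, 10:11])
        colors = torch.sigmoid(out[:, 11:14])
        return means, scales, quats, opacity, colors
```

Bounding the position offset by the voxel size is one simple way to keep each Gaussian near the voxel that predicted it, preserving the occupancy-driven density.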

Experimental highlights: Leading in both effectiveness and generalization

Beyond these results, what is particularly remarkable is VolSplat's zero-shot generalization across datasets: on the unseen ACID dataset it still maintains high performance (PSNR 32.65 dB), demonstrating strong generalization ability.

The qualitative results are even more intuitive. At edges, fine details, and complex geometry, VolSplat shows fewer floating artifacts, texture misalignments, and geometric distortions, and the distribution of Gaussians in 3D space tracks the real scene's geometry rather than being uniformly bound to the pixel grid. In real products (such as virtual house tours and indoor walkthroughs), this translates directly into a more robust and natural visual experience.

VolSplat is not an endpoint but a new research direction, opening up fresh possibilities for feed-forward 3D reconstruction. In robotics and autonomous driving, it provides more stable 3D perception input; in AR/VR, a smoother and more realistic rendering experience; and in 3D vision research, a new way to fuse multi-modal data under a unified voxel framework.

Going forward, VolSplat can serve as a reference point for feed-forward 3D reconstruction in both academic research and engineering applications.

Paper link: https://arxiv.org/abs/2509.19297

Project homepage: https://lhmd.top/volsplat

This article is from the WeChat official account "Quantum Bit". Author: VolSplat Team. Republished by 36Kr with permission.