
Training 100 million Gaussian points on a single card to reconstruct a 25-square-kilometer city: the 3DGS memory wall is broken by a CPU "add-on"

QbitAI · 2025-12-23 15:24
RTX 4090 + 128 GB of CPU memory = a reconstructed 25-square-kilometer city

Want to reconstruct a city using 3D Gaussian Splatting (3DGS)?

In the past, this often meant using an expensive GPU cluster. Now, researchers have provided another answer: a single RTX 4090, combined with sufficient CPU memory, can also accomplish city-scale 3D reconstruction.

A research team from New York University presented a system named CLM (CPU-offloaded Large-scale 3DGS training) at ASPLOS 2026. By transferring the parameters that consume the most GPU memory during 3DGS training to CPU memory, this work enables a single consumer-grade graphics card to train Gaussian point models with hundreds of millions of points, significantly lowering the hardware threshold for large-scale neural rendering.

Challenges in Large-scale Application of 3DGS

3D Gaussian Splatting (3DGS) has become an important technique in neural rendering thanks to its high-quality results and extremely fast rendering speed. But when researchers tried to apply it to complex scenes such as city blocks and large indoor spaces, a problem quickly surfaced: GPU memory is the most immediate and hardest bottleneck to overcome.

A high-precision 3DGS model typically contains tens of millions, or even hundreds of millions, of Gaussian points. Each Gaussian carries dozens of learnable parameters such as position, shape, color, and opacity, and gradients and optimizer states must also be kept during training. The researchers point out that even a 24 GB card like the RTX 4090 can hold the complete training state of only about 10 to 20 million Gaussian points, far from enough to cover a city-scale scene.
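
For a sense of scale, here is a rough back-of-envelope estimate (not from the paper; FP32 values and an Adam optimizer with two state tensors per parameter are assumed):

```python
# Rough per-Gaussian training-memory estimate for 3DGS (illustrative, not from the paper).
params_per_gaussian = 59   # 3 position + 4 rotation + 3 scale + 1 opacity + 48 SH coefficients
bytes_per_value = 4        # FP32

param_bytes = params_per_gaussian * bytes_per_value   # parameters
grad_bytes  = param_bytes                             # gradients
adam_bytes  = 2 * param_bytes                         # Adam first and second moments

per_gaussian = param_bytes + grad_bytes + adam_bytes  # ~944 bytes per Gaussian

for millions in (15, 20, 100):
    gib = millions * 1e6 * per_gaussian / 2**30
    print(f"{millions}M Gaussians -> ~{gib:.0f} GiB of training state")
# 15M -> ~13 GiB, 20M -> ~18 GiB, 100M -> ~88 GiB: well beyond a 24 GB RTX 4090
# once rendering buffers and activations are added on top.
```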

Previous approaches to scaling up were unsatisfactory: either train in parallel on multiple GPUs, which is expensive, or cut the number of Gaussians through compression, pruning, or partitioned training, often at the expense of reconstruction quality.

Most Gaussians are Idle in GPU Memory

The inspiration for CLM comes from a systematic observation of the training process.

The researchers found that in each per-view rendering step of 3DGS training, only a tiny fraction of the scene's Gaussian points actually participate in the computation. In large-scale scenes, rendering a single image typically touches less than 1% of the Gaussian points; most of the remaining parameters go unused in that training step.

Based on this observation, they built CLM around a simple idea: stop keeping all Gaussian parameters permanently resident in GPU memory, and instead load them dynamically according to the current view.

Solving the GPU Memory Bottleneck through System Collaboration

CLM is not simply a matter of moving data from the GPU to the CPU; it is a system solution designed around CPU-GPU collaboration. The researchers summarize it as three key mechanisms.

1. Attribute Segmentation: Keep Only "Key Attributes" in the GPU

In CLM, the 59 learnable parameters of each Gaussian point are divided into two categories.

The "key attributes" used for frustum culling and visibility checks, namely position, rotation, and scale (10 floating-point numbers in total), stay resident in GPU memory. They account for less than 20% of the memory a single Gaussian occupies, yet are enough to decide whether that Gaussian will be used in the current view.

The remaining approximately 80% of "non-key attributes", such as spherical harmonic coefficients, opacity, and their optimizer states, are offloaded to the larger-capacity CPU memory and are only loaded into the GPU when needed.
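
A minimal PyTorch-style sketch of this split, using hypothetical tensor names and a toy scene size (the actual CLM implementation is organized differently), might look like this:

```python
import torch

N = 1_000_000  # toy scene size; real city-scale scenes reach ~10^8 Gaussians

# Key attributes stay resident on the GPU: position (3) + rotation (4) + scale (3)
# = 10 floats per Gaussian, enough to decide visibility for any camera.
key_attrs = {
    "position": torch.zeros(N, 3, device="cuda"),
    "rotation": torch.zeros(N, 4, device="cuda"),
    "scale":    torch.zeros(N, 3, device="cuda"),
}

# Non-key attributes (the remaining ~49 floats) are kept in pinned CPU memory and
# gathered on demand; in CLM their optimizer states are offloaded as well.
nonkey_attrs = {
    "opacity":   torch.zeros(N, 1).pin_memory(),
    "sh_coeffs": torch.zeros(N, 48).pin_memory(),
}

def load_visible(visible_idx: torch.Tensor) -> dict:
    """Copy only the visible Gaussians' non-key attributes from CPU to GPU."""
    idx_cpu = visible_idx.cpu()
    return {name: buf[idx_cpu].cuda() for name, buf in nonkey_attrs.items()}
```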

2. Pre-rendering Frustum Culling and Selective Loading

Unlike traditional 3DGS, which integrates frustum culling logic into the rendering kernel, CLM explicitly calculates the indices of visible Gaussian points in the current view before rendering.

The system first uses the GPU-resident key attributes to perform fast frustum culling, then loads from CPU memory only the full parameters of the visible Gaussian points, and finally hands them to the GPU for rendering and backpropagation. This sharply reduces the wasted computation and memory traffic the GPU would otherwise spend on invisible Gaussians.

This change transforms the problem from "buying larger GPU memory" to "fully utilizing the existing CPU memory".

It is worth noting that the pre-rendering frustum culling in CLM is also a standalone optimization. Because traditional 3DGS fuses culling into the rendering kernel, GPU threads waste work on the many Gaussians outside the frustum; by computing the in-frustum indices first and feeding only those points to the rendering kernel, CLM cuts both computation and memory use. The same technique also speeds up pure GPU training without any offloading.
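
A simplified sketch of this "cull first, then gather" flow, assuming the GPU-resident positions and the load_visible helper from the previous snippet plus a hypothetical render_and_backprop call (real culling also accounts for each Gaussian's extent):

```python
import torch

def visible_indices(positions: torch.Tensor, view_proj: torch.Tensor,
                    margin: float = 1.3) -> torch.Tensor:
    """Pre-rendering frustum culling on the GPU-resident Gaussian centers.

    positions: (N, 3) centers on the GPU; view_proj: (4, 4) view-projection matrix.
    Returns indices of Gaussians whose centers fall inside a slightly enlarged
    clip-space frustum (the margin crudely compensates for Gaussian extent).
    """
    ones = torch.ones(positions.shape[0], 1, device=positions.device)
    clip = torch.cat([positions, ones], dim=1) @ view_proj.T   # (N, 4) clip-space coords
    in_front = clip[:, 3] > 1e-6
    ndc = clip[:, :3] / clip[:, 3:4].clamp(min=1e-6)           # normalized device coords
    inside = in_front & (ndc.abs() <= margin).all(dim=1)
    return inside.nonzero(as_tuple=True)[0]

# Per training view: cull with the resident key attributes, then fetch only what is visible.
# idx   = visible_indices(key_attrs["position"], view_proj)  # often <1% of all Gaussians
# dense = load_visible(idx)                                   # small CPU -> GPU transfer
# render_and_backprop(view_proj, key_attrs, dense, idx)       # hypothetical renderer call
```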

3. How to Make the CPU Help without Slowing Down?

The most common problem with involving the CPU in training is that frequent data transfers drag down overall speed.

CLM mitigates this risk through a multi-layer system design:

1. Micro-batch Pipeline: Split a training batch into multiple micro-batches (typically one image per micro-batch). Using double buffering and asynchronous execution, the parameter loading for micro-batch i+1 overlaps with the GPU backpropagation of micro-batch i, and the gradient write-back of micro-batch i overlaps with the forward pass of micro-batch i+1. This keeps the active memory footprint independent of batch size and effectively hides communication latency (a minimal double-buffering sketch follows this list).

2. Caching Mechanism: Exploit the spatial locality between consecutive views to cache frequently reused Gaussian points and avoid loading the same data from the CPU over and over.

3. Intelligent Scheduling: The team even modeled the rendering order as a Traveling Salesman Problem (TSP) and solved for the view ordering with the highest Gaussian reuse rate, maximizing cache hits and minimizing data transfer (a simplified greedy version is sketched after this list).
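
For the micro-batch pipeline (item 1 above), here is a minimal double-buffering sketch assuming PyTorch CUDA streams and hypothetical cull / fetch / forward_backward helpers; the real CLM pipeline also overlaps gradient write-back and is considerably more involved:

```python
import torch

copy_stream = torch.cuda.Stream()             # dedicated stream for CPU <-> GPU copies
compute_stream = torch.cuda.default_stream()  # forward/backward run here

def prefetch(view, cull, fetch):
    """Cull, then launch the CPU->GPU copy of non-key attributes on the copy stream."""
    idx = cull(view)                          # uses GPU-resident key attributes
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        dense = fetch(idx)                    # gather + copy of the visible subset
        done.record()                         # marks when this micro-batch's data is ready
    return idx, dense, done

def train_one_batch(views, cull, fetch, forward_backward):
    """One image per micro-batch; overlap the copy for view i+1 with compute for view i."""
    current = prefetch(views[0], cull, fetch)
    for i, view in enumerate(views):
        nxt = prefetch(views[i + 1], cull, fetch) if i + 1 < len(views) else None
        idx, dense, done = current
        compute_stream.wait_event(done)       # wait only for *this* micro-batch's copy
        forward_backward(view, idx, dense)    # render, loss, backprop on the compute stream
        current = nxt
```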
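
For the caching and scheduling ideas (items 2 and 3), here is a hedged sketch using a simple greedy ordering over per-view visibility sets; it only stands in for the TSP formulation and solver actually used in the paper:

```python
def order_views_greedy(visibility: list[set[int]]) -> list[int]:
    """Order views so that consecutive views share as many visible Gaussians as possible.

    visibility[v] is the set of Gaussian indices visible from view v. High overlap
    between consecutive views means most non-key attributes are already cached on
    the GPU and need not be re-fetched from CPU memory.
    """
    remaining = set(range(len(visibility)))
    order = [remaining.pop()]                                  # arbitrary starting view
    while remaining:
        prev = visibility[order[-1]]
        nxt = max(remaining, key=lambda v: len(prev & visibility[v]))
        remaining.remove(nxt)
        order.append(nxt)
    return order

# Toy example: views 0 and 2 overlap heavily, so they end up adjacent in the schedule.
print(order_views_greedy([{1, 2, 3}, {7, 8}, {2, 3, 4}]))      # e.g. [0, 2, 1]
```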

Through this series of designs, the CPU is no longer just an auxiliary "slow warehouse" but becomes a computational resource that can collaborate efficiently with the GPU.

Test Results: a 6.7-fold Scale Increase and Better Quality on a Single RTX 4090

How effective is it? The experimental data in the paper provides strong evidence:

Scale Breakthrough: CLM significantly increases the trainable model size in almost every scenario.

On the "MatrixCity BigCity" aerial dataset covering 25.3 square kilometers, the conventional GPU-only approach can train at most 15.3 million Gaussian points on an RTX 4090 before running out of memory. Using CPU memory, CLM successfully trained 102.2 million Gaussian points, a 6.7-fold increase in model scale, and 2.2 times more than offloading alone achieves on the same card.

Quality Improvement: More parameters lead to more accurate reconstruction. The PSNR (Peak Signal-to-Noise Ratio) of the 102.2-million-point model reaches 25.15 dB, significantly better than the 23.93 dB of the 15.3-million-point model.

Controllable Speed: Despite the communication overhead, the carefully overlapped computation and transfer let CLM's training throughput on the RTX 4090 reach 55% to 90% of the enhanced baseline. On the slower RTX 2080 Ti, where GPU computation more easily masks the communication latency, throughput even reaches 86% to 97% of the baseline.

High Versatility: The approach is independent of any specific rendering backend (gsplat, inria-3dgs, etc.) and can be extended to other splatting-style methods (2DGS, mesh-splatting).

"Reducing Costs and Increasing Efficiency" for Large-scale 3D Reconstruction

From a research perspective, CLM is systems engineering aimed squarely at real deployment bottlenecks. Its core contribution is to systematically bring CPU memory and compute into the resource budget of 3DGS training for the first time. Without relying on multi-GPU clusters, it offers academia and industry a cost-effective, feasible path to ultra-large-scale scene reconstruction.

From an industrial perspective, growing demand for applications such as digital twins and large-scale map reconstruction calls for efficient, low-cost 3D reconstruction tools, and being able to scale up reliably on real hardware matters for this line of work. By reorganizing existing computational resources through software-hardware co-design, CLM shows one way to push 3DGS toward practical deployment without investing in additional dedicated hardware.

The code for the project has been open-sourced on GitHub, together with a complete tutorial covering everything from quick start to extreme stress testing.

About the authors: Hexu Zhao is a Ph.D. student at the Courant Institute of Mathematical Sciences, New York University, working on machine learning systems; he graduated from Tsinghua University's Yao Class in 2023. Xiwen Min is a master's student at the Courant Institute of Mathematical Sciences, New York University, and graduated from Shanghai Jiao Tong University in 2023 (see the paper for information on the other authors).

Project advisors: Professor Jinyang Li and Professor Aurojit Panda

Paper Link: https://arxiv.org/abs/2511.04951

Project Homepage: https://tarzanzhao.github.io/CLM-GS

Code Repository: https://github.com/nyu-systems/CLM-GS

This article is from the WeChat official account "QbitAI", author: Fei Yang. Republished by 36Kr with permission.