CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting

¹New York University, ²Pacific Northwest National Laboratory, ³University of Washington
*Equal contribution
Accepted to appear in ASPLOS 2026

CLM trains a 102-million-Gaussian model on the MatrixCity BigCity Aerial Dataset in under 4 hours on a single RTX 4090, reaching a PSNR of 25 dB.

Overview

CLM is a 3DGS training system that enables large-scale reconstruction on a consumer-level GPU setup (e.g., a single RTX 4090).

CPU-offloading

Stores part of the model in CPU memory to overcome GPU memory limits.

Large-scale

Trains 102 million Gaussians on a single RTX 4090, enabling city-scale reconstruction that previously required multi-GPU setups.

Low-overhead

Achieves 55–97% of GPU-only training throughput by intelligently pipelining data transfers and computation.
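The gist of the overlap: while the GPU rasterizes the current view, the attributes needed by the next view are already being copied over on a separate CUDA stream. Below is a minimal sketch of this pattern in PyTorch; buffer names and sizes are illustrative, not CLM's actual code.

```python
import torch

copy_stream = torch.cuda.Stream()

# Offloaded attributes live in pinned CPU memory so copies can run asynchronously.
cpu_attrs = torch.randn(1_000_000, 48).pin_memory()
staging   = torch.empty(100_000, 48).pin_memory()      # pinned gather buffer
gpu_buf   = torch.empty(100_000, 48, device="cuda")

def prefetch(indices: torch.Tensor) -> None:
    """Gather the rows needed by the *next* view on the CPU, then start an
    asynchronous host-to-device copy on the side stream."""
    n = indices.numel()
    torch.index_select(cpu_attrs, 0, indices, out=staging[:n])
    with torch.cuda.stream(copy_stream):
        gpu_buf[:n].copy_(staging[:n], non_blocking=True)

# While the copy runs, the default stream keeps rasterizing the current view.
# Before the next view reads gpu_buf, the default stream waits for the copy:
#   torch.cuda.current_stream().wait_stream(copy_stream)
```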

Method: Offload to CPU

GPU-only Training

3D Gaussian Splatting is typically trained exclusively on GPUs. As shown in the diagram below, all Gaussians and their attributes are stored in GPU memory. Given a camera view to render, the system selects the in-frustum Gaussians and rasterizes them onto the view.
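For concreteness, the selection step amounts to projecting each Gaussian's center with the camera's view-projection matrix and keeping those that land inside a slightly padded clip space. A simplified PyTorch sketch follows; the real rasterizer also accounts for each Gaussian's projected extent.

```python
import torch

def in_frustum_mask(means: torch.Tensor, view_proj: torch.Tensor,
                    margin: float = 1.3) -> torch.Tensor:
    """Boolean mask of Gaussians whose centers fall inside the (padded) view
    frustum. `means` is (N, 3) on the GPU, `view_proj` is a 4x4 matrix."""
    ones = torch.ones_like(means[:, :1])
    hom = torch.cat([means, ones], dim=1) @ view_proj.T   # (N, 4) clip space
    ndc = hom[:, :3] / hom[:, 3:4].clamp(min=1e-6)        # perspective divide
    return (ndc.abs() <= margin).all(dim=1) & (hom[:, 3] > 0)
```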

Fundamental Limitation: While this approach offers fast training through GPU parallelism, it faces a critical bottleneck: limited GPU memory capacity. Large or intricate scenes can require hundreds of millions of Gaussians, which easily cause out-of-memory (OOM) errors on consumer-grade GPUs (e.g., 24 GB on an RTX 4090).
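A back-of-the-envelope estimate shows why: in the original 3DGS parameterization each Gaussian carries roughly 59 float32 values (position, scale, rotation, opacity, and degree-3 SH coefficients), and training also keeps gradients plus Adam's two moment buffers. Implementations differ, but the order of magnitude is:

```python
# Rough memory estimate for standard 3DGS training state (exact numbers vary
# by implementation).
N = 100_000_000                              # Gaussians
floats_per_gaussian = 3 + 3 + 4 + 1 + 48     # position, scale, rotation, opacity, SH (deg 3)
bytes_params = N * floats_per_gaussian * 4   # fp32 parameters
bytes_total = bytes_params * 4               # + gradients + Adam first/second moments
print(bytes_total / 2**30)                   # ≈ 88 GiB, vs. 24 GB on an RTX 4090
```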

CLM-Offloaded Training

CLM leverages CPU memory and CPU computation in addition to the GPU. Specifically, it keeps only the selection-critical attributes (e.g., position and shape) in GPU memory and offloads the remaining, non-critical attributes to CPU memory. During training, CLM uses the selection-critical attributes to determine which Gaussians are visible for each view (i.e., those within the view frustum) and therefore selected for rendering. It then loads only the non-critical attributes of these in-frustum Gaussians into GPU memory before performing the actual rendering, and runs the Adam updates for the non-critical attributes on the CPU.
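A simplified sketch of the per-view training step under this split is shown below (PyTorch-style; `rasterize` stands in for the differentiable 3DGS rasterizer, `in_frustum_mask` for the selection test sketched earlier, and the attribute split, sizes, and hyperparameters are illustrative assumptions rather than CLM's actual code).

```python
import torch

N = 1_000_000   # number of Gaussians (illustrative)

# Selection-critical attributes stay resident in GPU memory.
means  = torch.zeros(N, 3, device="cuda", requires_grad=True)
scales = torch.zeros(N, 3, device="cuda", requires_grad=True)
rots   = torch.zeros(N, 4, device="cuda", requires_grad=True)
gpu_optimizer = torch.optim.Adam([means, scales, rots], lr=1e-3)

# Non-critical attributes (e.g., SH color coefficients) and their Adam state
# live in pinned CPU memory.
sh_cpu = torch.zeros(N, 48).pin_memory()
sh_m, sh_v = torch.zeros_like(sh_cpu), torch.zeros_like(sh_cpu)

def cpu_adam_step(param, m, v, idx, grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Sparse Adam update on the CPU, touching only the rows used this view
    (bias correction omitted for brevity)."""
    mi, vi = m[idx], v[idx]
    mi.mul_(b1).add_(grad, alpha=1 - b1)
    vi.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    m[idx], v[idx] = mi, vi
    param[idx] -= lr * mi / (vi.sqrt() + eps)

def train_step(view):
    # 1) Select in-frustum Gaussians using only GPU-resident attributes.
    idx = in_frustum_mask(means, view.view_proj).nonzero(as_tuple=True)[0]
    idx_cpu = idx.cpu()

    # 2) Load only the selected Gaussians' non-critical attributes to the GPU.
    sh = sh_cpu[idx_cpu].to("cuda").requires_grad_()

    # 3) Render the view and backpropagate as usual.
    image = rasterize(means[idx], scales[idx], rots[idx], sh, view)  # placeholder rasterizer
    loss = (image - view.gt_image).abs().mean()
    loss.backward()

    # 4) GPU-resident attributes: ordinary GPU Adam step.
    gpu_optimizer.step(); gpu_optimizer.zero_grad()

    # 5) Non-critical attributes: move gradients back and run Adam on the CPU.
    cpu_adam_step(sh_cpu, sh_m, sh_v, idx_cpu, sh.grad.cpu())
```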

The diagram below illustrates CLM's offloading pipeline during training:

Evaluation

Scale: How many Gaussians can be trained without OOM?

CLM consistently enables significantly larger models across all scenes. The BigCity scene shows the most notable improvement: 6.7x larger than the GPU-only baseline and 2.2x larger than naive offloading on the RTX 4090.

Maximum Model Size Comparison

Speed: How fast can we train?

For small Gaussian models that fit in GPU memory, CLM achieves 55% (Ithaca) to 90% (Bicycle) of the Enhanced baseline's throughput on the RTX 4090.

Speed Comparison

Scalability: Does PSNR improve with larger models?

CLM enables training models with 102.2 million Gaussians, achieving a PSNR of 25.15 dB on the BigCity scene. In contrast, the GPU-only baseline is limited to just 15.3 million Gaussians and yields a lower PSNR of 23.93 dB. By supporting 6.7× larger models, CLM demonstrates that quality continues to improve with scale.

BigCity Scalability Comparison

BibTeX

@inproceedings{zhao2025clm,
  title={CLM: Removing the GPU Memory Barrier for 3D Gaussian Splatting},
  author={Hexu Zhao and Xiwen Min and Xiaoteng Liu and Moonjun Gong and Yiming Li and Ang Li and Saining Xie and Jinyang Li and Aurojit Panda},
  booktitle={Proceedings of the 2026 International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'26)},
  year={2026},
  address={Pittsburgh, PA, USA},
  url={https://arxiv.org/abs/2511.04951}
}