GAGS: Granularity-Aware 3D Feature Distillation for Gaussian Splatting

Abstract

3D open-vocabulary scene understanding, which accurately perceives complex semantic properties of objects in space, has gained significant attention in recent years.

In this paper, we propose GAGS, a framework that distills 2D CLIP features into 3D Gaussian splatting, enabling open-vocabulary queries on renderings from arbitrary viewpoints. The main challenge of distilling 2D features into 3D fields lies in the multiview inconsistency of the extracted 2D features, which provides unstable supervision for the 3D feature field.

GAGS addresses this challenge with two novel strategies. First, GAGS associates the prompt point density of SAM with camera distance, which significantly improves the multiview consistency of segmentation results. Second, GAGS decodes a granularity factor to guide the distillation process; this factor is learned in an unsupervised manner so that only multiview-consistent 2D features are selected during distillation. Experimental results on two datasets demonstrate significant performance and stability improvements of GAGS in visual grounding and semantic segmentation, with an inference speed 2× faster than baseline methods.

Pipeline

GAGS pipeline. Given a set of images with camera poses, GAGS first uses 3D Gaussian Splatting to reconstruct the scene's geometric representation, then uses it for granularity-aware segmentation and CLIP feature distillation. The resulting 3D feature field supports open-vocabulary queries.
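To make the last step concrete, an open-vocabulary query on a rendered feature map can be sketched as a per-pixel cosine similarity against a text embedding. This is a minimal illustration, not the GAGS implementation: the function names `relevance_map` and `ground` are hypothetical, and the sketch assumes the rendered features and the text embedding already live in the same CLIP embedding space.

```python
import numpy as np

def relevance_map(feature_map, text_embedding, eps=1e-8):
    """Cosine similarity between every pixel feature and a text embedding.

    feature_map:    (H, W, D) rendered per-pixel features (assumed CLIP-aligned).
    text_embedding: (D,) embedding of the query prompt (assumed precomputed).
    Returns an (H, W) array with values in [-1, 1].
    """
    feats = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + eps)
    text = text_embedding / (np.linalg.norm(text_embedding) + eps)
    return feats @ text

def ground(feature_map, text_embedding):
    """Return the (row, col) of the pixel most relevant to the query."""
    rel = relevance_map(feature_map, text_embedding)
    row, col = np.unravel_index(np.argmax(rel), rel.shape)
    return int(row), int(col)
```

Thresholding the relevance map instead of taking its maximum would give a segmentation mask rather than a single grounded location.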

Granularity-aware segmentation. For each input image, our method calculates the number of prompt points for each patch, converts the local Gaussian density into a discrete probability distribution to guide prompt point sampling, and further directs SAM to generate multi-view consistent masks.
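The sampling step described above can be sketched as follows, assuming the per-patch prompt count has already been computed. The function name `sample_prompt_points`, the uniform fallback for empty patches, and the choice to sample without replacement are all illustrative assumptions rather than details taken from GAGS.

```python
import numpy as np

def sample_prompt_points(density, num_points, rng=None):
    """Sample prompt-point pixel coordinates in proportion to local density.

    density:    (H, W) non-negative density values for one patch
                (e.g. projected Gaussian density; an assumption here).
    num_points: number of prompt points requested for this patch.
    Returns an (k, 2) integer array of (x, y) coordinates, with k <= num_points.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = np.clip(density, 0.0, None).astype(np.float64).ravel()
    total = weights.sum()
    if total <= 0:
        # No density information: fall back to a uniform distribution.
        probs = np.full(weights.size, 1.0 / weights.size)
    else:
        # Normalize the density into a discrete probability distribution.
        probs = weights / total
    # Sampling without replacement cannot return more points than
    # there are pixels with non-zero probability.
    k = min(num_points, int(np.count_nonzero(probs)))
    flat_idx = rng.choice(weights.size, size=k, replace=False, p=probs)
    rows, cols = np.unravel_index(flat_idx, density.shape)
    return np.stack([cols, rows], axis=1)
```

With this scheme, densely reconstructed regions of a patch receive more prompt points, while empty regions receive few or none.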

Granularity-aware distillation. Leveraging the inherent consistency of 3D Gaussian splatting, our method performs granularity-aware feature distillation, enhancing the stability and accuracy of learned object features.
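One way to picture a granularity-guided distillation objective is a loss in which each pixel has several candidate 2D features, one per segmentation granularity, and a learned factor decides how much each candidate contributes. The sketch below is an assumption-heavy illustration, not the loss used by GAGS: the number of granularity levels K, the softmax weighting, the cosine distance, and the name `granularity_weighted_loss` are all hypothetical. It computes the forward value only, in NumPy.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def granularity_weighted_loss(rendered, targets, granularity_logits, eps=1e-8):
    """Cosine-distance loss weighted by a per-pixel granularity factor.

    rendered:           (N, D) features rendered from the 3D field.
    targets:            (N, K, D) candidate 2D features, one per granularity
                        level (K levels is an assumption of this sketch).
    granularity_logits: (N, K) unnormalized per-pixel granularity scores.
    Returns the mean weighted cosine distance as a Python float.
    """
    r = rendered / (np.linalg.norm(rendered, axis=-1, keepdims=True) + eps)
    t = targets / (np.linalg.norm(targets, axis=-1, keepdims=True) + eps)
    cos = np.einsum("nd,nkd->nk", r, t)       # similarity to each candidate
    weights = softmax(granularity_logits)      # soft selection over levels
    per_pixel = (weights * (1.0 - cos)).sum(axis=-1)
    return float(per_pixel.mean())
```

Under such a formulation, a granularity level whose 2D features disagree across views keeps a high cosine distance to the shared 3D feature, so minimizing the loss pushes the weights toward the levels that are consistent.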

Visualization

Relevance visualization result. GAGS encourages the distillation of multi-view consistent CLIP features into the Gaussian field, effectively reducing noise from conflicting views and significantly improving localization and segmentation tasks.

Ablation result. The proposed Granularity-aware Segmentation (GaS) can mitigate noise issues from over-segmentation of nearby objects and low-texture areas, as well as under-segmentation of distant objects, while Granularity-aware Distillation (GaD) significantly improves segmentation accuracy by learning multi-view consistent features at the appropriate granularity for each object.

BibTeX

@article{peng2024gags,
      title={GAGS: Granularity-Aware 3D Feature Distillation for Gaussian Splatting},
      author={Peng, Yuning and Wang, Haiping and Liu, Yuan and Wen, Chenglu and Dong, Zhen and Yang, Bisheng},
      journal={arXiv preprint arXiv:2412.13654},
      year={2024}
}


GAGS learns a 3D Gaussian field associated with semantic features, which enables accurate open-vocabulary 3D visual grounding in the scene.

"sun umbrella"

"wooden table"

"flowerpot with dried flowers"

"Settlers Catan" (name of book)

"purple notebook"

"pink flower on the tree"

"LEGO base"

"LEGO bonsai"

"bag of fusilli"

"knife"

"Pringles"

"wooden bowl"
