3D open-vocabulary scene understanding, which aims to accurately perceive the complex semantic properties of objects in space, has gained significant attention in recent years.
In this paper, we propose GAGS, a framework that distills 2D CLIP features into 3D Gaussian splatting, enabling open-vocabulary queries on renderings from arbitrary viewpoints. The main challenge of distilling 2D features into a 3D field lies in the multiview inconsistency of the extracted 2D features, which provides unstable supervision for the 3D feature field.
GAGS addresses this challenge with two novel strategies. First, GAGS associates the prompt point density of SAM with the camera distance, which significantly improves the multiview consistency of segmentation results. Second, GAGS decodes a granularity factor to guide the distillation process; this granularity factor can be learned in an unsupervised manner to select only the multiview-consistent 2D features during distillation. Experimental results on two datasets demonstrate significant performance and stability improvements of GAGS in visual grounding and semantic segmentation, with an inference speed 2× faster than baseline methods.
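As a rough sketch of the second strategy, the loss below weights per-pixel CLIP supervision by a learned soft selection over segmentation granularities, so the distillation favors the granularity whose 2D features are multiview consistent. All names, shapes, and the exact loss form are our own assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def granularity_aware_distillation_loss(rendered_feat, clip_feats, granularity_logits):
    """Distill CLIP features at a learned, per-pixel granularity.

    rendered_feat:      (P, D) features rendered from the 3D Gaussian field.
    clip_feats:         (G, P, D) CLIP features extracted at G segmentation
                        granularities for the same P pixels.
    granularity_logits: (P, G) learned logits that softly select a
                        granularity for each pixel (trained unsupervised).
    """
    w = torch.softmax(granularity_logits, dim=-1)           # (P, G)
    # Cosine similarity between the rendered feature and each granularity's target.
    sim = F.cosine_similarity(
        rendered_feat.unsqueeze(0), clip_feats, dim=-1      # (G, P)
    )
    # Maximize similarity under the soft granularity selection.
    return (1.0 - (w.t() * sim).sum(dim=0)).mean()
```

Because the weights are learned jointly with the field, granularities whose 2D features conflict across views yield noisy gradients and receive lower weight, which matches the paper's goal of filtering inconsistent supervision.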
GAGS pipeline. Given a set of images with camera poses, GAGS first uses 3D Gaussian Splatting to reconstruct the scene's geometric representation, then utilizes it for granularity-aware segmentation and CLIP feature distillation. The resulting 3D feature field supports open-vocabulary queries.
Granularity-aware segmentation. For each input image, our method calculates the number of prompt points for each patch, converts the local Gaussian density into a discrete probability distribution to guide prompt point sampling, and further directs SAM to generate multi-view consistent masks.
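The sampling step in this caption can be sketched as follows: the local Gaussian density projected into one view is normalized into a discrete probability distribution, and SAM prompt points are drawn from it so denser (typically nearer) regions receive more prompts. Function names and the patch-free formulation are our own simplifying assumptions.

```python
import numpy as np

def sample_prompt_points(density_map, n_points, rng=None):
    """Sample SAM prompt points proportionally to local Gaussian density.

    density_map: (H, W) non-negative projected Gaussian density for one view.
    n_points:    total number of prompt points for this view.
    Returns an (n_points, 2) array of (x, y) pixel coordinates.
    """
    rng = rng or np.random.default_rng()
    h, w = density_map.shape
    p = density_map.ravel().astype(np.float64)
    p /= p.sum()                          # discrete probability distribution
    idx = rng.choice(h * w, size=n_points, p=p)
    ys, xs = np.divmod(idx, w)            # flat index -> (row, col)
    return np.stack([xs, ys], axis=1)
```

In this sketch, a close-up view concentrates high density on the nearby object and thus gets a fine prompt grid there, while a distant view spreads fewer prompts over the same object, which is the intended distance-dependent behavior.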
Granularity-aware distillation. Leveraging the inherent multi-view consistency of 3D Gaussian splatting, our method performs granularity-aware feature distillation, enhancing the stability and accuracy of the learned object features.
Relevance visualization result. GAGS encourages the distillation of multi-view consistent CLIP features into the Gaussian field, effectively reducing noise from conflicting views and significantly improving localization and segmentation tasks.
Ablation result. The proposed Granularity-aware Segmentation (GaS) can mitigate noise issues from over-segmentation of nearby objects and low-texture areas, as well as under-segmentation of distant objects, while Granularity-aware Distillation (GaD) significantly improves segmentation accuracy by learning multi-view consistent features at the appropriate granularity for each object.
This research was inspired by several outstanding works.
LangSplat is the first to integrate multi-level language features into 3D Gaussian representations, advancing multi-scale language understanding.
Feature 3DGS and gsplat developed accessible 3D Gaussian rendering frameworks, significantly simplifying the representation and rendering of 3D language features in scenes.
@article{peng2024gags,
title={GAGS: Granularity-Aware 3D Feature Distillation for Gaussian Splatting},
author={Peng, Yuning and Wang, Haiping and Liu, Yuan and Wen, Chenglu and Dong, Zhen and Yang, Bisheng},
journal={arXiv preprint arXiv:2412.13654},
year={2024}
}