Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 01.08.2023 |
Summary: | Open-world instance-level scene understanding aims to locate and recognize
unseen object categories that are not present in the annotated dataset. This
task is challenging because the model needs to both localize novel 3D objects
and infer their semantic categories. A key factor for the recent progress in 2D
open-world perception is the availability of large-scale image-text pairs from
the Internet, which cover a wide range of vocabulary concepts. However, this
success is hard to replicate in 3D scenarios due to the scarcity of 3D-text
pairs. To address this challenge, we propose to harness pre-trained
vision-language (VL) foundation models that encode extensive knowledge from
image-text pairs to generate captions for multi-view images of 3D scenes. This
allows us to establish explicit associations between 3D shapes and
semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic
representation learning from captions for object-level categorization, we
design hierarchical point-caption association methods to learn semantic-aware
embeddings that exploit the 3D geometry between 3D points and multi-view
images. In addition, to tackle the localization challenge for novel classes in
the open-world setting, we develop debiased instance localization, which
involves training object grouping modules on unlabeled data using
instance-level pseudo supervision. This significantly improves the
generalization capabilities of instance grouping and thus the ability to
accurately locate novel objects. We conduct extensive experiments on 3D
semantic, instance, and panoptic segmentation tasks, covering indoor and
outdoor scenes across three datasets. Our method outperforms baseline methods
by a significant margin in semantic segmentation (e.g., 34.5%–65.3%),
instance segmentation (e.g., 21.8%–54.0%), and panoptic segmentation (e.g.,
14.7%–43.3%). Code will be available. |
---|---|
DOI: | 10.48550/arxiv.2308.00353 |
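The point-caption association described in the abstract pairs each generated caption with the set of 3D points that project into the corresponding view, then aligns pooled point features with the caption's text embedding. The sketch below illustrates that view-level association under assumed inputs; the function names (`project_points`, `caption_contrastive_loss`), the mean-pooling choice, and the InfoNCE-style loss are illustrative assumptions, not the paper's released code.

```python
import numpy as np
import torch
import torch.nn.functional as F

def project_points(points_world, K, world_to_cam, image_hw):
    """Return indices of 3D points that project inside the image with
    positive depth. K: (3, 3) intrinsics; world_to_cam: (4, 4) extrinsics.
    This is standard pinhole projection, used here to decide which points
    a view's caption should describe."""
    h, w = image_hw
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (world_to_cam @ pts_h.T).T[:, :3]            # points in camera frame
    uvw = (K @ cam.T).T                                # perspective projection
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    visible = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return np.nonzero(visible)[0]

def caption_contrastive_loss(point_feats, point_sets, caption_embs, tau=0.07):
    """Align pooled point features with caption embeddings.
    point_feats: (N, D) per-point features from a 3D backbone;
    point_sets: list of index tensors, one per captioned view;
    caption_embs: (V, D) text-encoder embeddings of the view captions."""
    pooled = torch.stack([point_feats[idx].mean(dim=0) for idx in point_sets])
    pooled = F.normalize(pooled, dim=-1)
    captions = F.normalize(caption_embs, dim=-1)
    logits = pooled @ captions.t() / tau               # (V, V) similarity matrix
    targets = torch.arange(len(point_sets), device=logits.device)
    return F.cross_entropy(logits, targets)
```

The hierarchy in the paper's "hierarchical" association presumably varies the granularity of the point set (whole scene, single view, or finer regions); the sketch shows only the single-view level.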
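The abstract describes debiased instance localization as training object grouping modules on unlabeled data with instance-level pseudo supervision, but does not spell out how the pseudo labels are formed. One plausible mechanism is to cluster points shifted by predicted center offsets; the sketch below assumes hypothetical network outputs (`pred_offsets`, `fg_mask`) and uses DBSCAN as the grouping step, which may differ from the paper's actual module.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def pseudo_instance_labels(coords, pred_offsets, fg_mask, eps=0.05, min_pts=30):
    """Group unlabeled foreground points into pseudo instances by clustering
    center-shifted coordinates. coords: (N, 3) point positions;
    pred_offsets: (N, 3) predicted offsets toward instance centers;
    fg_mask: (N,) boolean foreground mask. Returns per-point instance IDs,
    with -1 marking background or noise. All inputs are hypothetical."""
    labels = np.full(len(coords), -1, dtype=np.int64)
    shifted = coords[fg_mask] + pred_offsets[fg_mask]  # shift toward predicted centers
    if shifted.shape[0] >= min_pts:
        labels[fg_mask] = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(shifted)
    return labels
```

Pseudo masks produced this way could supervise the grouping module on scenes without instance annotations, which is the generalization effect the abstract credits for more accurate localization of novel objects.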