InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring

Bibliographic Details
Published in	2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 1771 - 1780
Main Authors	Yuan, Zhihao, Yan, Xu, Liao, Yinghong, Zhang, Ruimao, Wang, Sheng, Li, Zhen, Cui, Shuguang
Format	Conference Proceeding
Language	English
Published	IEEE 01.01.2021
Subjects	Detection and localization in 2D and 3D; Grounding; Location awareness; Point cloud compression; Predictive models; Scene analysis and understanding; Solid modeling; Three-dimensional displays; Vision + language; Visual reasoning and logical representation; Visualization

More Information
Summary:	Compared with visual grounding on 2D images, natural-language-guided 3D object localization on point clouds is more challenging. In this paper, we propose a new model, named InstanceRefer, to achieve superior 3D visual grounding through a grounding-by-matching strategy. In practice, our model first predicts the target category from the language description using a simple language classification model. Then, based on the category, our model selects a small number of instance candidates (usually fewer than 20) from the panoptic segmentation of the point cloud. Thus, the non-trivial 3D visual grounding task is effectively reformulated as a simplified instance-matching problem, considering that instance-level candidates are more rational than redundant 3D object proposals. Subsequently, for each candidate, we perform multi-level contextual inference, i.e., referring from instance attribute perception, instance-to-instance relation perception, and instance-to-background global localization perception, respectively. Eventually, the most relevant candidate is selected and localized by ranking confidence scores, which are obtained by cooperative holistic visual-language feature matching. Experiments confirm that our method outperforms previous state-of-the-art methods on the ScanRefer online benchmark and the Nr3D/Sr3D datasets.
ISSN:	2380-7504
DOI:	10.1109/ICCV48922.2021.00181
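
The grounding-by-matching strategy described in the summary (predict a category, keep only the instances of that category, then rank them by confidence) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Instance` class, the keyword-based `predict_category` stand-in for the language classification model, the three precomputed per-instance scores, and the choice to combine them by a plain sum are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One segmented instance with hypothetical, precomputed matching scores."""
    instance_id: int
    category: str
    attribute_score: float      # instance attribute perception
    relation_score: float       # instance-to-instance relation perception
    localization_score: float   # instance-to-background global localization


def predict_category(description, known_categories):
    """Toy stand-in for the language classifier: first known category word found."""
    for word in description.lower().split():
        token = word.strip(".,")
        if token in known_categories:
            return token
    return None


def ground(description, instances):
    """Filter instances by predicted category, then return the top-ranked one."""
    categories = {inst.category for inst in instances}
    target = predict_category(description, categories)
    candidates = [inst for inst in instances if inst.category == target]
    if not candidates:
        return None
    # Assumption: the overall confidence is the sum of the three scores.
    return max(
        candidates,
        key=lambda c: c.attribute_score + c.relation_score + c.localization_score,
    )


# Hypothetical scene with two chairs and one table.
scene = [
    Instance(0, "chair", 0.2, 0.1, 0.3),
    Instance(1, "chair", 0.5, 0.7, 0.4),
    Instance(2, "table", 0.9, 0.9, 0.9),
]

best = ground("the chair next to the window", scene)
print(best.instance_id if best else None)
```

Note that the table is never considered for the chair query even though its scores are highest; filtering by category before ranking is what reduces the task to matching among a few candidates.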