MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding

Bibliographic Details
Published in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14131-14140
Main Authors: Chang, Chun-Peng; Wang, Shaoxiang; Pagani, Alain; Stricker, Didier
Format: Conference Proceeding
Language: English
Published: IEEE, 16.06.2024

Summary: 3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle to interpret complex linguistic queries, particularly descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets, excelling by a large margin in categories that require viewpoint-dependent descriptions. The source code and additional resources for this project are available on GitHub: https://github.com/dfki-av/MiKASA-3DVG
ISSN: 2575-7075
DOI: 10.1109/CVPR52733.2024.01340
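The summary above mentions a self-attention-based scene-aware object encoder, in which each object's representation is refined by attending to the other objects in the same scene. The following PyTorch sketch is a rough illustration of that idea only; it is not the authors' implementation (which is available at the GitHub link above), and the class name, feature dimensions, layer counts, and masking convention are all assumptions.

```python
# Hypothetical sketch of a self-attention-based scene-aware object encoder.
# Each object feature attends to every other object in the scene, so the
# resulting embeddings carry scene-level context. Not the MiKASA code.
import torch
import torch.nn as nn


class SceneAwareObjectEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim,
            nhead=num_heads,
            dim_feedforward=4 * feat_dim,
            batch_first=True,
        )
        self.scene_attention = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, object_feats: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # object_feats: (batch, num_objects, feat_dim) per-object features
        # pad_mask:     (batch, num_objects), True where a slot is padding
        return self.scene_attention(object_feats, src_key_padding_mask=pad_mask)


if __name__ == "__main__":
    encoder = SceneAwareObjectEncoder()
    feats = torch.randn(2, 32, 256)             # 2 scenes, 32 object proposals each
    pad = torch.zeros(2, 32, dtype=torch.bool)  # no padded slots in this toy example
    print(encoder(feats, pad).shape)            # torch.Size([2, 32, 256])
```

In this sketch, cross-object self-attention is what makes the encoder "scene-aware": an object such as a chair is embedded differently depending on the tables, walls, and other chairs around it, which is the kind of context the abstract credits for improved object recognition.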