Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM

Bibliographic Details
Published in	arXiv.org
Main Authors	Nanwani, Laksh; Gupta, Kumaraditya; Mathur, Aditya; Agrawal, Swayam; Abdul Hafez, A H; K Madhava Krishna
Format	Paper; Journal Article
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 27.04.2024
Subjects	Computer Science - Computer Vision and Pattern Recognition; Computer Science - Robotics; Feature extraction; Image segmentation; Language; Natural language processing; Object recognition; Query languages; Semantics; Three dimensional models

More Information
Summary:	Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work SI Maps [1] showed that having instance-level information and a semantic understanding of an environment significantly improves performance on language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline's robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which provide the semantic understanding that natural language commands can query. Quantitatively, the work improves the success rate of language-guided tasks. Qualitatively, we observe that instances are identified more clearly, and that the foundational models and the language- and image-aligned embeddings make it possible to identify objects that a closed-set approach would otherwise fail to identify.
ISSN:	2331-8422
DOI:	10.48550/arxiv.2404.17922
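
Illustrative note: the summary describes a 3D point cloud map with instance-level embeddings that natural language commands can query. The sketch below shows one plausible way such a query could work and is not the authors' implementation. The `Instance` structure, the `query_map` function, the three-dimensional toy vectors, and the cosine-similarity matching rule are all assumptions made for illustration; a real system would obtain the embeddings from a language- and image-aligned model rather than hand-written numbers.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Instance:
    """One object instance in the map (hypothetical structure)."""
    label: str                                  # display name only, not used for matching
    points: list[tuple[float, float, float]]    # 3D points belonging to this instance
    embedding: list[float]                      # feature vector in a shared text-image space


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(points: list[tuple[float, float, float]]) -> tuple[float, ...]:
    """Mean position of an instance's points, usable as a navigation goal."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def query_map(instances: list[Instance], query_embedding: list[float]):
    """Return the instance whose embedding best matches the query, and its centroid."""
    best = max(instances, key=lambda inst: cosine(inst.embedding, query_embedding))
    return best, centroid(best.points)


# Toy map: the embeddings are made-up stand-ins, not outputs of any real model.
toy_map = [
    Instance("chair", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 0.0, 0.0]),
    Instance("sofa", [(4.0, 2.0, 0.0), (6.0, 2.0, 0.0)], [0.0, 1.0, 0.0]),
]

# Stand-in for an encoded text query such as "go to the sofa" (assumed vector).
query = [0.1, 0.9, 0.0]

target, goal = query_map(toy_map, query)
print(target.label, goal)
```

Because matching happens in embedding space rather than against a fixed label list, a query for an object category that was never enumerated in advance can still retrieve the closest instance, which is the open-set behavior the summary refers to.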