Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM

Bibliographic Details
Published in: arXiv.org
Main Authors: Nanwani, Laksh; Gupta, Kumaraditya; Mathur, Aditya; Agrawal, Swayam; Abdul Hafez, A H; K Madhava Krishna
Format: Paper; Journal Article
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 27.04.2024
Summary: Humans excel at forming mental maps of their surroundings, which lets them understand object relationships and navigate from language queries. Our previous work, SI Maps [1], showed that instance-level information and a semantic understanding of the environment significantly improve performance on language-guided tasks. We extend this instance-level approach to 3D while making the pipeline more robust and improving both quantitative and qualitative results. Our method leverages foundation models for object recognition, image segmentation, and feature extraction. We propose a representation that yields a 3D point cloud map with instance-level embeddings, which supply the semantic understanding that natural language commands can query. Quantitatively, the work improves the success rate of language-guided tasks. Qualitatively, we observe that instances are identified more clearly, and that foundation models with language- and image-aligned embeddings identify objects that a closed-set approach could not.
ISSN: 2331-8422
DOI: 10.48550/arxiv.2404.17922
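
The summary describes a 3D point cloud map whose object instances carry embeddings that natural language commands can query. The sketch below illustrates that idea in a minimal form and is not the authors' implementation: the `Instance` structure, the `query_map` function, and the toy three-dimensional vectors are all hypothetical. In a real system the instance and text embeddings would come from a language-image-aligned model, which is assumed here and not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    """One object instance in the map (hypothetical structure)."""
    label: str        # display tag only; the query never uses it
    points: list      # 3D points (x, y, z) belonging to this instance
    embedding: list   # feature vector in a shared language-image space


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(points):
    """Mean position of an instance's points, usable as a navigation goal."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def query_map(instances, text_embedding, top_k=1):
    """Rank instances by similarity to the query embedding."""
    ranked = sorted(
        instances,
        key=lambda inst: cosine(inst.embedding, text_embedding),
        reverse=True,
    )
    return [(inst.label, centroid(inst.points)) for inst in ranked[:top_k]]


# Toy map with made-up embeddings (assumption: real ones come from a model).
toy_map = [
    Instance("chair", [(1.0, 0.0, 0.0), (1.0, 2.0, 0.0)], [0.9, 0.1, 0.0]),
    Instance("sofa", [(4.0, 4.0, 0.0), (6.0, 4.0, 0.0)], [0.1, 0.9, 0.0]),
]

# Stand-in for the embedding of a text query such as "go to the sofa".
result = query_map(toy_map, [0.0, 1.0, 0.1])
```

Because matching happens in embedding space instead of against a fixed label list, a query for an object class that was never enumerated in advance can still be ranked against the map, which is the open-set property the summary emphasizes.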