From Captions to Pixels: Open-Set Semantic Segmentation without Masks

Bibliographic Details
Published in: Baltic Journal of Modern Computing, Vol. 12, No. 1, pp. 97-109
Main Authors: Barzdins, Paulis; Pretkalnins, Ingus; Barzdins, Guntis
Format: Journal Article
Language: English
Published: Riga: University of Latvia, 01.01.2024

Summary: This paper presents a novel approach to open-set semantic segmentation in unstructured environments where no meaningful prior mask proposals exist. Our method leverages pre-trained encoders from foundation models and uses image-caption datasets for training, reducing the need for annotated masks and extensive computational resources. We introduce a novel contrastive loss function, named CLIC (Contrastive Loss function on Image-Caption data), which enables training a semantic segmentation model directly on an image-caption dataset. By utilising image-caption datasets, our method provides a practical solution for semantic segmentation in scenarios where large-scale segmented mask datasets are not readily available, as is the case in unstructured environments where full segmentation is infeasible. Our approach is adaptable to evolving foundation models, as the encoders are used as black boxes. The proposed method has been designed with robotics applications in mind, to enhance robot autonomy and decision-making capabilities in real-world scenarios.
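
The record does not spell out the CLIC formulation, so the snippet below is only a minimal, hypothetical sketch of the general idea it describes: a symmetric contrastive loss over image-caption pairs, where features come from frozen (black-box) encoders and only a lightweight projection/segmentation head is trained. The function name, tensor shapes, mean pooling, and temperature value are all assumptions for illustration, not the paper's actual loss.

```python
# Hypothetical sketch of a CLIP-style contrastive loss on image-caption pairs.
# NOT the paper's CLIC loss; names, shapes, and pooling are assumptions.
import torch
import torch.nn.functional as F

def contrastive_image_caption_loss(patch_embeds, caption_embeds, temperature=0.07):
    """
    patch_embeds:   (B, P, D) per-patch features from a frozen image encoder,
                    already projected by the trainable segmentation head.
    caption_embeds: (B, D) caption features from a frozen text encoder.
    Returns a symmetric InfoNCE loss over the batch.
    """
    # Pool patch features into one vector per image (simple mean pooling here).
    image_embeds = patch_embeds.mean(dim=1)

    # L2-normalise both modalities so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    caption_embeds = F.normalize(caption_embeds, dim=-1)

    # (B, B) similarity matrix: image i vs. caption j; matches lie on the diagonal.
    logits = image_embeds @ caption_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Under this sketch, only the projection/segmentation head producing patch_embeds would receive gradients; the foundation-model encoders stay frozen, which matches the abstract's claim that encoders are used as black boxes.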
ISSN: 2255-8942; 2255-8950
DOI: 10.22364/bjmc.2024.12.L06