A Simple Framework for Text-Supervised Semantic Segmentation

Bibliographic Details
Published in	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 7071-7080
Main Authors	Yi, Muyang; Cui, Quan; Wu, Hao; Yang, Cheng; Yoshie, Osamu; Lu, Hongtao
Format	Conference Proceeding
Language	English Japanese
Published	IEEE 01.06.2023
Subjects	Codes; Computer vision; Location awareness; Network architecture; Semantic segmentation; Semantics; Vision, language, and reasoning; Visualization

More Information
Summary:	Text-supervised semantic segmentation is a novel research topic that allows semantic segments to emerge with image-text contrasting. However, pioneering methods could be subject to specifically designed network architectures. This paper shows that a vanilla contrastive language-image pretraining (CLIP) model is an effective text-supervised semantic segmentor by itself. First, we reveal that a vanilla CLIP is inferior to localization and segmentation due to its optimization being driven by densely aligning visual and language representations. Second, we propose the locality-driven alignment (LoDA) to address the problem, where CLIP optimization is driven by sparsely aligning local representations. Third, we propose a simple segmentation (SimSeg) framework. LoDA and SimSeg jointly ameliorate a vanilla CLIP to produce impressive semantic segmentation results. Our method outperforms previous state-of-the-art methods on PASCAL VOC 2012, PASCAL Context and COCO datasets by large margins. Code and models are available at github.com/muyangyi/SimSeg.
ISSN:	1063-6919
DOI:	10.1109/CVPR52729.2023.00683