Cross Language Image Matching for Weakly Supervised Semantic Segmentation
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 05.03.2022 |
Summary: | It is widely known that CAM (Class Activation Map) usually activates only
discriminative object regions and falsely includes many object-related background
regions. Because only a fixed set of image-level object labels is available to the
WSSS (weakly supervised semantic segmentation) model, it can be very difficult to
suppress those diverse background regions consisting of open-set objects. In this
paper, we propose a novel Cross Language Image Matching (CLIMS) framework for WSSS,
built on the recently introduced Contrastive Language-Image Pre-training (CLIP)
model. The core idea of our framework is to introduce natural language supervision
to activate more complete object regions and suppress closely related open
background regions. In particular, we design object, background region, and text
label matching losses to guide the model to excite more reasonable object regions
for the CAM of each category. In addition, we design a co-occurring background
suppression loss, driven by a predefined set of class-related background text
descriptions, to prevent the model from activating closely related background
regions. These designs enable the proposed CLIMS to generate more complete and
compact activation maps for the target objects. Extensive experiments on the
PASCAL VOC2012 dataset show that our CLIMS significantly outperforms previous
state-of-the-art methods. |
DOI: | 10.48550/arxiv.2203.02668 |
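The summary above describes three CLIP-driven losses: object region and text label matching, background region matching, and co-occurring background suppression. The sketch below illustrates how such losses could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the authors' released code: the function name `clims_style_losses`, the sigmoid-based loss form, and the example background prompt are choices made here for clarity.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # avoid fp16/fp32 mismatch when scoring raw tensors


def clims_style_losses(image, cam, class_prompt, bg_prompts):
    """image: (B, 3, 224, 224), CLIP-normalized; cam: (B, 1, 224, 224) in [0, 1]."""
    fg = image * cam            # regions the CAM currently activates
    bg = image * (1.0 - cam)    # everything the CAM misses

    with torch.no_grad():
        txt_obj = clip_model.encode_text(clip.tokenize([class_prompt]).to(device))
        txt_bg = clip_model.encode_text(clip.tokenize(bg_prompts).to(device))

    img_fg = clip_model.encode_image(fg)
    img_bg = clip_model.encode_image(bg)

    def sim(a, b):  # cosine-similarity matrix between two sets of embeddings
        return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()

    # Object region and text label matching: the CAM-masked foreground
    # should match the class prompt.
    l_otm = -torch.log(torch.sigmoid(sim(img_fg, txt_obj))).mean()
    # Background region matching: the complement of the CAM should NOT match
    # the class prompt, pushing the CAM to cover the whole object.
    l_btm = -torch.log(1.0 - torch.sigmoid(sim(img_bg, txt_obj))).mean()
    # Co-occurring background suppression: the foreground should not match
    # class-related background prompts (e.g. "railroad" for "train").
    l_cbs = -torch.log(1.0 - torch.sigmoid(sim(img_fg, txt_bg))).mean()

    return l_otm, l_btm, l_cbs
```

In training, these terms would be weighted and combined with the CAM network's usual image-level classification objective; the exact formulation and loss weights used by the authors are given in the paper at the DOI above.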