Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation and Beyond

Few-shot segmentation (FSS) aims to segment the novel class with a few annotated images. Due to CLIP's advantages of aligning visual and textual information, the integration of CLIP can enhance the generalization ability of FSS model. However, even with the CLIP model, the existing CLIP-based F...

Full description

Saved in:
Bibliographic Details
Published inIEEE transactions on multimedia Vol. 26; pp. 7197 - 7209
Main Authors Chen, Shuai, Meng, Fanman, Zhang, Runtong, Qiu, Heqian, Li, Hongliang, Wu, Qingbo, Xu, Linfeng
Format Journal Article
LanguageEnglish
Published Piscataway IEEE 2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Few-shot segmentation (FSS) aims to segment the novel class with a few annotated images. Due to CLIP's advantages of aligning visual and textual information, the integration of CLIP can enhance the generalization ability of FSS model. However, even with the CLIP model, the existing CLIP-based FSS methods are still subject to the biased prediction towards base class, which is caused by the class-specific feature level interactions. To solve this issue, we propose a visual and textual Prior Guided Mask Assemble Network (PGMA-Net). It employs a class-agnostic mask assembly process to alleviate the bias, and formulates diverse tasks into a unified manner by assembling the prior through affinity. Specifically, the class-relevant textual and visual features are first transformed to class-agnostic prior in the form of probability map. Then, a Prior-Guided Mask Assemble Module (PGMAM) including multiple General Assemble Units (GAUs) is introduced. It considers diverse and plug-and-play interactions, such as visual-textual, inter- and intra-image, training-free, and high-order ones. Lastly, to ensure the class-agnostic ability, a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) is proposed to flexibly exploit the assembled masks and low-level features, without relying on any class-specific information. It achieves new state-of-the-art results in the FSS task, with mIoU of 77.6 on <inline-formula><tex-math notation="LaTeX">\text{PASCAL-}5^{i}</tex-math></inline-formula> and 59.4 on <inline-formula><tex-math notation="LaTeX">\text{COCO-}20^{i}</tex-math></inline-formula> in 1-shot scenario. Beyond this, we show that without extra re-training, the proposed PGMA-Net can solve bbox-level and cross-domain FSS, co-segmentation, zero-shot segmentation (ZSS) tasks, leading an any-shot segmentation framework capable of accommodating diverse weak or pixel annotations.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:1520-9210
1941-0077
DOI:10.1109/TMM.2024.3361181