Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation and Beyond
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 14.08.2023 |
Subjects | |
Summary: Few-shot segmentation (FSS) aims to segment novel classes given only a few annotated images. Because CLIP aligns visual and textual information, integrating CLIP can enhance the generalization ability of an FSS model. However, even with the CLIP model, existing CLIP-based FSS methods are still subject to biased predictions toward base classes, which are caused by class-specific feature-level interactions. To solve this issue, we propose a visual and textual Prior Guided Mask Assemble Network (PGMA-Net). It employs a class-agnostic mask assembly process to alleviate the bias, and formulates diverse tasks in a unified manner by assembling the prior through affinity. Specifically, the class-relevant textual and visual features are first transformed into a class-agnostic prior in the form of a probability map. Then, a Prior-Guided Mask Assemble Module (PGMAM) comprising multiple General Assemble Units (GAUs) is introduced. It considers diverse and plug-and-play interactions, such as visual-textual, inter- and intra-image, training-free, and high-order ones. Lastly, to ensure class-agnostic ability, a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) is proposed to flexibly exploit the assembled masks and low-level features, without relying on any class-specific information. PGMA-Net achieves new state-of-the-art results in the FSS task, with mIoU of $77.6$ on $\text{PASCAL-}5^i$ and $59.4$ on $\text{COCO-}20^i$ in the 1-shot scenario. Beyond this, we show that without extra re-training, the proposed PGMA-Net can also solve bbox-level and cross-domain FSS, co-segmentation, and zero-shot segmentation (ZSS) tasks, yielding an any-shot segmentation framework.
DOI: 10.48550/arxiv.2308.07539
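The abstract names two concrete mechanisms: transforming class-relevant CLIP features into a class-agnostic prior probability map, and assembling masks through affinity. Below is a minimal PyTorch sketch of both ideas as described in the abstract, not the authors' implementation; the tensor names (`feat_q`, `feat_s`, `text_emb`, `mask_s`), the cosine-affinity choice, and the min-max normalization are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of (1) a class-agnostic prior map
# from CLIP text/visual affinity and (2) one training-free assemble step
# that transfers a support mask to the query through pixel affinity.
import torch
import torch.nn.functional as F


def prior_map(feat_q: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Class-agnostic prior: cosine affinity between query pixels and the
    class text embedding, min-max normalized into a probability map.

    feat_q:   (B, C, H, W) query visual features (assumed CLIP-aligned)
    text_emb: (B, C) CLIP text embedding of the class prompt
    returns:  (B, 1, H, W) prior probability map in [0, 1]
    """
    q = F.normalize(feat_q, dim=1)                     # (B, C, H, W)
    t = F.normalize(text_emb, dim=1)[..., None, None]  # (B, C, 1, 1)
    sim = (q * t).sum(dim=1, keepdim=True)             # (B, 1, H, W)
    flat = sim.flatten(2)                              # (B, 1, HW)
    lo = flat.min(dim=2, keepdim=True).values[..., None]
    hi = flat.max(dim=2, keepdim=True).values[..., None]
    return (sim - lo) / (hi - lo + 1e-6)


def assemble_mask(feat_q, feat_s, mask_s):
    """One assemble unit: propagate the support mask to the query through
    pixel-to-pixel affinity (a plain cross-attention readout, no weights).

    feat_q: (B, C, H, W) query features
    feat_s: (B, C, H, W) support features
    mask_s: (B, 1, H, W) binary support mask
    returns: (B, 1, H, W) assembled query mask estimate
    """
    B, C, H, W = feat_q.shape
    q = F.normalize(feat_q.flatten(2).transpose(1, 2), dim=2)  # (B, HW, C)
    s = F.normalize(feat_s.flatten(2), dim=1)                  # (B, C, HW)
    affinity = torch.bmm(q, s)                                 # (B, HW, HW)
    weights = affinity.softmax(dim=2)                          # rows sum to 1
    m = mask_s.flatten(2).transpose(1, 2)                      # (B, HW, 1)
    out = torch.bmm(weights, m)                                # (B, HW, 1)
    return out.transpose(1, 2).reshape(B, 1, H, W)
```

Under this reading, stacking several such units over different affinity sources (text-based priors, inter- and intra-image, high-order) roughly corresponds to what the abstract calls PGMAM; the hierarchical decoder then consumes the assembled maps together with low-level features.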