ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation

Bibliographic Details

Main Authors: Li, Shengze; Cao, Jianjian; Ye, Peng; Ding, Yuhan; Tu, Chongjun; Chen, Tao
Format: Journal Article (arXiv preprint)
Language: English
Published: 23.01.2024
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
DOI: 10.48550/arXiv.2401.12665
Online Access: https://arxiv.org/abs/2401.12665

Abstract

Recently, foundation models such as CLIP and SAM have shown promising performance on the task of Zero-Shot Anomaly Segmentation (ZSAS). However, both CLIP-based and SAM-based ZSAS methods still suffer from non-negligible drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing. In this work, we propose a CLIP and SAM collaboration framework, called ClipSAM, for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, which is then used as a prompt constraint for SAM to refine the segmentation results. In detail, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that interacts language features with visual features at multiple scales of CLIP to reason about anomaly positions. We then design a Multi-level Mask Refinement (MMR) module, which uses this positional information as multi-level prompts for SAM to acquire masks at hierarchical levels and merges them into the final segmentation. Extensive experiments validate the effectiveness of our approach, which achieves the best segmentation performance on the MVTec-AD and VisA datasets.
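
The abstract outlines a two-stage flow: CLIP produces a rough anomaly localization, which is converted into point and box prompts that constrain SAM's mask generation. Below is a minimal sketch of that flow, assuming hypothetical interfaces for the UMCI and MMR stages; the abstract does not specify these interfaces, so the helper names and signatures here are illustrative placeholders, not the paper's actual API.

```python
# Hypothetical sketch of the two-stage ClipSAM flow described in the abstract.
# `clip_rough_anomaly_map` and `sam_refine` stand in for the paper's UMCI and
# MMR modules; their real interfaces are not given in the abstract.

import numpy as np


def clip_rough_anomaly_map(image: np.ndarray, text_prompts: list[str]) -> np.ndarray:
    """Placeholder for the UMCI stage: CLIP interacts text prompts with
    multi-scale visual features and returns a coarse per-pixel anomaly
    score map in [0, 1] with the same spatial size as `image`."""
    raise NotImplementedError  # assumed interface, not the paper's code


def sam_refine(image: np.ndarray, points: list[tuple[int, int]],
               box: tuple[int, int, int, int]) -> np.ndarray:
    """Placeholder for the MMR stage: SAM consumes point and box prompts,
    produces masks at several hierarchy levels, and merges them into one
    binary mask."""
    raise NotImplementedError  # assumed interface, not the paper's code


def clipsam_segment(image: np.ndarray, text_prompts: list[str],
                    threshold: float = 0.5) -> np.ndarray:
    # Stage 1: CLIP-driven rough localization of the anomaly.
    score_map = clip_rough_anomaly_map(image, text_prompts)

    # Derive prompt constraints from the rough map: a handful of foreground
    # point prompts and the bounding box of the thresholded region.
    ys, xs = np.nonzero(score_map > threshold)
    if xs.size == 0:  # nothing anomalous found; return an empty mask
        return np.zeros(score_map.shape, dtype=bool)
    step = max(1, xs.size // 5)  # subsample roughly five point prompts
    points = [(int(x), int(y)) for x, y in zip(xs[::step], ys[::step])]
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # Stage 2: SAM refines the rough localization into a precise mask.
    return sam_refine(image, points, box)
```

The division of labor mirrors the drawbacks the abstract identifies: CLIP alone localizes but segments imprecisely, while SAM alone segments precisely but over-generates masks without prompt constraints, so the rough CLIP output is used only to prompt SAM rather than as the final result.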