ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation
Field | Value
---|---
Main Authors | Li, Shengze; Cao, Jianjian; Ye, Peng; Ding, Yuhan; Tu, Chongjun; Chen, Tao
Format | Journal Article (arXiv preprint)
Language | English
Published | 23.01.2024
Subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access | https://arxiv.org/abs/2401.12665
DOI | 10.48550/arxiv.2401.12665
Rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0
Abstract: Recently, foundation models such as CLIP and SAM have shown promising performance on the task of Zero-Shot Anomaly Segmentation (ZSAS). However, both CLIP-based and SAM-based ZSAS methods still suffer from non-negligible drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing requirements. In this work, we propose a CLIP and SAM collaboration framework, called ClipSAM, for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, which is then used as a prompt constraint for SAM to refine the anomaly segmentation results. Specifically, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that interacts language features with visual features at multiple scales of CLIP to reason about anomaly positions. We then design a Multi-level Mask Refinement (MMR) module, which uses this positional information as multi-level prompts for SAM to acquire hierarchical masks and merges them. Extensive experiments validate the effectiveness of our approach, which achieves optimal segmentation performance on the MVTec-AD and VisA datasets.
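The collaboration the abstract describes can be made concrete with a short sketch. The following is a minimal, hypothetical illustration rather than the authors' implementation: the UMCI module is approximated by a crude sliding-window CLIP similarity map over two contrasting text prompts, and the MMR module by a simple union of SAM's multi-level output masks. The text prompts, window size, threshold, and checkpoint path are all placeholders; only the openai/CLIP and facebookresearch/segment-anything public APIs are used.

```python
# Minimal sketch of the CLIP -> SAM collaboration described in the abstract.
# NOT the paper's implementation: UMCI is replaced by a crude sliding-window
# CLIP similarity map, and MMR by a union of SAM's multi-level masks.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical text prompts contrasting normal vs. anomalous appearance.
text = clip.tokenize(["a photo of a flawless object",
                      "a photo of a damaged object with a defect"]).to(device)

def rough_anomaly_map(pil_img, win=64, stride=32):
    """Stand-in for UMCI: score sliding-window crops against the two text
    prompts and accumulate the 'anomalous' probability per pixel."""
    w, h = pil_img.size
    heat = np.zeros((h, w), dtype=np.float32)
    cnt = np.zeros((h, w), dtype=np.float32)
    with torch.no_grad():
        tfeat = clip_model.encode_text(text)
        tfeat = tfeat / tfeat.norm(dim=-1, keepdim=True)
        for y in range(0, max(h - win, 1), stride):
            for x in range(0, max(w - win, 1), stride):
                crop = preprocess(pil_img.crop((x, y, x + win, y + win)))
                ifeat = clip_model.encode_image(crop.unsqueeze(0).to(device))
                ifeat = ifeat / ifeat.norm(dim=-1, keepdim=True)
                # Probability mass on the "anomalous" prompt for this window.
                p = (100.0 * ifeat @ tfeat.T).softmax(dim=-1)[0, 1].item()
                heat[y:y + win, x:x + win] += p
                cnt[y:y + win, x:x + win] += 1
    return heat / np.maximum(cnt, 1)

def prompts_from_map(heat, thresh=0.5):
    """Turn the rough map into SAM prompts: the peak as a point prompt and
    the bounding box of the thresholded region as a box prompt."""
    ys, xs = np.where(heat >= thresh)
    if len(xs) == 0:  # no region scored as anomalous
        return None, None
    peak = np.unravel_index(np.argmax(heat), heat.shape)
    point = np.array([[peak[1], peak[0]]])  # SAM expects (x, y) order
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    return point, box

pil_img = Image.open("sample.png").convert("RGB")  # placeholder image path
heat = rough_anomaly_map(pil_img)
point, box = prompts_from_map(heat)

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam.to(device))
predictor.set_image(np.array(pil_img))

if point is not None:
    # multimask_output=True yields three hierarchical masks; taking their
    # union is a simple stand-in for the paper's MMR merging step.
    masks, scores, _ = predictor.predict(point_coords=point,
                                         point_labels=np.array([1]),
                                         box=box,
                                         multimask_output=True)
    anomaly_mask = np.any(masks, axis=0)
```

In the paper itself, UMCI reasons over language and visual features at multiple internal scales of CLIP rather than over image crops, and MMR issues the positional information as hierarchical point- and box-level prompts before merging; but the control flow sketched here, rough CLIP localization constraining SAM's prompts, is the same.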