ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation
Field | Value
---|---
Main Authors | Li, Shengze; Cao, Jianjian; Ye, Peng; Ding, Yuhan; Tu, Chongjun; Chen, Tao
Format | Journal Article (arXiv preprint)
Language | English
Published | 23.01.2024
Subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access | https://arxiv.org/abs/2401.12665
DOI | 10.48550/arxiv.2401.12665
Rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0
Abstract: Recently, foundation models such as CLIP and SAM have shown promising performance on the task of Zero-Shot Anomaly Segmentation (ZSAS). However, both CLIP-based and SAM-based ZSAS methods still suffer from non-negligible drawbacks: 1) CLIP primarily focuses on global feature alignment across different inputs, leading to imprecise segmentation of local anomalous parts; 2) SAM tends to generate numerous redundant masks without proper prompt constraints, resulting in complex post-processing requirements. In this work, we propose a CLIP and SAM collaboration framework, called ClipSAM, for ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation, which is then used as a prompt constraint for SAM to refine the anomaly segmentation results. Specifically, we introduce a Unified Multi-scale Cross-modal Interaction (UMCI) module that interacts language features with visual features at multiple scales of CLIP to reason about anomaly positions. We then design a Multi-level Mask Refinement (MMR) module, which uses this positional information as multi-level prompts for SAM to acquire hierarchical masks and merges them. Extensive experiments validate the effectiveness of our approach, which achieves optimal segmentation performance on the MVTec-AD and VisA datasets.
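The collaboration the abstract describes can be made concrete with a short sketch. The following is a minimal, hypothetical illustration rather than the authors' implementation: the UMCI module is approximated by a crude sliding-window CLIP similarity map over two contrasting text prompts, and the MMR module by a simple union of SAM's multi-level output masks. The text prompts, window size, threshold, and checkpoint path are all placeholders; only the openai/CLIP and facebookresearch/segment-anything public APIs are used.

```python
# Minimal sketch of the CLIP -> SAM collaboration described in the abstract.
# NOT the paper's implementation: UMCI is replaced by a crude sliding-window
# CLIP similarity map, and MMR by a union of SAM's multi-level masks.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical text prompts contrasting normal vs. anomalous appearance.
text = clip.tokenize(["a photo of a flawless object",
                      "a photo of a damaged object with a defect"]).to(device)

def rough_anomaly_map(pil_img, win=64, stride=32):
    """Stand-in for UMCI: score sliding-window crops against the two text
    prompts and accumulate the 'anomalous' probability per pixel."""
    w, h = pil_img.size
    heat = np.zeros((h, w), dtype=np.float32)
    cnt = np.zeros((h, w), dtype=np.float32)
    with torch.no_grad():
        tfeat = clip_model.encode_text(text)
        tfeat = tfeat / tfeat.norm(dim=-1, keepdim=True)
        for y in range(0, max(h - win, 1), stride):
            for x in range(0, max(w - win, 1), stride):
                crop = preprocess(pil_img.crop((x, y, x + win, y + win)))
                ifeat = clip_model.encode_image(crop.unsqueeze(0).to(device))
                ifeat = ifeat / ifeat.norm(dim=-1, keepdim=True)
                # Probability mass on the "anomalous" prompt for this window.
                p = (100.0 * ifeat @ tfeat.T).softmax(dim=-1)[0, 1].item()
                heat[y:y + win, x:x + win] += p
                cnt[y:y + win, x:x + win] += 1
    return heat / np.maximum(cnt, 1)

def prompts_from_map(heat, thresh=0.5):
    """Turn the rough map into SAM prompts: the peak as a point prompt and
    the bounding box of the thresholded region as a box prompt."""
    ys, xs = np.where(heat >= thresh)
    if len(xs) == 0:  # no region scored as anomalous
        return None, None
    peak = np.unravel_index(np.argmax(heat), heat.shape)
    point = np.array([[peak[1], peak[0]]])  # SAM expects (x, y) order
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    return point, box

pil_img = Image.open("sample.png").convert("RGB")  # placeholder image path
heat = rough_anomaly_map(pil_img)
point, box = prompts_from_map(heat)

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam.to(device))
predictor.set_image(np.array(pil_img))

if point is not None:
    # multimask_output=True yields three hierarchical masks; taking their
    # union is a simple stand-in for the paper's MMR merging step.
    masks, scores, _ = predictor.predict(point_coords=point,
                                         point_labels=np.array([1]),
                                         box=box,
                                         multimask_output=True)
    anomaly_mask = np.any(masks, axis=0)
```

In the paper itself, UMCI reasons over language and visual features at multiple internal scales of CLIP rather than over image crops, and MMR issues the positional information as hierarchical point- and box-level prompts before merging; but the control flow sketched here, rough CLIP localization constraining SAM's prompts, is the same.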