Adapting Segment Anything Model for Change Detection in VHR Remote Sensing Images

Bibliographic Details
Published in	IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, p. 1
Main Authors	Ding, Lei; Zhu, Kun; Peng, Daifeng; Tang, Hao; Yang, Kuiwu; Bruzzone, Lorenzo
Format	Journal Article
Language	English
Published	New York: IEEE, 01.01.2024; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Adaptation models; Change Detection; Computational modeling; Convolutional Neural Network; Detection; Feature extraction; Image segmentation; Information processing; Learning; Remote Sensing; Representations; Segment Anything Model; Segments; Semantics; Task analysis; Vision Foundation Models; Visualization
Summary:	Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) enable zero-shot or interactive segmentation of visual content and have therefore been quickly adopted across a variety of visual scenes. However, their direct use in many Remote Sensing (RS) applications is often unsatisfactory because of the special imaging properties of RS images. In this work, we aim to exploit the strong visual recognition capabilities of VFMs to improve change detection (CD) in very high-resolution (VHR) remote sensing images (RSIs). We employ the visual encoder of FastSAM, a variant of SAM, to extract visual representations in RS scenes. To adapt FastSAM to focus on specific ground objects in RS scenes, we propose a convolutional adaptor that aggregates task-oriented change information. Moreover, to exploit the semantic representations inherent in SAM features, we introduce a task-agnostic semantic learning branch that models the semantic latent in bi-temporal RSIs. The resulting method, SAM-CD, obtains superior accuracy compared with state-of-the-art (SOTA) fully supervised CD methods and exhibits a sample-efficient learning ability comparable to that of semi-supervised CD methods. To the best of our knowledge, this is the first work that adapts VFMs to CD in VHR RS images.
ISSN:	0196-2892, 1558-0644
DOI:	10.1109/TGRS.2024.3368168