A Bio-Inspired Visual Perception Transformer for Cross-Domain Semantic Segmentation of High-Resolution Remote Sensing Images

Bibliographic Details
Published in: Remote Sensing (Basel, Switzerland), Vol. 16, No. 9, p. 1514
Main Authors: Wang, Xinyao; Wang, Haitao; Jing, Yuqian; Yang, Xianming; Chu, Jianbo
Format: Journal Article
Language: English
Published: Basel: MDPI AG, 01.05.2024

Summary: Pixel-level classification of very-high-resolution images is a crucial yet challenging task in remote sensing. While transformers have demonstrated effectiveness in capturing dependencies, their tendency to partition images into patches may restrict their applicability to highly detailed remote sensing images. To extract latent contextual semantic information from high-resolution remote sensing images, we propose a gaze–saccade transformer (GSV-Trans) with visual perceptual attention. GSV-Trans incorporates a visual perceptual attention (VPA) mechanism that dynamically allocates computational resources based on the semantic complexity of the image. The VPA mechanism includes both gaze attention and eye-movement attention, enabling the model to focus on the most critical parts of the image and acquire competitive semantic information. Additionally, to capture contextual semantic information across different levels of the image, we designed an inter-layer short-term visual memory module with bidirectional affinity propagation to guide attention allocation. Furthermore, we introduced a dual-branch pseudo-label module (DBPL) that imposes pixel-level and category-level semantic constraints on both the gaze and saccade branches. DBPL encourages the model to extract domain-invariant features and align semantic information across different domains in the feature space. Extensive experiments on multiple pixel-level classification benchmarks confirm the effectiveness and superiority of our method over the state of the art.
ISSN: 2072-4292
DOI: 10.3390/rs16091514
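
The summary describes a dual-branch (gaze/saccade) visual perceptual attention mechanism that allocates computation according to the semantic complexity of image regions. The paper's own implementation is not part of this record; the following is a minimal, hypothetical PyTorch sketch of that general idea. All names here (GazeSaccadeBlock, gaze_ratio, the learned complexity score) are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a two-branch "gaze"/"saccade" attention block (not the paper's code).
import torch
import torch.nn as nn


class GazeSaccadeBlock(nn.Module):
    """Toy dual-branch attention: a 'gaze' branch attends densely within a small set of
    high-complexity tokens, while a 'saccade' branch lets all tokens cheaply attend to them."""

    def __init__(self, dim: int, num_heads: int = 4, gaze_ratio: float = 0.25):
        super().__init__()
        self.gaze_ratio = gaze_ratio
        # Per-token "semantic complexity" score used to pick tokens for the gaze branch.
        self.score = nn.Linear(dim, 1)
        self.gaze_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.saccade_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. flattened patch embeddings.
        b, n, d = tokens.shape
        k = max(1, int(n * self.gaze_ratio))

        # Rank tokens by the learned complexity score and gather the top-k.
        scores = self.score(tokens).squeeze(-1)               # (b, n)
        top_idx = scores.topk(k, dim=1).indices               # (b, k)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, d)  # (b, k, d)
        gaze_tokens = torch.gather(tokens, 1, gather_idx)     # (b, k, d)

        # Gaze branch: dense self-attention restricted to the selected tokens.
        gaze_out, _ = self.gaze_attn(gaze_tokens, gaze_tokens, gaze_tokens)

        # Saccade branch: every token cross-attends to the selected tokens.
        saccade_out, _ = self.saccade_attn(tokens, gaze_tokens, gaze_tokens)

        # Scatter the refined gaze tokens back and fuse with the saccade output.
        fused = tokens.scatter(1, gather_idx, gaze_out) + saccade_out
        return self.norm(fused)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)   # 2 images, 14x14 patches, 64-dim embeddings
    block = GazeSaccadeBlock(dim=64)
    print(block(x).shape)         # torch.Size([2, 196, 64])
```

In this toy version, dense attention is spent only on the top-scoring tokens while the rest of the image reaches them through a single cross-attention pass, which is one simple way to make the cost track image complexity rather than resolution; the actual GSV-Trans attention allocation, memory module, and pseudo-label constraints are described in the article itself.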