PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, pp. 1-15
Main Authors: Li, Jiaojiao; Tian, Penghao; Song, Rui; Xu, Haitao; Li, Yunsong; Du, Qian
Format: Journal Article
Language: English
Published: New York, The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Summary: Remote-sensing object detection (RSOD) is a fundamental and valuable task in Earth monitoring. However, remote-sensing images (RSIs) are typically acquired from a bird's-eye perspective, which gives them intrinsic properties such as complex backgrounds, randomly and densely distributed objects, and objects at multiple scales. These properties hinder the direct application of detection methods that perform well on natural images (NIs) to RSIs and limit the performance they can reach. To address this, we propose a pyramid convolutional vision transformer (PCViT) that overcomes the limitations of existing transformer methods. First, we employ a pyramid architecture to capture the multiscale information present in RSIs. To enhance the feature extraction capabilities of the transformer, we introduce a parallel convolution module (PCM) that supplies local information the transformer may miss. Furthermore, we propose a self-supervised pretraining strategy called multiperspective pretraining (MPP) to pretrain the model, which is subsequently finetuned on the downstream detection task. During the finetuning stage, we introduce a local/global k-NN attention (LGKA) to improve how relationships between tokens are established. In the neck, we propose a feature-reflowing pyramid network (FRPN) to facilitate the interaction of contextual information and further enhance PCViT's ability to process multiscale information. Experimental results on two representative datasets, NWPU VHR-10 and DIOR, demonstrate the effectiveness of PCViT and its suitability for RSOD tasks.
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3360456
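
The summary mentions a k-NN attention without describing how it works. As a rough illustration of the general idea only, the sketch below shows a generic single-head k-NN attention in which each query token attends to just its k highest-scoring keys rather than to all of them. It is not the paper's LGKA: the local/global split is not described in this record and is not modeled here, and the function name `knn_attention` and its parameters are hypothetical.

```python
import math


def knn_attention(queries, keys, values, k):
    """Generic k-NN attention sketch (hypothetical helper, not the paper's LGKA).

    Each query attends only to the k keys with the highest scaled
    dot-product score; all other keys get zero weight.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        # Indices of the k largest scores; the remaining keys are dropped.
        top = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax over the selected scores only (max-subtracted for stability).
        peak = max(scores[j] for j in top)
        weights = {j: math.exp(scores[j] - peak) for j in top}
        total = sum(weights.values())
        # Weighted sum of the selected value vectors.
        out = [0.0] * len(values[0])
        for j, w in weights.items():
            for d, v in enumerate(values[j]):
                out[d] += (w / total) * v
        outputs.append(out)
    return outputs


# Toy usage: two query tokens, three key/value tokens.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
nearest_only = knn_attention(toy_queries, toy_keys, toy_values, k=1)
all_keys = knn_attention(toy_queries, toy_keys, toy_values, k=3)
```

With k = 1 each query simply copies the value of its best-matching key, and with k equal to the number of keys the function reduces to ordinary softmax attention. Intermediate values of k discard low-scoring tokens, which is the usual motivation for this kind of attention.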