PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery

Bibliographic Details
Published in	IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, pp. 1-15
Main Authors	Li, Jiaojiao; Tian, Penghao; Song, Rui; Xu, Haitao; Li, Yunsong; Du, Qian
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Subjects	Convolution; Convolutional neural network (CNN); Detection; Detectors; Feature extraction; feature pyramid network (FPN); Image acquisition; Image processing; Information processing; multiscale object detection; Nickel; Object detection; Object recognition; Remote monitoring; Remote sensing; remote-sensing images (RSIs); Semantics; Task analysis; Transformers; Vision; vision transformer (ViT)
Summary:	Remote-sensing object detection (RSOD) is a fundamental and valuable task in Earth monitoring. However, remote-sensing images (RSIs) are typically acquired from a bird's-eye perspective, resulting in intrinsic properties such as complex backgrounds, random and dense distribution of objects, and multiscale objects. These properties hinder the direct application of well-performing detection methods from the natural-image (NI) domain to the RSI domain, limiting the performance they can attain. To address this, we propose a pyramid convolutional vision transformer (PCViT) that overcomes the limitations of existing transformer methods. First, we employ a pyramid architecture to effectively capture the multiscale information present in RSIs. To enhance the feature extraction capabilities of the transformer, we introduce a parallel convolution module (PCM) that supplies local information the transformer may miss. Furthermore, we propose a self-supervised pretraining strategy called multiperspective pretraining (MPP) to pretrain the model, which is subsequently finetuned on the downstream detection task. During the finetuning stage, we introduce a local/global $k$-NN attention (LGKA) to improve how relationships between tokens are established. In the neck, we propose a feature-reflowing pyramid network (FRPN) to facilitate contextual information interaction and further enhance PCViT's ability to process multiscale information. Experimental results on two representative datasets, NWPU VHR-10 and DIOR, demonstrate the effectiveness of PCViT, which achieves outstanding performance. These results highlight the suitability of PCViT for RSOD tasks.
ISSN:	0196-2892, 1558-0644
DOI:	10.1109/TGRS.2024.3360456