PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery
Published in | IEEE transactions on geoscience and remote sensing Vol. 62; pp. 1 - 15 |
---|---|
Main Authors | , , , , , |
Format | Journal Article |
Language | English |
Published | New York, IEEE, 2024; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Summary: | Remote-sensing object detection (RSOD) is a fundamental and valuable task in Earth monitoring. However, remote-sensing images (RSIs) are typically acquired from a bird's eye perspective, resulting in intrinsic properties such as complex backgrounds, random and dense distribution of objects, and multiscale objects. These properties hinder the direct application of well-performed detection methods in the natural images (NIs) domain to the RSIs domain, thereby limiting the attainment of desired performance. To address this, we propose a pyramid convolutional vision transformer (PCViT) that gets rid of the limitations of existing transformer methods. First, we employ a pyramid architecture to effectively capture the multiscale information present in RSIs. To enhance the feature extraction capabilities of the transformer, we introduce a parallel convolution module (PCM) that complements the local information that may be missed by the transformer. Furthermore, we propose a self-supervised pretraining strategy called multiperspective pretraining (MPP) to pretrain the model and subsequently finetune it on the downstream detection task. During the finetuning stage, we introduce a local/global $k$-NN attention (LGKA) to improve the token relationship establishment. In the neck part, we propose a feature-reflowing pyramid network (FRPN) to facilitate contextual information interaction and further enhance our PCViT's ability to process multiscale information. Experimental results on two representative datasets, namely NWPU VHR-10 and DIOR, demonstrate the effectiveness of our PCViT, as it achieves outstanding performance. These results highlight the suitability of PCViT for RSOD tasks. |
---|---|
ISSN: | 0196-2892, 1558-0644 |
DOI: | 10.1109/TGRS.2024.3360456 |