Transformer-Based Multi-layer Feature Aggregation and Rotated Anchor Matching for Oriented Object Detection in Remote Sensing Images

Bibliographic Details
Published in	Arabian journal for science and engineering (2011) Vol. 49; no. 9; pp. 12935-12951
Main Authors	Jin, Chuan; Zheng, Anqi; Wu, Zhaoying; Tong, Changqing
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 2024; Springer Nature B.V
Subjects	Artificial neural networks; Boxes; Computer vision; Encoders-Decoders; Engineering; Feature extraction; Feature maps; Humanities and Social Sciences; Matching; multidisciplinary; Multilayers; Object recognition; Remote sensing; Research Article-Computer Engineering and Computer Science; Science; Spatial resolution; Transformers; Oriented object detection; Transformer; Remote sensing; Multi-layer feature aggregation; Rotated anchor matching
Summary:	Object detection has made significant progress in computer vision. However, detecting small, arbitrarily oriented, and densely distributed objects remains challenging, especially in aerial remote sensing images. This paper presents MATDet, an end-to-end Transformer-based encoder-decoder network designed for oriented object detection. The network uses multi-layer feature aggregation and rotated anchor matching to improve detection accuracy for small, oriented, and densely distributed objects. Specifically, the encoder encodes labeled image blocks using convolutional neural network (CNN) feature maps and efficiently fuses these blocks with higher-resolution multi-scale features through cross-layer connections, facilitating the extraction of global contextual information. The decoder then upsamples the encoded features, recovering the full spatial resolution of the feature maps to capture the local–global semantic features needed for accurate object localization. In addition, high-quality anchor box proposals are generated by refined convolution, and the convolved features are adaptively aligned to the anchor boxes to reduce redundant computation. MATDet achieves mAPs of 80.35%, 78.83%, 73.60%, and 98.01% on the DOTAv1.0, DOTAv1.5, DIOR, and HRSC2016 datasets, respectively, outperforming the baseline model for oriented object detection. These results support the feasibility and effectiveness of the proposed methods.
ISSN:	2193-567X, 1319-8025, 2191-4281
DOI:	10.1007/s13369-024-08892-z