A Novel Network-Level Fusion Architecture of Proposed Self-Attention and Vision Transformer Models for Land Use and Land Cover Classification From Remote Sensing Images

Bibliographic Details
Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 17, pp. 13135-13148
Main Authors: Rubab, Saddaf; Khan, Muhammad Attique; Hamza, Ameer; Albarakati, Hussain Mobarak; Saidani, Oumaima; Alshardan, Amal; Alasiry, Areej; Marzougui, Mehrez; Nam, Yunyoung
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024

Summary: Recent deep learning techniques, driven by vast amounts of data, have demonstrated remarkable feature-learning power in remote sensing; convolutional neural networks (CNNs) in particular excel at land use and land cover classification. In this work, we propose a new network-level fusion deep architecture based on a 16-tiny Vision Transformer and SIBNet. In the initial phase, data augmentation is performed to resolve the problem of data imbalance. In the next step, we propose a self-attention bottleneck-based inception CNN network named SIBNet. This network combines two design ideas: its blocks follow the inception architecture, and each inception module is built from bottleneck blocks. The 16-tiny Vision Transformer architecture is implemented for remote sensing (RS) images and, for the first time, fused with SIBNet through network-level fusion. Hyperparameters of the proposed model are initialized using Bayesian optimization for better training on the RS images. After the fusion, the model was trained on RS image datasets, and deep features were extracted from the self-attention layer. The extracted features are classified using a neural network classifier with multiple hidden layers. The experimental process of the proposed architecture was carried out on two publicly available datasets, EuroSAT and NWPU, obtaining accuracies of 97.8% and 98.9%, respectively. A detailed ablation study shows that the fusion model achieves improved accuracy over the individual models. In addition, a comparison with recent techniques shows that the proposed method improves precision, recall, and accuracy.
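The abstract's key operation, both in the SIBNet self-attention bottleneck and in the Vision Transformer branch, is scaled dot-product self-attention. The sketch below is illustrative only (it is not the authors' code and makes no assumptions about their layer sizes); it shows the standard formulation Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V in plain Python, where each output row is a weighted mixture of the value rows.

```python
import math

def softmax(xs):
    # Numerically stable softmax over one row of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # a: n x k, b: k x m -> n x m
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def self_attention(q, k, v):
    """Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = len(q[0])                                   # feature dimension
    scores = matmul(q, transpose(k))                # token-to-token similarities
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]      # rows sum to 1
    return matmul(weights, v)                       # mix value rows per token
```

In a self-attention layer, Q, K, and V are linear projections of the same token sequence, so calling `self_attention(tokens, tokens, tokens)` on two 2-D tokens already shows the behavior: each output row leans toward the value row of the most similar token.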
ISSN: 1939-1404, 2151-1535
DOI: 10.1109/JSTARS.2024.3426950