A Novel Network-Level Fusion Architecture of Proposed Self-Attention and Vision Transformer Models for Land Use and Land Cover Classification From Remote Sensing Images

Bibliographic Details
Published in	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 17, pp. 13135-13148
Main Authors	Rubab, Saddaf; Khan, Muhammad Attique; Hamza, Ameer; Albarakati, Hussain Mobarak; Saidani, Oumaima; Alshardan, Amal; Alasiry, Areej; Marzougui, Mehrez; Nam, Yunyoung
Format	Journal Article
Language	English
Published	Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2024
Subjects	Ablation; Accuracy; Artificial neural networks; Augmentation; Bayesian analysis; Classification; Computer architecture; Convolutional neural networks; Data augmentation; Data mining; Datasets; Deep learning; explainable AI; Feature extraction; fusion; Image processing; Land cover; Land surface; Land use; Machine learning; Mathematical models; Neural networks; Probability theory; Remote sensing; remote sensing (RS); self-attention; Transformers; vision transformer (ViT)
Summary:	Recent deep learning techniques driven by vast amounts of data, convolutional neural networks (CNNs) in particular, have demonstrated remarkable feature-learning power for land use and land cover classification in remote sensing (RS). In this work, we propose a new network-level fusion deep architecture based on a 16-tiny Vision Transformer (ViT) and SIBNet. In the initial phase, data augmentation is performed to resolve the problem of data imbalance. In the next step, we propose a self-attention bottleneck-based inception CNN named SIBNet. This network combines two architectural ideas: its blocks follow the inception design, and each inception module is built from bottleneck blocks. The 16-tiny ViT architecture is implemented for RS images and, for the first time, fused with SIBNet through network-level fusion. The hyperparameters of the proposed model are initialized using Bayesian optimization for better training on the RS images. After the fusion, the model is trained on RS image datasets, and deep features are extracted from the self-attention layer. The extracted features are classified using a neural network classifier with multiple hidden layers. Experiments on two publicly available datasets, EuroSAT and NWPU, yield accuracies of 97.8% and 98.9%, respectively. A detailed ablation study of the proposed models shows that the fusion model achieves improved accuracy. In addition, a comparison with recent techniques shows that the proposed method improves precision, recall, and accuracy.
ISSN:	1939-1404 2151-1535
DOI:	10.1109/JSTARS.2024.3426950
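
Note:	The summary says that data augmentation is used to resolve class imbalance but does not say which transforms are applied or how many copies are generated. The following is a minimal sketch of one common approach, not the authors' implementation: every minority class is topped up with flipped or rotated copies until it matches the largest class. The function names, the choice of transforms, and the list-of-rows image format are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def hflip(img):
    """Mirror a 2-D image (a list of rows) left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror a 2-D image top to bottom."""
    return [row[:] for row in img[::-1]]


def rot90(img):
    """Rotate a 2-D image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


# Assumed transform set; the record does not list the transforms actually used.
AUGMENTATIONS = (hflip, vflip, rot90)


def balance_by_augmentation(samples, seed=0):
    """Return a new list in which every class has as many samples as the
    largest class, topping up minority classes with augmented copies.

    samples: list of (image, label) pairs; the originals are kept unchanged.
    """
    if not samples:
        return []
    rng = random.Random(seed)

    # Group the images by class label.
    by_label = defaultdict(list)
    for img, label in samples:
        by_label[label].append(img)
    target = max(len(imgs) for imgs in by_label.values())

    balanced = list(samples)
    for label, imgs in by_label.items():
        # Add exactly as many augmented copies as the class is short of.
        for _ in range(target - len(imgs)):
            source = rng.choice(imgs)
            transform = rng.choice(AUGMENTATIONS)
            balanced.append((transform(source), label))
    return balanced


# Toy data: three "forest" tiles and one "river" tile (hypothetical labels).
toy = [([[1, 2], [3, 4]], "forest")] * 3 + [([[5, 6], [7, 8]], "river")]
class_counts = Counter(label for _, label in balance_by_augmentation(toy))
```

In this sketch `class_counts` ends up with the same count for both toy classes. Geometric flips and right-angle rotations are a frequent choice for overhead imagery because a land cover label does not usually depend on tile orientation; whether the paper relies on these particular transforms is not stated in this record.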