Joint Multi-Scale CNN and Vision Transformer for Hyperspectral Image Classification

Bibliographic Details
Published in: 2024 IEEE 2nd International Conference on Control, Electronics and Computer Technology (ICCECT), pp. 364-369
Main Authors: Sun, Rui; Xiang, Jianhong; Wang, Linyu
Format: Conference Proceeding
Language: English
Published: IEEE, 26.04.2024
DOI: 10.1109/ICCECT60629.2024.10546120

More Information
Summary: Convolutional neural networks (CNNs) have shown strong performance in hyperspectral image (HSI) classification. However, an HSI contains hundreds of continuous spectral bands, and CNN-based HSI classification methods neglect the deeper sequential semantic information in the spectrum, which a Transformer can process more effectively. In this article, a new network, MCSS-ViT, is designed to extract vital spectral-spatial features of HSI at different levels. MCSS-ViT combines a multi-scale CNN based on residual and channel attention modules (MCRC) with a vision transformer (ViT), where the MCRC comprises the CRC block and the AIC block. First, principal component analysis (PCA) is employed to reduce the spectral dimension of the HSI. Subsequently, the CRC block learns the spectral-spatial information of the input patches. Meanwhile, to avoid losing important multi-scale spatial information, the AIC block captures complementary spatial features. Finally, the SS-Former extracts global and semantic features from the image. The performance of MCSS-ViT was evaluated on two datasets, and the experiments showed that the proposed method achieved better classification results than other classical methods.
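To make the described pipeline concrete, the sketch below shows one plausible way to wire a PCA-reduced HSI patch through a residual/channel-attention CNN block, a multi-scale spatial block, and a transformer encoder, written in PyTorch. It is a minimal illustration assembled from the abstract alone: the class names (CRCBlock, AICBlock, HybridHSIClassifier), layer choices, and all hyperparameters (30 PCA components, 15x15 patches, 16 classes) are assumptions, not the authors' MCSS-ViT implementation.

```python
# Hypothetical sketch of a PCA -> multi-scale CNN -> transformer pipeline for HSI
# patch classification. Block names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class CRCBlock(nn.Module):
    """Residual convolution with channel attention (squeeze-and-excitation style)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.attn = nn.Sequential(  # per-channel attention weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        return torch.relu(x + y * self.attn(y))  # residual add + channel re-weighting


class AICBlock(nn.Module):
    """Multi-scale spatial branch: parallel convolutions with different kernel sizes."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        )
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        return torch.relu(self.fuse(torch.cat([b(x) for b in self.branches], dim=1)))


class HybridHSIClassifier(nn.Module):
    """PCA-reduced HSI patch -> CNN feature maps -> transformer encoder -> class logits."""
    def __init__(self, pca_bands: int = 30, channels: int = 64,
                 num_classes: int = 16, depth: int = 2, heads: int = 4):
        super().__init__()
        self.stem = nn.Conv2d(pca_bands, channels, 3, padding=1)
        self.crc = CRCBlock(channels)
        self.aic = AICBlock(channels)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels,
            batch_first=True)
        self.former = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, channels))
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):                      # x: (B, pca_bands, H, W)
        f = self.aic(self.crc(torch.relu(self.stem(x))))
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, channels): pixels as tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        out = self.former(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])            # classify from the class token


if __name__ == "__main__":
    # e.g. 30 PCA components, 15x15 spatial patches, 16 land-cover classes
    model = HybridHSIClassifier()
    logits = model(torch.randn(8, 30, 15, 15))
    print(logits.shape)  # torch.Size([8, 16])
```

The sketch treats each spatial position of the CNN feature map as a token for the transformer and omits positional embeddings for brevity; the published SS-Former likely differs in how it forms and attends over spectral-spatial tokens.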