MCFT: Multimodal Contrastive Fusion Transformer for Classification of Hyperspectral Image and LiDAR Data

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, p. 1
Main Authors: Feng, Yining; Jin, Jiarui; Yin, Yin; Song, Chuanming; Wang, Xianghai
Format: Journal Article
Language: English
Published: IEEE, 01.11.2024
Summary: Multi-source remote sensing (RS) image fusion leverages data from various sensors to enhance the accuracy and comprehensiveness of earth observation. Notably, the fusion of hyperspectral (HS) images and light detection and ranging (LiDAR) data has garnered significant attention due to their complementary features. However, current methods predominantly rely on simplistic techniques such as weight sharing, feature superposition, or feature products, which often fall short of achieving true feature fusion. These methods primarily focus on feature accumulation rather than integrative fusion. The Transformer framework, with its self-attention mechanisms, offers potential for effective multimodal data fusion. However, the simple linear transformations used in feature extraction may not adequately capture all relevant information. To address these challenges, we propose a novel multimodal contrastive fusion Transformer (MCFT). Our approach employs convolutional neural networks (CNNs) for feature extraction from different modalities and leverages Transformer networks for advanced fusion. We modify the basic Transformer architecture and propose a double position embedding mode to make it more suitable for RS image processing tasks. We introduce two novel modules, a feature alignment module and a feature matching module, designed to exploit both paired and unpaired samples. These modules facilitate more effective cross-modal learning by emphasizing the commonalities within the same features and the differences between features from distinct modalities. Experimental evaluations on several publicly available HS-LiDAR datasets demonstrate that the proposed method consistently outperforms existing advanced methods. The source code for our approach is available at: https://github.com/SYFYN0317/MCFT.
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3490752
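
The abstract above describes a dual-branch design: CNNs extract features from each modality, a Transformer fuses the resulting tokens, and contrastive alignment/matching objectives exploit paired and unpaired HS-LiDAR samples. Below is a minimal, illustrative PyTorch sketch of that general pattern, not the authors' MCFT implementation (see the GitHub link in the summary); the layer sizes, the InfoNCE-style loss, and the learned modality embedding standing in for the paper's double position embedding are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNBranch(nn.Module):
    """Per-modality CNN that turns an image patch into a token sequence."""
    def __init__(self, in_channels, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        f = self.conv(x)                        # (B, D, H, W)
        return f.flatten(2).transpose(1, 2)     # (B, H*W, D) token sequence


class FusionClassifier(nn.Module):
    """Two CNN branches plus a Transformer encoder over the joint token set."""
    def __init__(self, hs_channels, lidar_channels, num_classes, embed_dim=64):
        super().__init__()
        self.hs_branch = CNNBranch(hs_channels, embed_dim)
        self.lidar_branch = CNNBranch(lidar_channels, embed_dim)
        # Learned modality embedding; a stand-in for (not a reproduction of)
        # the paper's double position embedding.
        self.modality_embed = nn.Parameter(torch.zeros(2, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, hs_patch, lidar_patch):
        hs_tok = self.hs_branch(hs_patch) + self.modality_embed[0]
        li_tok = self.lidar_branch(lidar_patch) + self.modality_embed[1]
        fused = self.fusion(torch.cat([hs_tok, li_tok], dim=1))
        # Mean-pooled per-modality embeddings are returned for the
        # contrastive objective below.
        return (self.head(fused.mean(dim=1)),
                hs_tok.mean(dim=1), li_tok.mean(dim=1))


def contrastive_alignment_loss(hs_emb, lidar_emb, temperature=0.1):
    """InfoNCE-style loss: the paired HS/LiDAR embeddings of each sample are
    positives; every other pairing in the batch acts as a negative."""
    hs = F.normalize(hs_emb, dim=-1)
    li = F.normalize(lidar_emb, dim=-1)
    logits = hs @ li.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(hs.size(0), device=hs.device)
    return F.cross_entropy(logits, targets)


# Usage (hypothetical shapes): 144-band HS patches, 1-band LiDAR DSM, 15 classes.
model = FusionClassifier(hs_channels=144, lidar_channels=1, num_classes=15)
logits, hs_emb, li_emb = model(torch.randn(8, 144, 9, 9), torch.randn(8, 1, 9, 9))
labels = torch.randint(0, 15, (8,))
loss = F.cross_entropy(logits, labels) + contrastive_alignment_loss(hs_emb, li_emb)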