MCFT: Multimodal Contrastive Fusion Transformer for Classification of Hyperspectral Image and LiDAR Data

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, p. 1
Main Authors: Feng, Yining; Jin, Jiarui; Yin, Yin; Song, Chuanming; Wang, Xianghai
Format: Journal Article
Language: English
Published: IEEE, 01.11.2024
Summary: Multi-source remote sensing (RS) image fusion leverages data from various sensors to enhance the accuracy and comprehensiveness of earth observation. Notably, the fusion of hyperspectral (HS) images and light detection and ranging (LiDAR) data has garnered significant attention due to their complementary features. However, current methods predominantly rely on simplistic techniques such as weight sharing, feature superposition, or feature products, which often fall short of achieving true feature fusion. These methods primarily focus on feature accumulation rather than integrative fusion. The Transformer framework, with its self-attention mechanisms, offers potential for effective multimodal data fusion. However, the simple linear transformations used in feature extraction may not adequately capture all relevant information. To address these challenges, we propose a novel multimodal contrastive fusion Transformer (MCFT). Our approach employs convolutional neural networks (CNNs) for feature extraction from different modalities and leverages Transformer networks for advanced fusion. We modify the basic Transformer architecture and propose a double position embedding mode to make it more suitable for RS image processing tasks. We introduce two novel modules, a feature alignment module and a feature matching module, designed to exploit both paired and unpaired samples. These modules facilitate more effective cross-modal learning by emphasizing the commonalities within the same features and the differences between features from distinct modalities. Experimental evaluations on several publicly available HS-LiDAR datasets demonstrate that the proposed method consistently outperforms existing advanced methods. The source code for our approach is available at: https://github.com/SYFYN0317/MCFT.
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3490752
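
The abstract above describes a dual-branch design: CNNs extract features from each modality, a Transformer fuses the resulting tokens, and contrastive alignment/matching objectives exploit paired and unpaired HS-LiDAR samples. Below is a minimal, illustrative PyTorch sketch of that general pattern, not the authors' MCFT implementation (see the GitHub link in the summary); the layer sizes, the InfoNCE-style loss, and the learned modality embedding standing in for the paper's double position embedding are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNBranch(nn.Module):
    """Per-modality CNN that turns an image patch into a token sequence."""
    def __init__(self, in_channels, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        f = self.conv(x)                        # (B, D, H, W)
        return f.flatten(2).transpose(1, 2)     # (B, H*W, D) token sequence


class FusionClassifier(nn.Module):
    """Two CNN branches plus a Transformer encoder over the joint token set."""
    def __init__(self, hs_channels, lidar_channels, num_classes, embed_dim=64):
        super().__init__()
        self.hs_branch = CNNBranch(hs_channels, embed_dim)
        self.lidar_branch = CNNBranch(lidar_channels, embed_dim)
        # Learned modality embedding; a stand-in for (not a reproduction of)
        # the paper's double position embedding.
        self.modality_embed = nn.Parameter(torch.zeros(2, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, hs_patch, lidar_patch):
        hs_tok = self.hs_branch(hs_patch) + self.modality_embed[0]
        li_tok = self.lidar_branch(lidar_patch) + self.modality_embed[1]
        fused = self.fusion(torch.cat([hs_tok, li_tok], dim=1))
        # Mean-pooled per-modality embeddings are returned for the
        # contrastive objective below.
        return (self.head(fused.mean(dim=1)),
                hs_tok.mean(dim=1), li_tok.mean(dim=1))


def contrastive_alignment_loss(hs_emb, lidar_emb, temperature=0.1):
    """InfoNCE-style loss: the paired HS/LiDAR embeddings of each sample are
    positives; every other pairing in the batch acts as a negative."""
    hs = F.normalize(hs_emb, dim=-1)
    li = F.normalize(lidar_emb, dim=-1)
    logits = hs @ li.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(hs.size(0), device=hs.device)
    return F.cross_entropy(logits, targets)


# Usage (hypothetical shapes): 144-band HS patches, 1-band LiDAR DSM, 15 classes.
model = FusionClassifier(hs_channels=144, lidar_channels=1, num_classes=15)
logits, hs_emb, li_emb = model(torch.randn(8, 144, 9, 9), torch.randn(8, 1, 9, 9))
labels = torch.randint(0, 15, (8,))
loss = F.cross_entropy(logits, labels) + contrastive_alignment_loss(hs_emb, li_emb)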