A Transformer-based multi-modal fusion network for semantic segmentation of high-resolution remote sensing imagery

Bibliographic Details
Published in International journal of applied earth observation and geoinformation Vol. 133; p. 104083
Main Authors Liu, Yutong; Gao, Kun; Wang, Hong; Yang, Zhijia; Wang, Pengyu; Ji, Shijing; Huang, Yanjun; Zhu, Zhenyu; Zhao, Xiaobin
Format Journal Article
Language English
Published Elsevier B.V 01.09.2024
Elsevier

Summary: Semantic segmentation of high-resolution multispectral remote sensing images has been intensively studied. However, shadow occlusions and similar colors and textures between categories reduce segmentation accuracy. In addition, targets in remote sensing images vary widely in size, and networks struggle to segment large and small targets equally well. This paper introduces the Transformer-based Multi-modal Fusion Network (TMFNet), which fuses multi-modal features and incorporates height features from the digital surface model (DSM) to supply additional features that distinguish the categories. Specifically, we introduce two parallel encoders to extract features from the different modalities, a Multi-Modal fusion model based on the Transformer (MMformer) to perform the multi-modal fusion, and a Border Region Attention based multi-level Fusion Module (BRAFM) to integrate cross-level features and enhance small-target segmentation by using the details around the border. Experimental results on the ISPRS Vaihingen and Potsdam benchmark datasets indicate that the proposed TMFNet outperforms state-of-the-art (SOTA) methods in segmentation performance.
• A Transformer-based fusion module to fuse multi-modal features for segmentation.
• A cross-level fusion module to recover small-target features.
• Accurate segmentation of targets with similar color or occluded by shadow.
• Small targets can be recognized with a complete border region.
ISSN:1569-8432
DOI:10.1016/j.jag.2024.104083
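
Illustrative note: the Summary describes fusing features from two modalities (optical imagery and DSM height) with a Transformer-based module. The record does not give the internals of MMformer, so the following is only a minimal sketch of generic scaled dot-product cross-attention, in which optical tokens query height tokens and the result is added back as a residual. The function name `cross_attention_fuse`, the toy inputs, and the omission of learned query/key/value projections are all assumptions made for illustration and are not taken from the paper.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_fuse(optical_tokens, height_tokens):
    """Hypothetical fusion step: optical tokens attend to height tokens.

    Both inputs are lists of feature vectors with the same dimension.
    Assumption: no learned projections are applied; a real Transformer
    layer would use trainable query/key/value weight matrices.
    """
    dim = len(optical_tokens[0])
    # Similarity of every optical token to every height token.
    scores = matmul(optical_tokens, transpose(height_tokens))
    # Scale by sqrt(dim) and normalize each row into attention weights.
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    # Weighted sum of height features for each optical token.
    attended = matmul(weights, height_tokens)
    # Residual connection: keep the optical features and add height context.
    fused = [[o + a for o, a in zip(o_row, a_row)]
             for o_row, a_row in zip(optical_tokens, attended)]
    return fused, weights

# Toy example: 2 optical tokens and 3 height tokens, each of dimension 2.
optical = [[1.0, 0.0], [0.0, 1.0]]
height = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = cross_attention_fuse(optical, height)
```

In this toy setup each optical token places the most weight on the height token that points in the same direction, which is the basic mechanism that lets one modality pull in complementary information from the other.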