Transformer fusion for indoor RGB-D semantic segmentation

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 249, p. 104174
Main Authors: Wu, Zongwei; Zhou, Zhuyun; Allibert, Guillaume; Stolz, Christophe; Demonceaux, Cédric; Ma, Chao
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.12.2024

Summary: Fusing geometric cues with visual appearance is an imperative theme for RGB-D indoor semantic segmentation. Existing methods commonly adopt convolutional modules to aggregate multi-modal features, paying little attention to explicitly leveraging the long-range dependencies in feature fusion. Therefore, it is challenging for existing methods to accurately segment objects with large-scale variations. In this paper, we propose a novel transformer-based fusion scheme, named TransD-Fusion, to better model contextualized awareness. Specifically, TransD-Fusion consists of a self-refinement module, a calibration scheme with cross-interaction, and a depth-guided fusion. The objective is to first improve modality-specific features with self- and cross-attention, and then explore the geometric cues to better segment objects sharing a similar visual appearance. Additionally, our transformer fusion benefits from a semantic-aware position encoding which spatially constrains the attention to neighboring pixels. Extensive experiments on RGB-D benchmarks demonstrate that the proposed method outperforms state-of-the-art methods by large margins.
Highlights:
• We propose a novel transformer-based multi-modal fusion for RGB-D semantic segmentation.
• We design a semantic-aware position encoding, dynamically generated from a modality-specific sequence of tokens by a convolutional layer, which imposes spatial constraints on neighboring pixels for accurate segmentation.
• Our network performs favorably against state-of-the-art methods on large-scale benchmark datasets by large margins.
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2024.104174
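
To make the fusion scheme described in the summary more concrete, the following is a minimal sketch (not the authors' released code) of a cross-modal transformer fusion block: each modality is first refined with self-attention, then calibrated by cross-attention with the other modality, and a convolution-based positional encoding constrains the attention toward neighboring pixels, as in the semantic-aware position encoding idea. All module names, dimensions, and the final fusion layer are illustrative assumptions.

# Hedged sketch of a TransD-Fusion-style block; names and hyper-parameters are assumptions.
import torch
import torch.nn as nn


class ConvPositionEncoding(nn.Module):
    """Depthwise convolution over the token grid: a common way to inject
    spatially local (neighborhood) positional information into tokens."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, h, w):
        # tokens: (B, N, C) with N = h * w
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.proj(x) + x                      # residual positional cue
        return x.flatten(2).transpose(1, 2)       # back to (B, N, C)


class CrossModalFusionBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.pos_rgb = ConvPositionEncoding(dim)
        self.pos_dep = ConvPositionEncoding(dim)
        self.self_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_dep = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, rgb, dep, h, w):
        # rgb, dep: (B, N, C) token sequences from the two modality encoders
        rgb = self.pos_rgb(rgb, h, w)
        dep = self.pos_dep(dep, h, w)

        # 1) modality-specific self-refinement
        rgb = rgb + self.self_rgb(rgb, rgb, rgb, need_weights=False)[0]
        dep = dep + self.self_dep(dep, dep, dep, need_weights=False)[0]

        # 2) calibration by cross-interaction: RGB queries attend to depth
        #    keys/values so geometry can disambiguate similar appearances
        rgb_cal = rgb + self.cross(rgb, dep, dep, need_weights=False)[0]

        # 3) depth-guided fusion of the two streams
        fused = self.fuse(torch.cat([rgb_cal, dep], dim=-1))
        return self.norm(fused)


if __name__ == "__main__":
    h = w = 16
    rgb = torch.randn(2, h * w, 64)
    dep = torch.randn(2, h * w, 64)
    out = CrossModalFusionBlock(64)(rgb, dep, h, w)
    print(out.shape)  # torch.Size([2, 256, 64])

A full model would stack such blocks at several encoder scales and pass the fused features to a segmentation decoder; those details are not specified by the record above.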