Transformer fusion for indoor RGB-D semantic segmentation

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 249, p. 104174
Main Authors: Wu, Zongwei; Zhou, Zhuyun; Allibert, Guillaume; Stolz, Christophe; Demonceaux, Cédric; Ma, Chao
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.12.2024

Summary: Fusing geometric cues with visual appearance is an imperative theme for RGB-D indoor semantic segmentation. Existing methods commonly adopt convolutional modules to aggregate multi-modal features, paying little attention to explicitly leveraging the long-range dependencies in feature fusion. Therefore, it is challenging for existing methods to accurately segment objects with large-scale variations. In this paper, we propose a novel transformer-based fusion scheme, named TransD-Fusion, to better model contextualized awareness. Specifically, TransD-Fusion consists of a self-refinement module, a calibration scheme with cross-interaction, and a depth-guided fusion. The objective is to first improve modality-specific features with self- and cross-attention, and then explore the geometric cues to better segment objects sharing a similar visual appearance. Additionally, our transformer fusion benefits from a semantic-aware position encoding which spatially constrains the attention to neighboring pixels. Extensive experiments on RGB-D benchmarks demonstrate that the proposed method outperforms state-of-the-art methods by large margins.
Highlights:
• We propose a novel transformer-based multi-modal fusion for RGB-D semantic segmentation.
• We design a semantic-aware position encoding, dynamically generated from a modality-specific sequence of tokens by a convolutional layer, which imposes spatial constraints on neighboring pixels for accurate segmentation.
• Our network performs favorably against state-of-the-art methods on large-scale benchmark datasets by large margins.
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2024.104174
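
To make the fusion scheme described in the summary more concrete, the following is a minimal sketch (not the authors' released code) of a cross-modal transformer fusion block: each modality is first refined with self-attention, then calibrated by cross-attention with the other modality, and a convolution-based positional encoding constrains the attention toward neighboring pixels, as in the semantic-aware position encoding idea. All module names, dimensions, and the final fusion layer are illustrative assumptions.

# Hedged sketch of a TransD-Fusion-style block; names and hyper-parameters are assumptions.
import torch
import torch.nn as nn


class ConvPositionEncoding(nn.Module):
    """Depthwise convolution over the token grid: a common way to inject
    spatially local (neighborhood) positional information into tokens."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, h, w):
        # tokens: (B, N, C) with N = h * w
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.proj(x) + x                      # residual positional cue
        return x.flatten(2).transpose(1, 2)       # back to (B, N, C)


class CrossModalFusionBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.pos_rgb = ConvPositionEncoding(dim)
        self.pos_dep = ConvPositionEncoding(dim)
        self.self_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_dep = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, rgb, dep, h, w):
        # rgb, dep: (B, N, C) token sequences from the two modality encoders
        rgb = self.pos_rgb(rgb, h, w)
        dep = self.pos_dep(dep, h, w)

        # 1) modality-specific self-refinement
        rgb = rgb + self.self_rgb(rgb, rgb, rgb, need_weights=False)[0]
        dep = dep + self.self_dep(dep, dep, dep, need_weights=False)[0]

        # 2) calibration by cross-interaction: RGB queries attend to depth
        #    keys/values so geometry can disambiguate similar appearances
        rgb_cal = rgb + self.cross(rgb, dep, dep, need_weights=False)[0]

        # 3) depth-guided fusion of the two streams
        fused = self.fuse(torch.cat([rgb_cal, dep], dim=-1))
        return self.norm(fused)


if __name__ == "__main__":
    h = w = 16
    rgb = torch.randn(2, h * w, 64)
    dep = torch.randn(2, h * w, 64)
    out = CrossModalFusionBlock(64)(rgb, dep, h, w)
    print(out.shape)  # torch.Size([2, 256, 64])

A full model would stack such blocks at several encoder scales and pass the fused features to a segmentation decoder; those details are not specified by the record above.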