Transformer fusion for indoor RGB-D semantic segmentation
Published in: Computer Vision and Image Understanding, Vol. 249, Article 104174
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.12.2024
Summary: Fusing geometric cues with visual appearance is an imperative theme for RGB-D indoor semantic segmentation. Existing methods commonly adopt convolutional modules to aggregate multi-modal features, paying little attention to explicitly leveraging long-range dependencies in feature fusion. It is therefore challenging for existing methods to accurately segment objects with large-scale variations. In this paper, we propose a novel transformer-based fusion scheme, named TransD-Fusion, to better model contextualized awareness. Specifically, TransD-Fusion consists of a self-refinement module, a calibration scheme with cross-interaction, and a depth-guided fusion module. The objective is to first improve modality-specific features with self- and cross-attention, and then exploit geometric cues to better segment objects that share a similar visual appearance. Additionally, our transformer fusion benefits from a semantic-aware position encoding which spatially constrains the attention to neighboring pixels. Extensive experiments on RGB-D benchmarks demonstrate that the proposed method outperforms state-of-the-art methods by large margins.
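The abstract only names the three components, so the following is a minimal PyTorch sketch of how such a block could be wired, assuming token-sequence inputs from separate RGB and depth encoders. The class name TransDFusionBlock, the sigmoid gate, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the fusion scheme described in the abstract:
# per-modality self-refinement (self-attention), cross-interaction
# calibration (cross-attention), and a depth-guided fusion step.
import torch
import torch.nn as nn

class TransDFusionBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Self-refinement: one self-attention layer per modality.
        self.self_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-interaction calibration: each modality queries the other.
        self.cross_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Depth-guided fusion (an assumption): depth tokens gate the
        # calibrated visual features before the residual merge.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (B, N, C) token sequences from the two encoders.
        rgb = rgb + self.self_rgb(rgb, rgb, rgb, need_weights=False)[0]
        depth = depth + self.self_depth(depth, depth, depth, need_weights=False)[0]
        # Calibrate each stream with long-range context from the other.
        rgb_c = rgb + self.cross_rgb(rgb, depth, depth, need_weights=False)[0]
        depth_c = depth + self.cross_depth(depth, rgb, rgb, need_weights=False)[0]
        # Geometric cues from depth modulate the visual features.
        fused = rgb_c * self.gate(depth_c) + depth_c
        return self.norm(fused)
```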
Highlights:
• We propose a novel transformer-based multi-modal fusion for RGB-D semantic segmentation.
• We design a semantic-aware position encoding that is dynamically generated from a modality-specific sequence of tokens by a convolutional layer, imposing a spatial constraint on neighboring pixels for accurate segmentation (sketched below).
• Our network performs favorably against SOTA methods on large-scale benchmark datasets by large margins.
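A minimal sketch of the second highlight's idea, in the spirit of convolutional position encodings (e.g., the positional encoding generator of CPVT): the encoding is computed from the modality-specific tokens themselves by a depthwise convolution over the token grid. SemanticAwarePosEnc and its parameters are hypothetical names for illustration, not the paper's code.

```python
# Hedged sketch: a position encoding generated dynamically from the tokens
# by a convolutional layer, biasing attention toward spatial neighbors.
import torch
import torch.nn as nn

class SemanticAwarePosEnc(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv over the token grid: each position's encoding is
        # computed from its spatial neighbors, giving a local constraint.
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, C) with N == h * w.
        b, n, c = tokens.shape
        assert n == h * w
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)
        pos = self.proj(grid).flatten(2).transpose(1, 2)  # back to (B, N, C)
        return tokens + pos  # tokens enriched with content-dependent positions
```

Because the encoding is derived from the tokens rather than from a fixed table, it adapts to image content and input resolution, which is one plausible reading of "dynamically generated" in the highlight.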
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2024.104174