Cascading attention enhancement network for RGB-D indoor scene segmentation

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 259, p. 104411
Main Authors: Tang, Xu; Cen, Songyang; Deng, Zhanhao; Zhang, Zejun; Meng, Yan; Xie, Jianxiao; Tang, Changbing; Zhang, Weichuan; Zhao, Guanghui
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.09.2025
ISSN: 1077-3142
DOI: 10.1016/j.cviu.2025.104411

Summary: Convolutional neural network based Red, Green, Blue, and Depth (RGB-D) image semantic segmentation for indoor scenes has attracted increasing attention because of its great potential for extracting semantic information from RGB-D images. The main challenge lies in how to effectively fuse features from RGB and depth images within the neural network architecture. The technical approach to feature aggregation has evolved from early integration of RGB color images and depth images to the current cross-attention fusion, which allows the features of different RGB channels to be fully integrated with those of the depth image. However, noise and features useless for segmentation are inevitably propagated between feature layers during feature aggregation, degrading the accuracy of the segmentation results. In this paper, a cascading attention enhancement network (CAENet) is proposed for indoor scenes, with the aim of progressively refining the semantic features of RGB and depth images layer by layer. It consists of four modules: a channel enhancement module (CEM), an adaptive aggregation of spatial attention (AASA), an adaptive aggregation of channel attention (AACA), and a triple-path fusion module (TFM). In the encoding stage, the CEM complements RGB features with depth features at the end of each layer, in order to effectively revise the RGB features for the next layer. At the end of the encoding stage, the AASA module combines low-level and high-level RGB semantic features by their spatial attention, and the AACA module fuses low-level and high-level depth semantic features by their channel attention. The combined RGB and depth semantic features are fused into one representation and fed into the decoding stage, which consists of triple-path fusion modules (TFMs) that combine low-level RGB and depth semantic features with decoded high-level semantic features. The TFM outputs multi-scale feature maps that encapsulate both rich semantic information and fine-grained details, thereby augmenting the model's capacity for accurate per-pixel semantic label prediction. The proposed CAENet achieves an mIoU of 52.0% on NYUDv2 and 48.3% on SUNRGB-D, outperforming recent RGB-D segmentation methods.
• A Channel Enhancement Module (CEM) fuses RGB-D features and suppresses noise.
• Adaptive attention modules (AASA & AACA) fuse multi-level RGB and depth features.
• A Triple-Path Fusion module (TFM) fuses multi-level features for segmentation.
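To make the cascaded channel-enhancement idea from the summary concrete, the following PyTorch-style sketch re-weights RGB encoder features with channel attention computed from the corresponding depth features before passing them to the next layer. The class name, reduction ratio, and tensor shapes are illustrative assumptions; this is not the authors' implementation.

```python
# Minimal sketch of a channel-enhancement step for one encoder layer
# (names, reduction ratio, and shapes are assumptions, not the paper's code).
import torch
import torch.nn as nn

class ChannelEnhancement(nn.Module):
    """Re-weights RGB features with channel attention derived from depth features."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # Channel attention is computed from the depth branch ...
        attn = self.mlp(self.pool(depth_feat))
        # ... and rescales the RGB features; the residual keeps the original signal.
        return rgb_feat + rgb_feat * attn

if __name__ == "__main__":
    cem = ChannelEnhancement(channels=64)
    rgb = torch.randn(1, 64, 120, 160)
    depth = torch.randn(1, 64, 120, 160)
    print(cem(rgb, depth).shape)  # torch.Size([1, 64, 120, 160])
```

In a cascaded encoder, one such block would sit at the end of each layer, so that depth-guided corrections accumulate progressively across layers rather than being applied once at the end.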