Structure perception and edge refinement network for monocular depth estimation

Bibliographic Details
Published in	Computer vision and image understanding Vol. 256; p. 104348
Main Authors	Zuo, Shuangquan, Xiao, Yun, Wang, Xuanhong, Lv, Hao, Chen, Hongwei
Format	Journal Article
Language	English
Published	Elsevier Inc 01.05.2025
Subjects	Deep learning; Mixed attention; Monocular depth estimation; Multi-scale features
Summary:	Monocular depth estimation is fundamental for scene understanding and downstream visual tasks. In recent years, with the development of deep learning, increasingly complex networks and powerful mechanisms have significantly improved the performance of monocular depth estimation. Nevertheless, predicting dense per-pixel depth from a single RGB image remains challenging because the problem is ill-posed and inherently ambiguous. Two unresolved issues persist: (1) Depth features are limited in perceiving the scene structure accurately, leading to inaccurate region estimation. (2) Low-level features, which are rich in details, are not fully utilized, causing loss of detail and ambiguous edges. The crux of accurate dense depth restoration is to efficiently handle both global scene structure and local details. To solve these two issues, we propose the Scene perception and Edge refinement network for Monocular Depth Estimation (SE-MDE). Specifically, we carefully design a depth-enhanced encoder (DEE) to effectively perceive the overall structure of the scene while refining the feature responses of different regions. Meanwhile, we introduce a dense edge-guided network (DENet) that maximizes the utilization of low-level features to enhance the depth estimates of details and edges. Extensive experiments validate the effectiveness of our method: results on the NYU v2 indoor dataset and the KITTI outdoor dataset demonstrate the state-of-the-art performance of the proposed method.
• To address the inherently ill-posed problem, we propose a novel structure-perception and edge-refinement monocular depth estimation method.
• We design a depth-enhanced encoder to further emphasize the scene structure while capturing the global context.
• We redesign a dense edge-guided decoder network based on edge-aware blocks to fully mine the detailed depth information in low-level features.
ISSN:	1077-3142
DOI:	10.1016/j.cviu.2025.104348