DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Bibliographic Details
Published in	International Journal of Automation and Computing, Vol. 20, no. 6, pp. 837-854
Main Authors	Li, Zhenyu; Chen, Zehui; Liu, Xianming; Jiang, Junjun
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg; Springer Nature B.V., 01.12.2023
Subjects	Ablation; Artificial Intelligence; Competition; Computer Science; Convolution; Datasets; Feature maps; Formability; Modules; Pilot projects; Research Article; Transformers; 3D reconstruction; Autonomous driving; Monocular depth estimation
ISSN	2731-538X, 1476-8186, 2153-182X, 2731-5398, 1751-8520, 2153-1838
DOI	10.1007/s11633-023-1458-0

Summary:	This paper addresses the problem of supervised monocular depth estimation. We start with a pilot study demonstrating that long-range correlation is essential for accurate depth estimation, and that the Transformer and convolution are good at long-range and close-range depth estimation, respectively. We therefore propose a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former models global context with an effective attention mechanism, while the latter preserves local information, since the Transformer lacks the spatial inductive bias needed to model such content. However, independent branches lead to a shortage of connections between features. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module that enhances the Transformer features and models the affinity between the heterogeneous features in a set-to-set translation manner. Because global attention on high-resolution feature maps incurs a prohibitive memory cost, we adopt a deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that the proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins. The effectiveness of each proposed module is evaluated through extensive ablation studies.
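
The interaction between the two branches described in the summary can be illustrated with a toy cross-attention step. This is only a minimal sketch under stated assumptions, not the authors' implementation: it uses plain dense attention rather than the deformable scheme the summary mentions, it assumes the convolution-branch tokens act as queries over the Transformer-branch tokens, it omits learned projections, and the names `cross_attention` and `interact` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Dense scaled dot-product attention: every query attends to every key.

    Assumption: no learned projection matrices, to keep the sketch short.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Affinity between one query token and each key token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output channel at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


def interact(conv_tokens, transformer_tokens):
    """Hypothetical interaction step: conv tokens query the Transformer tokens,
    and the attended result is added back as a residual."""
    attended = cross_attention(conv_tokens, transformer_tokens, transformer_tokens)
    return [
        [c + a for c, a in zip(conv_tok, att_tok)]
        for conv_tok, att_tok in zip(conv_tokens, attended)
    ]


# Toy features: three tokens from the convolution branch (local information)
# and two tokens from the Transformer branch (global context).
conv_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
transformer_tokens = [[2.0, 0.0], [0.0, 2.0]]
fused = interact(conv_tokens, transformer_tokens)
```

Dense attention of this kind compares every query with every key, so its cost grows with the product of the two token counts; that growth is the memory concern on high-resolution feature maps that motivates the deformable scheme in the summary.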