DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

This paper aims to address the problem of supervised monocular depth estimation. We start with a meticulous pilot study to demonstrate that the long-range correlation is essential for accurate depth estimation. Moreover, the Transformer and convolution are good at long-range and close-range depth es...

Full description

Saved in:
Bibliographic Details
Published inInternational journal of automation and computing Vol. 20; no. 6; pp. 837 - 854
Main Authors Li, Zhenyu, Chen, Zehui, Liu, Xianming, Jiang, Junjun
Format Journal Article
LanguageEnglish
Published Berlin/Heidelberg Springer Berlin Heidelberg 01.12.2023
Springer Nature B.V
Subjects
Online AccessGet full text
ISSN2731-538X
1476-8186
2153-182X
2731-5398
1751-8520
2153-1838
DOI10.1007/s11633-023-1458-0

Cover

Loading…
More Information
Summary:This paper aims to address the problem of supervised monocular depth estimation. We start with a meticulous pilot study to demonstrate that the long-range correlation is essential for accurate depth estimation. Moreover, the Transformer and convolution are good at long-range and close-range depth estimation, respectively. Therefore, we propose to adopt a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former can model global context with the effective attention mechanism and the latter aims to preserve the local information as the Transformer lacks the spatial inductive bias in modeling such contents. However, independent branches lead to a shortage of connections between features. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module to enhance the Transformer features and model the affinity between the heterogeneous features in a set-to-set translation manner. Due to the unbearable memory cost introduced by the global attention on high-resolution feature maps, we adopt the deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins. The effectiveness of each proposed module is elaborately evaluated through meticulous and intensive ablation studies.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:2731-538X
1476-8186
2153-182X
2731-5398
1751-8520
2153-1838
DOI:10.1007/s11633-023-1458-0