Monocular Depth Estimation Network Based on Swin Transformer
Published in | Journal of Physics: Conference Series, Vol. 2428, No. 1, pp. 12019–12024 |
---|---|
Format | Journal Article |
Language | English |
Published | Bristol: IOP Publishing, 01.02.2023 |
Summary: | Estimating depth from a single image is challenging because a single 2D image may correspond to many different 3D scenes. Most deep-learning-based depth prediction methods extract depth features using small convolutional kernels with small receptive fields, which results in deformed depth edges and inaccurate depth values for distant objects. To address this problem, we propose a depth estimation network based on the Swin Transformer and an encoder-decoder structure. The encoder is built from the Swin Transformer network, which can encode long-range spatial dependencies and extract features at multiple scales and across different channels. The decoder fuses the features from the encoder through interpolation, concatenation, and convolution. Experiments on the KITTI and NYUv2 datasets show that the proposed network produces more accurate depth edges and depth values than state-of-the-art methods. |
---|---|
ISSN: | 1742-6588, 1742-6596 |
DOI: | 10.1088/1742-6596/2428/1/012019 |
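The abstract states only that the decoder fuses encoder features by interpolation, concatenation, and convolution. The following is a minimal PyTorch sketch of one such fusion step, not the authors' implementation: the module name `DecoderFusionBlock`, the choice of bilinear interpolation, and all channel and spatial sizes are assumptions for illustration.

```python
# Hedged sketch of the interpolate -> concatenate -> convolve fusion
# step described in the abstract; all sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderFusionBlock(nn.Module):
    """Upsample a coarse decoder feature map, concatenate it with the
    matching-resolution encoder (skip) features, and fuse with a conv."""

    def __init__(self, decoder_channels, skip_channels, out_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(decoder_channels + skip_channels, out_channels,
                      kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        # Interpolate the coarse map to the skip connection's spatial size.
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear",
                          align_corners=False)
        # Concatenate along the channel dimension, then fuse by convolution.
        return self.fuse(torch.cat([x, skip], dim=1))


block = DecoderFusionBlock(decoder_channels=256, skip_channels=128,
                           out_channels=128)
coarse = torch.randn(1, 256, 15, 20)  # low-resolution decoder features
skip = torch.randn(1, 128, 30, 40)    # encoder features at finer resolution
fused = block(coarse, skip)           # shape: (1, 128, 30, 40)
```

In a full network this block would be applied at each decoder stage, with the skip features coming from the corresponding Swin Transformer stage of the encoder.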