Monocular Depth Estimation Network Based on Swin Transformer
Published in | Journal of Physics: Conference Series, Vol. 2428, No. 1, pp. 12019–12024 |
---|---|
Format | Journal Article |
Language | English |
Published | Bristol: IOP Publishing, 01.02.2023 |
Summary: | Estimating depth from a single image is challenging because a single 2D image may correspond to many different 3D scenes. Most deep-learning-based depth prediction methods extract depth features using small convolutional kernels with small receptive fields, which results in deformed depth edges and inaccurate depth values for distant objects. To address this problem, we propose a depth estimation network based on the Swin Transformer and an encoder-decoder structure. The encoder is built from the Swin Transformer network, which can encode long-range spatial dependencies and extract features at multiple scales and across different channels. The decoder fuses the features from the encoder through interpolation, concatenation, and convolution. Experiments on the KITTI and NYUv2 datasets show that the proposed network produces more accurate depth edges and depth values than state-of-the-art methods. |
---|---|
ISSN: | 1742-6588, 1742-6596 |
DOI: | 10.1088/1742-6596/2428/1/012019 |
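The abstract states only that the decoder fuses encoder features by interpolation, concatenation, and convolution. The following is a minimal PyTorch sketch of one such fusion step, not the authors' implementation: the module name `DecoderFusionBlock`, the choice of bilinear interpolation, and all channel and spatial sizes are assumptions for illustration.

```python
# Hedged sketch of the interpolate -> concatenate -> convolve fusion
# step described in the abstract; all sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderFusionBlock(nn.Module):
    """Upsample a coarse decoder feature map, concatenate it with the
    matching-resolution encoder (skip) features, and fuse with a conv."""

    def __init__(self, decoder_channels, skip_channels, out_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(decoder_channels + skip_channels, out_channels,
                      kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        # Interpolate the coarse map to the skip connection's spatial size.
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear",
                          align_corners=False)
        # Concatenate along the channel dimension, then fuse by convolution.
        return self.fuse(torch.cat([x, skip], dim=1))


block = DecoderFusionBlock(decoder_channels=256, skip_channels=128,
                           out_channels=128)
coarse = torch.randn(1, 256, 15, 20)  # low-resolution decoder features
skip = torch.randn(1, 128, 30, 40)    # encoder features at finer resolution
fused = block(coarse, skip)           # shape: (1, 128, 30, 40)
```

In a full network this block would be applied at each decoder stage, with the skip features coming from the corresponding Swin Transformer stage of the encoder.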