Self-Supervised Monocular Depth Estimation Using Hybrid Transformer Encoder


Bibliographic Details
Published in: IEEE Sensors Journal, Vol. 22, no. 19, pp. 18762-18770
Main Authors: Hwang, Seung-Jun; Park, Sung-Jun; Baek, Joong-Hwan; Kim, Byungkyu
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2022

Summary: Depth estimation using monocular camera sensors is an important technique in computer vision. Supervised monocular depth estimation requires large amounts of data acquired from depth sensors. However, acquiring depth data is expensive, and sensor limitations sometimes make it impossible. View-synthesis-based depth estimation is a self-supervised learning approach that requires no depth supervision. Previous studies mainly use convolutional neural network (CNN)-based encoders. CNNs are suited to extracting local features through convolution operations. Recent vision transformers (ViTs) are suited to global feature extraction based on multihead self-attention modules. In this article, we propose a hybrid network combining CNN and ViT networks for self-supervised monocular depth estimation. We design an encoder-decoder structure that uses CNNs in the earlier stages to extract local features and a ViT in the later stages to extract global features. We evaluate the proposed network through experiments on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) and Cityscapes datasets. The results show higher accuracy than previous studies, with fewer parameters and less computation. Codes and trained models are available at https://github.com/fogfog2/manydepthformer.
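To illustrate the hybrid idea described in the summary (CNN stages for local features, then attention for global context), here is a minimal NumPy sketch of one such stage. This is an assumption-laden toy, not the authors' implementation (see the linked repository for that): the function names, shapes, and single-head attention are illustrative choices, and the weights are random.

```python
import numpy as np

def local_conv(x, kernel):
    """Local feature extraction: 'same'-padded 1-D convolution over a
    token sequence x of shape (T, C_in) with kernel (k, C_in, C_out).
    Each output token depends only on a small neighborhood (CNN-like)."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([
        sum(xp[t + i] @ kernel[i] for i in range(k))
        for t in range(x.shape[0])
    ])

def self_attention(x, Wq, Wk, Wv):
    """Global feature extraction: single-head scaled dot-product
    self-attention; every token attends to every other token."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ v

def hybrid_encoder_stage(x, conv_kernel, Wq, Wk, Wv):
    """Toy hybrid stage: CNN first (local), attention after (global),
    mirroring the early-CNN / late-ViT ordering in the summary."""
    local = np.maximum(local_conv(x, conv_kernel), 0.0)  # ReLU
    return self_attention(local, Wq, Wk, Wv)

# Tiny demo with random weights (shapes only; not a trained model).
rng = np.random.default_rng(0)
T, C = 8, 4
x = rng.standard_normal((T, C))
kernel = 0.1 * rng.standard_normal((3, C, C))
Wq, Wk, Wv = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
features = hybrid_encoder_stage(x, kernel, Wq, Wk, Wv)
assert features.shape == (T, C)
```

The ordering matters: the convolution sees only a fixed neighborhood per token, while the attention step mixes all tokens, so placing attention after the CNN lets global reasoning operate on already-aggregated local features.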
ISSN: 1530-437X, 1558-1748
DOI: 10.1109/JSEN.2022.3199265