PDANet: Self-Supervised Monocular Depth Estimation Using Perceptual and Data Augmentation Consistency

Bibliographic Details
Published in	Applied sciences Vol. 11; no. 12; p. 5383
Main Authors	Gao, Huachen; Liu, Xiaoyu; Qu, Meixia; Huang, Shijie
Format	Journal Article
Language	English
Published	Basel MDPI AG 01.06.2021
Subjects	Ablation; Cameras; Data augmentation; depth estimation; Depth perception; Error analysis; Hypotheses; Image enhancement; low texture; perceptual consistency; self-supervised; Semantics; Sensors; Teaching methods
Summary:	Recent studies have explored self-supervised learning for monocular depth estimation. These methods use the image reconstruction loss, rather than ground-truth depth, as the supervisory signal. However, existing methods usually assume that corresponding points in different views have the same color, which makes the unsupervised signal unreliable and ultimately corrupts the reconstruction loss during training. Moreover, in low-texture regions the model cannot predict pixel disparities correctly because too few features are extracted. To address these issues, we propose PDANet, a network that integrates perceptual consistency and data augmentation consistency, both more reliable unsupervised signals, into a regular unsupervised depth estimation model. Specifically, we apply a data augmentation mechanism and minimize the difference between the disparity maps generated from the original image and from the augmented image, which makes the predictions more robust to color fluctuations. At the same time, we aggregate features from different layers of a pre-trained VGG16 network to capture higher-level perceptual differences between the input image and the generated one. Ablation studies demonstrate the effectiveness of each component, and PDANet produces high-quality depth estimates on the KITTI benchmark, reducing the absolute relative error of the state-of-the-art method from 0.114 to 0.084.
ISSN:	2076-3417
DOI:	10.3390/app11125383
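
Note on the reported metric: the summary gives results as absolute relative error but does not define it. The sketch below uses the definition commonly applied in depth-estimation benchmarks, the mean of |predicted - true| / true over pixels with valid ground truth. It is an illustration only and not taken from the paper; the function names, the flat-list inputs, and the convention that a ground-truth value of zero marks a missing measurement are all assumptions.

```python
def abs_rel_error(pred, gt):
    """Mean of |pred - gt| / gt over pixels whose ground-truth depth is valid.

    Assumption: a ground-truth depth <= 0 marks a pixel with no measurement,
    so such pixels are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_reduction(before, after):
    """Fraction by which an error value dropped, relative to the starting value."""
    return (before - after) / before


# Toy depths in metres (hypothetical values); the third pixel has no ground truth.
gt = [10.0, 20.0, 0.0, 40.0]
pred = [11.0, 18.0, 5.0, 40.0]
toy_error = abs_rel_error(pred, gt)

# The two figures quoted in the summary.
reduction = relative_reduction(0.114, 0.084)
```

Applied to the figures quoted in the summary, `relative_reduction(0.114, 0.084)` is 0.030 / 0.114, or roughly a 26% reduction in absolute relative error.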