Toward Domain Independence for Learning-Based Monocular Depth Estimation

Bibliographic Details
Published in IEEE Robotics and Automation Letters Vol. 2; No. 3; pp. 1778-1785
Main Authors Mancini, Michele, Costante, Gabriele, Valigi, Paolo, Ciarfuglia, Thomas A., Delmerico, Jeffrey, Scaramuzza, Davide
Format Journal Article
Language English
Published IEEE 01.07.2017

More Information
Summary: Modern autonomous mobile robots require a strong understanding of their surroundings in order to operate safely in cluttered and dynamic environments. Monocular depth estimation offers a geometry-independent paradigm for detecting free, navigable space with minimal space and power consumption. These are highly desirable features, especially for micro aerial vehicles. To guarantee robust operation in real-world scenarios, the estimator must generalize well across diverse environments. Most existing depth estimators do not consider generalization and only benchmark their performance on publicly available datasets after dataset-specific fine-tuning. Generalization can be achieved by training on several heterogeneous datasets, but collecting and labeling them is costly. In this letter, we propose a deep neural network for scene depth estimation that is trained on synthetic datasets, which allow inexpensive generation of ground-truth data. We show how this approach generalizes well across different scenarios. In addition, we show how adding long short-term memory layers to the network helps alleviate, in sequential image streams, some of the intrinsic limitations of monocular vision, such as global scale estimation, with low computational overhead. We demonstrate that the network generalizes well to different real-world environments without any fine-tuning, achieving performance comparable to state-of-the-art methods on the KITTI dataset.
ISSN:2377-3766
DOI:10.1109/LRA.2017.2657002
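For readers who want a concrete picture of the kind of model the summary outlines (a convolutional depth-estimation network augmented with long short-term memory layers for sequential image streams), the PyTorch sketch below is a minimal, hypothetical illustration. The module name SeqDepthNet, all layer sizes, and the pooling/fusion scheme are assumptions made for this example; they are not taken from the paper and do not reproduce the authors' architecture or their training on synthetic data.

```python
# Illustrative sketch only: a per-frame CNN encoder, an LSTM over the frame
# features, and an upsampling decoder producing a dense depth map per frame.
# All layer sizes and the fusion scheme are assumptions, not the paper's design.
import torch
import torch.nn as nn

class SeqDepthNet(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        # Per-frame convolutional encoder (downsamples the input by 8).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # LSTM over globally pooled per-frame features: accumulates temporal
        # cues across the image stream that a single frame cannot provide.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Fuse the temporal state back into the spatial features, then upsample
        # to a dense, positive depth map.
        self.fuse = nn.Conv2d(feat_dim + hidden_dim, 128, 3, padding=1)
        self.decoder = nn.Sequential(
            nn.ReLU(),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 1, 3, padding=1),
            nn.Softplus(),  # depths are strictly positive
        )

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w))   # (b*t, F, h/8, w/8)
        pooled = feats.mean(dim=(2, 3)).reshape(b, t, -1)      # (b, t, F)
        temporal, _ = self.lstm(pooled)                         # (b, t, H)
        # Tile each frame's temporal state over the spatial grid and fuse.
        hs = temporal.reshape(b * t, -1, 1, 1).expand(-1, -1, feats.shape[2], feats.shape[3])
        depth = self.decoder(self.fuse(torch.cat([feats, hs], dim=1)))
        return depth.reshape(b, t, 1, h, w)                     # dense depth per frame

if __name__ == "__main__":
    model = SeqDepthNet()
    dummy = torch.randn(2, 4, 3, 96, 128)   # 2 sequences of 4 frames
    print(model(dummy).shape)               # torch.Size([2, 4, 1, 96, 128])
```

The design choice sketched here mirrors the abstract's reasoning: per-frame convolutional features capture local appearance, while the recurrent state carries information across the image stream, which is where cues about quantities a single monocular frame cannot resolve, such as global scale, would have to come from.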