VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry
Published in: IEEE Robotics and Automation Letters, Vol. 3, No. 4, pp. 4407-4414
Main Authors: Noha Radwan, Abhinav Valada, Wolfram Burgard
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2018
Summary: Semantic understanding and localization are fundamental enablers of robot autonomy that have, for the most part, been tackled as disjoint problems. While deep learning has enabled recent breakthroughs across a wide spectrum of scene understanding tasks, its applicability to state estimation tasks has been limited due to the direct formulation that renders it incapable of encoding scene-specific constraints. In this letter, we propose the VLocNet++ architecture, which employs a multitask learning approach to exploit the inter-task relationship between learning semantics and regressing the 6-DoF global pose and odometry, for the mutual benefit of each of these tasks. Our network overcomes the aforementioned limitation by simultaneously embedding geometric and semantic knowledge of the world into the pose regression network. We propose a novel adaptive weighted fusion layer to aggregate motion-specific temporal information and to fuse semantic features into the localization stream based on region activations. Furthermore, we propose a self-supervised warping technique that uses the relative motion to warp intermediate network representations in the segmentation stream for learning consistent semantics. Finally, we introduce a first-of-a-kind urban outdoor localization dataset with pixel-level semantic labels and multiple loops for training deep networks. Extensive experiments on the challenging Microsoft 7-Scenes benchmark and our DeepLoc dataset demonstrate that our approach exceeds the state of the art, outperforming local feature-based methods while simultaneously performing multiple tasks and exhibiting substantial robustness in challenging scenarios.
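To make the adaptive weighted fusion idea from the summary concrete, the sketch below shows one possible formulation in PyTorch: per-location weights are predicted from the concatenated localization and semantic feature maps and used to blend the two streams, so the fusion depends on region activations. The class name, channel sizes, and softmax-based weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a region-adaptive weighted fusion layer (assumed formulation,
# not the VLocNet++ authors' code).
import torch
import torch.nn as nn


class AdaptiveWeightedFusion(nn.Module):
    """Blends two feature maps using per-location weights predicted from their activations."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution maps the concatenated streams to two weighting maps.
        self.weight_pred = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, loc_feat: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        # Normalize the two weight maps per pixel so they sum to one.
        weights = torch.softmax(
            self.weight_pred(torch.cat([loc_feat, sem_feat], dim=1)), dim=1
        )
        # Region-dependent weighted sum of localization and semantic features.
        return weights[:, 0:1] * loc_feat + weights[:, 1:2] * sem_feat


# Example usage: fuse semantic features into the localization stream.
fusion = AdaptiveWeightedFusion(channels=256)
fused = fusion(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```

A similar gating idea could, under the same caveats, be used to aggregate motion-specific temporal features from the odometry stream; the paper itself should be consulted for the exact fusion scheme.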
ISSN: 2377-3766
DOI: 10.1109/LRA.2018.2869640