Forecasting of depth and ego-motion with transformers and self-supervision

Bibliographic Details
Published in: arXiv.org
Main Authors: Boulahbal, Houssem; Voicila, Adrian; Comport, Andrew
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 15.06.2022

Summary: This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego-motion. Given a sequence of raw images, the aim is to forecast both the geometry and the ego-motion using a self-supervised photometric loss. The architecture combines convolution and transformer modules, leveraging the benefits of both: the inductive bias of CNNs and the multi-head attention of transformers. This enables a rich spatio-temporal representation that supports accurate depth forecasting. Prior work attempts to solve this problem using multi-modal input/output with supervised ground-truth data, which is impractical because it requires a large annotated dataset. In contrast, this paper forecasts depth and ego-motion using only self-supervised raw images as input. The approach performs well on the KITTI benchmark, with several performance criteria even comparable to those of prior non-forecasting self-supervised monocular depth inference methods.
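
The record gives no implementation details beyond the abstract, but the self-supervised photometric loss it mentions typically compares a target frame against a source frame warped into the target view using the predicted depth and pose. As a rough illustration only, the PyTorch sketch below shows one common form of such a loss (a weighted SSIM + L1 combination, as used in Monodepth-style methods); the weight alpha=0.85 and the 3x3 pooling window are conventional choices assumed here, not values taken from this paper.

    import torch
    import torch.nn.functional as F

    def photometric_loss(target, reconstructed, alpha=0.85):
        # target, reconstructed: (B, 3, H, W) image tensors in [0, 1].
        # `reconstructed` is assumed to be a source frame warped into the
        # target view using the network's predicted depth and ego-motion.
        l1 = (target - reconstructed).abs().mean(1, keepdim=True)

        # Simplified single-scale SSIM computed with 3x3 average pooling.
        mu_x = F.avg_pool2d(target, 3, 1, 1)
        mu_y = F.avg_pool2d(reconstructed, 3, 1, 1)
        sigma_x = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_x ** 2
        sigma_y = F.avg_pool2d(reconstructed ** 2, 3, 1, 1) - mu_y ** 2
        sigma_xy = F.avg_pool2d(target * reconstructed, 3, 1, 1) - mu_x * mu_y
        c1, c2 = 0.01 ** 2, 0.03 ** 2
        ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
            (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
        ssim_err = torch.clamp((1 - ssim) / 2, 0, 1).mean(1, keepdim=True)

        # Weighted combination; minimizing this trains the depth and pose
        # networks without any ground-truth supervision.
        return (alpha * ssim_err + (1 - alpha) * l1).mean()

Minimizing a loss of this kind over image sequences is what lets the abstract's method learn depth and ego-motion from raw images alone.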
ISSN: 2331-8422