Instance-Aware Multi-Object Self-Supervision for Monocular Depth Prediction

Bibliographic Details
Published in	IEEE Robotics and Automation Letters, Vol. 7, no. 4, pp. 10962-10968
Main Authors	Boulahbal, Houssem Eddine; Voicila, Adrian; Comport, Andrew I.
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2022
Subjects	Artificial Intelligence; Automatic; Benchmarks; Cameras; Computer Science; Computer Vision and Pattern Recognition; Depth prediction; Dynamics; Engineering Sciences; Head; motion prediction; multi-object detection; Object motion; Performance degradation; Pose estimation; Proposals; Robotics; Semantics; Signal and Image processing; Training; Transformers
Summary:	This letter proposes a self-supervised monocular image-to-depth prediction framework trained with an end-to-end photometric loss that handles not only 6-DOF camera motion but also 6-DOF motion of moving object instances. Self-supervision is performed by warping images across a video sequence using depth and scene motion, including the motion of object instances. One novelty of the proposed method is the use of the multi-head attention of a transformer network to match moving objects across time and to model their interaction and dynamics. This enables accurate and robust pose estimation for each object instance. Most image-to-depth prediction frameworks assume rigid scenes, which largely degrades their performance on dynamic objects. Only a few state-of-the-art (SOTA) papers have accounted for dynamic objects. The proposed method is shown to outperform these methods on standard benchmarks, and the impact of dynamic motion on these benchmarks is exposed. Furthermore, the proposed image-to-depth prediction framework is shown to be competitive with SOTA video-to-depth prediction frameworks.
ISSN:	2377-3766
DOI:	10.1109/LRA.2022.3194951
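
The warping-based self-supervision described in the summary can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the nearest-neighbour sampling, the plain L1 photometric error, and the convention that each object's motion is applied in the target camera frame before the camera motion are all assumptions made here for illustration. Depth, poses, and instance masks are taken as given inputs, whereas the framework in the letter predicts them with networks.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel of an HxW depth map to a 3D point in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # K^-1 [u, v, 1]^T for each pixel
    return rays * depth[..., None]           # scale each ray by its depth

def transform(points, T):
    """Apply a 4x4 rigid transform to an array of 3D points (last axis = xyz)."""
    return points @ T[:3, :3].T + T[:3, 3]

def warp_to_target(source, depth, K, T_cam, masks=(), T_objs=()):
    """Synthesise the target view by sampling the source image.

    T_cam maps target-frame points into the source camera frame.
    Each (mask, T_obj) pair gives an object instance and its own 6-DOF
    motion; assumed convention: T_obj is applied first, then T_cam.
    """
    h, w = depth.shape
    points = backproject(depth, K)
    moved = transform(points, T_cam)                      # static-scene motion
    for mask, T_obj in zip(masks, T_objs):
        moved[mask] = transform(transform(points[mask], T_obj), T_cam)

    proj = moved @ K.T                                    # pinhole projection
    z = proj[..., 2]
    safe_z = np.where(z > 1e-9, z, 1.0)                   # avoid division by zero
    u = np.rint(proj[..., 0] / safe_z)
    v = np.rint(proj[..., 1] / safe_z)
    valid = (z > 1e-9) & (u >= 0) & (u <= w - 1) & (v >= 0) & (v <= h - 1)

    ui = np.clip(u, 0, w - 1).astype(int)
    vi = np.clip(v, 0, h - 1).astype(int)
    return source[vi, ui], valid                          # nearest-neighbour sample

def photometric_loss(target, source, depth, K, T_cam, masks=(), T_objs=()):
    """Mean absolute intensity difference over pixels with a valid projection."""
    warped, valid = warp_to_target(source, depth, K, T_cam, masks, T_objs)
    if not valid.any():
        return 0.0
    return float(np.abs(target - warped)[valid].mean())
```

With identity camera and object poses the warped image equals the source, so the loss against an identical target is zero. A rigid-scene model corresponds to calling `photometric_loss` with no masks; passing per-instance masks and transforms lets pixels on moving objects be explained by their own motion instead of contributing spurious photometric error.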