Multitask Deep Neural Networks for Tele-Wide Stereo Matching

Bibliographic Details
Published in: IEEE Access, Vol. 8, pp. 184383-184398
Main Authors: El-Khamy, Mostafa; Ren, Haoyu; Du, Xianzhi; Lee, Jungwon
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2020

Summary: In this article, we propose deep learning solutions for estimating the real-world depth of elements in a scene captured by two cameras with different fields of view. We consider a realistic smartphone scenario, where the first field of view (FOV) is a wide FOV with 1× optical zoom, and the second FOV, captured by a tele zoom lens with 2× optical zoom, is contained within the first. We refer to the problem of estimating the depth for all elements in the union of the FOVs, which corresponds to the Wide FOV, as 'tele-wide stereo matching'. Traditional stereo matching algorithms can only estimate the disparity or depth in the overlapping FOV, which corresponds to the Tele FOV. To benchmark this novel problem, we introduce a single-image inverse-depth estimation (SIDE) solution that estimates the disparity from the Wide FOV image alone. We also design a novel multitask tele-wide stereo matching deep neural network (MT-TW-SMNet), which is the first to combine the stereo matching and single-image depth tasks in one network. Moreover, we propose multiple methods for fusing these networks. One is input feature fusion, which uses the disparity estimated by stereo matching as an additional input feature for SIDE. We also design networks for decision fusion, which deploy a stacked hourglass (SHG) network to fuse and refine the disparity maps produced by the SIDE network and the MT-TW-SMNet. These fusion schemes significantly improve accuracy. Experimental results on the KITTI and SceneFlow datasets demonstrate that our proposed approaches provide a reasonable solution to the tele-wide stereo matching problem. We also demonstrate the effectiveness of our solutions in generating the Bokeh effect over the full Wide FOV.
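
The input feature fusion idea described in the summary can be pictured with a short sketch: an encoder-decoder, in the spirit of SIDE, that takes the Wide-FOV RGB image concatenated with the partial disparity map obtained by stereo matching over the overlapping Tele FOV. The PyTorch snippet below is a minimal illustrative sketch only; the class name FusedSIDE, the layer sizes, and the zero-padding of disparity outside the Tele FOV are our own assumptions, not the authors' MT-TW-SMNet architecture.

# Minimal sketch of input feature fusion (illustrative assumptions, not the paper's network):
# a SIDE-style encoder-decoder mapping (Wide RGB + partial Tele-FOV disparity) -> dense disparity.
import torch
import torch.nn as nn


class FusedSIDE(nn.Module):
    """Encoder-decoder that maps (RGB + partial disparity) to a dense disparity map."""

    def __init__(self, base_ch: int = 32):
        super().__init__()
        # 4 input channels: 3 (Wide-FOV RGB) + 1 (disparity, valid only inside the Tele FOV).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, base_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base_ch * 2, base_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base_ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, wide_rgb: torch.Tensor, partial_disp: torch.Tensor) -> torch.Tensor:
        # Input feature fusion: concatenate the stereo-matched disparity
        # (zeros outside the Tele FOV) with the Wide-FOV RGB image.
        x = torch.cat([wide_rgb, partial_disp], dim=1)
        return self.decoder(self.encoder(x))


if __name__ == "__main__":
    net = FusedSIDE()
    rgb = torch.randn(1, 3, 128, 256)                       # Wide-FOV image
    disp = torch.zeros(1, 1, 128, 256)                      # partial disparity map
    disp[:, :, 32:96, 64:192] = torch.rand(1, 1, 64, 128)   # valid only inside the Tele FOV
    print(net(rgb, disp).shape)                             # torch.Size([1, 1, 128, 256])

In the decision fusion variant described in the summary, the single-image and stereo branches would instead each produce a full disparity map, and a separate refinement network (a stacked hourglass in the paper) would fuse the two; the sketch above covers only the input-feature case.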
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.3029085