Collaborative Deconvolutional Neural Networks for Joint Depth Estimation and Semantic Segmentation

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 11, pp. 5655-5666
Main Authors: Liu, Jing; Wang, Yuhang; Li, Yong; Fu, Jun; Li, Jiangyun; Lu, Hanqing
Format: Journal Article
Language: English
Published: United States, The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2018

Summary: Semantic segmentation and single-view depth estimation are two fundamental problems in computer vision. They exploit the semantic and geometric properties of images, respectively, and are thus complementary for scene understanding. In this paper, we propose a collaborative deconvolutional neural network (C-DCNN) to jointly model the two problems so that each promotes the other. The C-DCNN consists of two DCNNs, one per task, which provide a finer-resolution reconstruction method and are pretrained with hierarchical supervision. The feature maps from the two DCNNs are integrated via a pointwise bilinear layer, which fuses the semantic and depth information and produces higher-order features. The integrated features are then fed into two sibling classification layers that simultaneously learn semantic segmentation and depth estimation. In this way, we combine the semantic and depth features in a unified deep network and jointly train them to benefit each other. Specifically, during network training we treat depth estimation as a classification problem: a soft mapping strategy maps the continuous depth values into discrete probability distributions, and the cross-entropy loss is used. In addition, a fully connected conditional random field is applied as postprocessing to further improve semantic segmentation, jointly considering the proximity relations of pixels in position, intensity, and depth. We evaluate our approach on two challenging benchmarks, NYU Depth V2 and SUN RGB-D, and demonstrate that it effectively exploits the two kinds of information, achieving state-of-the-art results on both the semantic segmentation and depth estimation tasks.
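The soft mapping and the pointwise bilinear fusion described in the summary can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the Gaussian weighting of the soft labels, the log-spaced bin layout, and all function names are assumptions introduced here for clarity.

```python
import numpy as np

def soft_depth_labels(depth, bin_centers, sigma=0.5):
    """Map a continuous depth value to a soft distribution over discrete
    depth bins. Gaussian weighting by distance to each bin center is one
    plausible realization of the soft mapping; the paper's exact scheme
    may differ."""
    w = np.exp(-((bin_centers - depth) ** 2) / (2 * sigma ** 2))
    return w / w.sum()

def cross_entropy(soft_target, predicted_probs, eps=1e-12):
    """Cross-entropy between the soft target distribution and a
    predicted distribution over the same bins."""
    return -np.sum(soft_target * np.log(predicted_probs + eps))

def pointwise_bilinear(feat_a, feat_b):
    """Fuse two feature maps of shapes (C1, H, W) and (C2, H, W) via a
    per-pixel outer product, producing higher-order (C1*C2, H, W)
    features, in the spirit of the pointwise bilinear layer."""
    c1, h, w = feat_a.shape
    c2 = feat_b.shape[0]
    return np.einsum('ihw,jhw->ijhw', feat_a, feat_b).reshape(c1 * c2, h, w)

# Illustrative usage: 10 log-spaced depth bins between 0.5 m and 10 m.
bins = np.geomspace(0.5, 10.0, 10)
target = soft_depth_labels(2.3, bins)  # sums to 1, peaks at the nearest bin
```

Because the soft target spreads probability mass over neighboring bins, the cross-entropy loss penalizes near-miss predictions less than distant ones, which is the motivation for discretizing depth this way instead of using hard one-hot labels.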
ISSN: 2162-237X
EISSN: 2162-2388
DOI: 10.1109/TNNLS.2017.2787781