Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks

Bibliographic Details
Published in	IEEE Transactions on Neural Networks and Learning Systems Vol. 32; no. 5; pp. 2075-2089
Main Authors	Fan, Deng-Ping, Lin, Zheng, Zhang, Zhao, Zhu, Menglong, Cheng, Ming-Ming
Format	Journal Article
Language	English
Published	United States IEEE 01.05.2021 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Algorithms; Benchmark; Benchmark testing; Benchmarking; Benchmarks; Cameras; Color; Computer Systems; Data models; Datasets; Humans; Image resolution; Learning; Learning systems; Lighting; Machine Learning; Measurement; Neural Networks, Computer; Object recognition; Pattern Recognition, Automated - methods; RGB-D; Salience; saliency; salient object detection (SOD); Salient Person (SIP) data set; Smart phones

More Information
Summary:	The use of RGB-D information for salient object detection (SOD) has been extensively explored in recent years. However, relatively few efforts have been put toward modeling SOD in real-world human activity scenes with RGB-D. In this article, we fill the gap by making the following contributions to RGB-D SOD: 1) we carefully collect a new Salient Person (SIP) data set that consists of ~1K high-resolution images that cover diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds; 2) we conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research, and we systematically summarize 32 popular models and evaluate 18 parts of 32 models on seven data sets containing a total of about 97K images; and 3) we propose a simple general architecture, called deep depth-depurator network (D3Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which perform low-quality depth map filtering and cross-modal feature learning, respectively. These components form a nested structure and are elaborately designed to be learned jointly. D3Net exceeds the performance of any prior contenders across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D3Net can be used to efficiently extract salient object masks from real scenes, enabling effective background-changing application with a speed of 65 frames/s on a single GPU. All the saliency maps, our new SIP data set, the D3Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.
ISSN:	2162-237X, 2162-2388
DOI:	10.1109/TNNLS.2020.2996406