Viewpoint invariant semantic object and scene categorization with RGB-D sensors

Bibliographic Details
Published in: Autonomous Robots, Vol. 43, No. 4, pp. 1005–1022
Main Authors: Zaki, Hasan F. M.; Shafait, Faisal; Mian, Ajmal
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.04.2019

Summary: Understanding the semantics of objects and scenes using multi-modal RGB-D sensors serves many robotics applications. Key challenges for accurate RGB-D image recognition are the scarcity of training data, variations due to viewpoint changes, and the heterogeneous nature of the data. We address these problems and propose a generic deep learning framework based on a pre-trained convolutional neural network as a feature extractor for both the colour and depth channels. We propose a rich multi-scale feature representation, referred to as the convolutional hypercube pyramid (HP-CNN), that is able to encode discriminative information from the convolutional tensors at different levels of detail. We also present a technique to fuse the proposed HP-CNN with the activations of fully connected neurons based on an extreme learning machine classifier in a late fusion scheme, which leads to a highly discriminative and compact representation. To further improve performance, we devise HP-CNN-T, a view-invariant descriptor extracted from a multi-view 3D object pose (M3DOP) model. M3DOP is learned from over 140,000 RGB-D images that are synthetically generated by rendering CAD models from different viewpoints. Extensive evaluations on four RGB-D object and scene recognition datasets demonstrate that our HP-CNN and HP-CNN-T consistently outperform state-of-the-art methods for several recognition tasks by a significant margin.
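
As an illustration of the multi-scale descriptor summarised above, the short Python sketch below pools the convolutional tensors from several depths of a pre-trained CNN into one fixed-length vector per image. It is not the authors' implementation: the VGG-16 backbone, the choice of pooling stages and the 4 x 4 pooling grid are assumptions made purely for illustration.

import torch
import torch.nn.functional as F
from torchvision import models

def hypercube_pyramid_descriptor(image_batch, pool_size=4):
    """Pool the convolutional tensors from several depths of a pre-trained
    VGG-16 to a fixed spatial grid and concatenate them into one
    multi-scale vector per image (illustrative sketch only)."""
    backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
    feats = []
    x = image_batch
    with torch.no_grad():
        for layer in backbone:
            x = layer(x)
            # Keep the feature map after each max-pooling stage (5 scales in VGG-16).
            if isinstance(layer, torch.nn.MaxPool2d):
                pooled = F.adaptive_max_pool2d(x, pool_size)   # B x C x 4 x 4
                feats.append(pooled.flatten(start_dim=1))      # B x (C * 16)
    return torch.cat(feats, dim=1)

# Placeholder usage: a dummy RGB mini-batch.
rgb_batch = torch.randn(2, 3, 224, 224)
descriptor = hypercube_pyramid_descriptor(rgb_batch)
print(descriptor.shape)   # torch.Size([2, 23552]) under these assumptions

In the paper, such descriptors are computed separately for the colour and depth channels and later fused with the fully connected activations through an extreme learning machine classifier; the sketch only covers the per-channel feature extraction step.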
ISSN: 0929-5593
EISSN: 1573-7527
DOI: 10.1007/s10514-018-9776-8