Depth Pooling Based Large-Scale 3-D Action Recognition With Convolutional Neural Networks


Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 20, No. 5, pp. 1051-1061
Main Authors: Wang, Pichao; Li, Wanqing; Gao, Zhimin; Tang, Chang; Ogunbona, Philip O.
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2018
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2018.2818329

More Information
Summary: This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as dynamic depth images (DDI), dynamic depth normal images (DDNI), and dynamic depth motion normal images (DDMNI), for both isolated and continuous action recognition. These dynamic images are constructed from a segmented sequence of depth maps using hierarchical bidirectional rank pooling to effectively capture the spatial-temporal information. Specifically, DDI exploits the dynamics of postures over time, while DDNI and DDMNI exploit the 3-D structural information captured by depth maps. Building on the proposed representations, a convolutional neural network (ConvNet)-based method is developed for action recognition. The image-based representations make it possible to fine-tune existing ConvNet models trained on image data without training a large number of parameters from scratch. The proposed method achieved state-of-the-art results on three large datasets, namely, the large-scale continuous gesture recognition dataset (mean Jaccard index 0.4109), the large-scale isolated gesture recognition dataset (59.21%), and the NTU RGB+D dataset (87.08% cross-subject and 84.22% cross-view), even though only the depth modality was used.
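The core step described in the summary is collapsing a (segmented) depth sequence into single dynamic images via hierarchical bidirectional rank pooling before ConvNet fine-tuning. The sketch below is only illustrative: it uses the closed-form approximate rank pooling coefficients of Bilen et al. instead of the exact RankSVM-based pooling, and the helper names (approx_rank_pool, bidirectional_dynamic_images), the segment count, the two-level hierarchy, and the toy input are assumptions, not the authors' settings.

```python
import numpy as np

def approx_rank_pool(frames):
    """Approximate rank pooling: collapse an ordered stack of frames
    (T, H, W) into one (H, W) 'dynamic image' by a weighted sum whose
    weights come from a closed-form approximation of the ranking solution.
    """
    T = len(frames)
    # Harmonic numbers H_0..H_T with H_0 = 0.
    harmonics = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1})
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonics[T] - harmonics[t - 1])
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=1)

def bidirectional_dynamic_images(depth_seq, n_segments=4):
    """Sketch of hierarchical bidirectional pooling: split the sequence
    into segments, pool each segment forward and backward in time, then
    pool the per-segment images again. n_segments is an illustrative
    choice, not the paper's configuration."""
    segments = np.array_split(depth_seq, n_segments)
    fwd = [approx_rank_pool(seg) for seg in segments]          # forward direction
    bwd = [approx_rank_pool(seg[::-1]) for seg in segments]    # reversed direction
    # Second level of the hierarchy: pool the segment-level images.
    return approx_rank_pool(np.stack(fwd)), approx_rank_pool(np.stack(bwd))

if __name__ == "__main__":
    seq = np.random.rand(64, 240, 320).astype(np.float32)  # toy depth sequence
    ddi_fwd, ddi_bwd = bidirectional_dynamic_images(seq)
    print(ddi_fwd.shape, ddi_bwd.shape)  # (240, 320) (240, 320)
```

The resulting forward and backward dynamic images (here standing in for DDI; DDNI and DDMNI would apply the same pooling to depth-derived normal and motion-normal maps) can then be rescaled to 8-bit images and fed to an image-pretrained ConvNet for fine-tuning.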