First-Person Action Recognition With Temporal Pooling and Hilbert-Huang Transform

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 21, No. 12, pp. 3122-3135
Main Authors: Purwanto, Didik; Chen, Yie-Tarng; Fang, Wen-Hsien
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2019
Summary: This paper presents a convolutional neural network (CNN)-based approach for first-person action recognition that combines temporal pooling with the Hilbert-Huang transform (HHT). The new approach first adaptively performs temporal sub-action localization, treats each channel of the extracted trajectory-pooled CNN features as a time series, and summarizes the temporal dynamic information in each sub-action by temporal pooling. The temporal evolution across sub-actions is then modeled by rank pooling. Thereafter, to account for the highly dynamic scene changes in first-person videos, the HHT is employed to decompose the rank-pooled features into a finite, and often small, number of data-dependent functions, called intrinsic mode functions (IMFs), through empirical mode decomposition. Hilbert spectral analysis is then applied to each IMF component, and four salient descriptors are selected and aggregated into the final video descriptor. Such a framework can not only precisely capture both long- and short-term tendencies, but also cope with the significant camera motion in first-person videos, yielding better accuracy. Furthermore, it works well for complex actions with limited training samples. Simulations show that the proposed approach outperforms the main state-of-the-art methods on four publicly available first-person video datasets.
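The HHT stage summarized above can be illustrated with a small sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the third-party PyEMD package for empirical mode decomposition and SciPy's Hilbert transform, treats one rank-pooled feature channel as a 1-D time series, decomposes it into IMFs, and summarizes each IMF's instantaneous amplitude and frequency with simple statistics. The four statistics per IMF are illustrative placeholders, not the paper's exact descriptors.

```python
# Minimal sketch of the HHT stage: EMD on one rank-pooled feature channel,
# then Hilbert spectral analysis of each IMF.
# Assumptions: the PyEMD package (pip install EMD-signal) and SciPy are used;
# the statistics per IMF are illustrative, not the paper's four descriptors.
import numpy as np
from scipy.signal import hilbert
from PyEMD import EMD


def hilbert_huang_descriptors(channel: np.ndarray, fs: float = 1.0) -> np.ndarray:
    """Decompose a feature-channel time series into IMFs and summarize each
    IMF's Hilbert spectrum with simple amplitude/frequency statistics."""
    imfs = EMD().emd(channel)                  # data-dependent intrinsic mode functions
    descriptors = []
    for imf in imfs:
        analytic = hilbert(imf)                # analytic signal via Hilbert transform
        amplitude = np.abs(analytic)           # instantaneous amplitude envelope
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) / (2.0 * np.pi) * fs  # instantaneous frequency
        # Placeholder per-IMF statistics (illustrative only)
        descriptors.extend([amplitude.mean(), amplitude.std(),
                            inst_freq.mean(), inst_freq.std()])
    return np.asarray(descriptors)


# Usage: one channel of rank-pooled CNN features treated as a time series
if __name__ == "__main__":
    t = np.linspace(0.0, 10.0, 200)
    channel = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.random.randn(t.size)
    print(hilbert_huang_descriptors(channel, fs=20.0).shape)
```

In this sketch the per-IMF statistics are simply concatenated; the paper instead aggregates four salient descriptors derived from the Hilbert spectrum into the final video descriptor.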
ISSN: 1520-9210
EISSN: 1941-0077
DOI: 10.1109/TMM.2019.2919434