First-Person Action Recognition With Temporal Pooling and Hilbert-Huang Transform

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 21, No. 12, pp. 3122-3135
Main Authors: Purwanto, Didik; Chen, Yie-Tarng; Fang, Wen-Hsien
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2019
Summary: This paper presents a convolutional neural network (CNN)-based approach for first-person action recognition that combines temporal pooling with the Hilbert-Huang transform (HHT). The new approach first adaptively performs temporal sub-action localization, treats each channel of the extracted trajectory-pooled CNN features as a time series, and summarizes the temporal dynamic information in each sub-action by temporal pooling. The temporal evolution across sub-actions is then modeled by rank pooling. Thereafter, to account for the highly dynamic scene changes in first-person videos, the HHT is employed to decompose the rank-pooled features into a finite, and often small, number of data-dependent functions, called intrinsic mode functions (IMFs), through empirical mode decomposition. Hilbert spectral analysis is then applied to each IMF component, and four salient descriptors are selected and aggregated into the final video descriptor. Such a framework can not only precisely capture both long- and short-term tendencies, but also cope with the significant camera motion in first-person videos, yielding better accuracy. Furthermore, it works well for complex actions with limited training samples. Simulations show that the proposed approach outperforms the main state-of-the-art methods on four publicly available first-person video datasets.
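The HHT stage summarized above can be illustrated with a small sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the third-party PyEMD package for empirical mode decomposition and SciPy's Hilbert transform, treats one rank-pooled feature channel as a 1-D time series, decomposes it into IMFs, and summarizes each IMF's instantaneous amplitude and frequency with simple statistics. The four statistics per IMF are illustrative placeholders, not the paper's exact descriptors.

```python
# Minimal sketch of the HHT stage: EMD on one rank-pooled feature channel,
# then Hilbert spectral analysis of each IMF.
# Assumptions: the PyEMD package (pip install EMD-signal) and SciPy are used;
# the statistics per IMF are illustrative, not the paper's four descriptors.
import numpy as np
from scipy.signal import hilbert
from PyEMD import EMD


def hilbert_huang_descriptors(channel: np.ndarray, fs: float = 1.0) -> np.ndarray:
    """Decompose a feature-channel time series into IMFs and summarize each
    IMF's Hilbert spectrum with simple amplitude/frequency statistics."""
    imfs = EMD().emd(channel)                  # data-dependent intrinsic mode functions
    descriptors = []
    for imf in imfs:
        analytic = hilbert(imf)                # analytic signal via Hilbert transform
        amplitude = np.abs(analytic)           # instantaneous amplitude envelope
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) / (2.0 * np.pi) * fs  # instantaneous frequency
        # Placeholder per-IMF statistics (illustrative only)
        descriptors.extend([amplitude.mean(), amplitude.std(),
                            inst_freq.mean(), inst_freq.std()])
    return np.asarray(descriptors)


# Usage: one channel of rank-pooled CNN features treated as a time series
if __name__ == "__main__":
    t = np.linspace(0.0, 10.0, 200)
    channel = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.random.randn(t.size)
    print(hilbert_huang_descriptors(channel, fs=20.0).shape)
```

In this sketch the per-IMF statistics are simply concatenated; the paper instead aggregates four salient descriptors derived from the Hilbert spectrum into the final video descriptor.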
ISSN: 1520-9210
EISSN: 1941-0077
DOI: 10.1109/TMM.2019.2919434