Beyond Frame-level CNN: Saliency-Aware 3-D CNN With LSTM for Video Action Recognition
Published in | IEEE Signal Processing Letters, Vol. 24, No. 4, pp. 510-514 |
---|---|
Format | Journal Article |
Language | English |
Published | IEEE, 01.04.2017 |
Summary | Human activity recognition in videos with convolutional neural network (CNN) features has received increasing attention in multimedia understanding. Taking videos as a sequence of frames, a new record was recently set on several benchmark datasets by feeding frame-level CNN sequence features to a long short-term memory (LSTM) model for video activity recognition. This recurrent visual recognition pipeline is a natural choice for perceptual problems with time-varying visual input or sequential outputs. However, the above pipeline feeds frame-level CNN sequence features to the LSTM, which may fail to capture the rich motion information between adjacent frames or across multiple clips. Furthermore, an activity is conducted by one or more subjects, so it is important to use attention that focuses on salient features, instead of mapping an entire frame into a static representation. To tackle these issues, we propose a novel pipeline, saliency-aware three-dimensional (3-D) CNN with LSTM, for video action recognition that integrates LSTM with saliency-aware deep 3-D CNN features computed on video shots. Specifically, we first apply saliency-aware methods to generate saliency-aware videos. Then, we design an end-to-end pipeline by integrating a 3-D CNN with an LSTM, followed by a time-series pooling layer and a softmax layer to predict the activities. Notably, we set a new record on two benchmark datasets, i.e., UCF101 with 13,320 videos and HMDB-51 with 6,766 videos; our method outperforms the state-of-the-art end-to-end action recognition methods by 3.8% and 3.2%, respectively, on these two datasets. |
ISSN | 1070-9908, 1558-2361 |
DOI | 10.1109/LSP.2016.2611485 |
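
The summary describes an end-to-end pipeline: saliency-aware video shots are fed to a 3-D CNN, the per-shot features pass through an LSTM, a time-series pooling layer, and a softmax classifier. Below is a minimal sketch of that data flow, assuming PyTorch, with torchvision's `r3d_18` as a stand-in 3-D backbone, mean pooling as the temporal pooling, and hypothetical layer sizes; it is not the authors' implementation and omits the saliency-detection step that produces the saliency-aware shots.

```python
# Sketch of the described pipeline: 3-D CNN per shot -> LSTM over shots ->
# temporal pooling -> classifier. Backbone choice, hidden size, and mean
# pooling are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


class Saliency3DCNNLSTM(nn.Module):
    def __init__(self, num_classes=101, hidden_size=512):
        super().__init__()
        backbone = r3d_18(weights=None)  # stand-in 3-D CNN feature extractor
        # Drop the final fc layer; keep stem, residual stages, and avg pooling.
        self.cnn3d = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, shots):
        # shots: (batch, num_shots, channels, frames, height, width);
        # each shot is assumed to be a saliency-aware clip.
        b, s = shots.shape[:2]
        feats = self.cnn3d(shots.flatten(0, 1)).flatten(1)  # (b*s, 512)
        feats = feats.view(b, s, -1)                         # (b, s, 512)
        hidden, _ = self.lstm(feats)                         # (b, s, hidden)
        pooled = hidden.mean(dim=1)                          # time-series (mean) pooling
        return self.classifier(pooled)                       # logits; softmax in the loss


# Usage: 2 videos, 4 shots each, 16 frames per shot at 112x112.
model = Saliency3DCNNLSTM(num_classes=101)
logits = model(torch.randn(2, 4, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 101])
```

The classifier returns logits and relies on a cross-entropy loss to apply the softmax during training; an explicit `nn.Softmax(dim=1)` could be appended at inference time to match the softmax layer named in the summary.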