VideoLSTM convolves, attends and flows for action recognition

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 166, pp. 41-50
Main Authors: Li, Zhenyang; Gavrilyuk, Kirill; Gavves, Efstratios; Jain, Mihir; Snoek, Cees G.M.
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.01.2018

Summary:
•To exploit both the spatial and temporal correlations in a video, we hardwire convolutions in the soft-Attention LSTM architecture.
•We introduce motion-based attention, which better guides the attention towards the relevant spatio-temporal locations of the actions.
•We demonstrate how the attention generated by our VideoLSTM can be used for action localization by relying on the action class label only.
•We show the theoretical as well as practical merits of our VideoLSTM against other LSTM architectures for action classification and localization.

We present VideoLSTM for end-to-end sequence learning of actions in video. Rather than adapting the video to the peculiarities of established recurrent or convolutional architectures, we adapt the architecture to fit the requirements of the video medium. Starting from the soft-Attention LSTM, VideoLSTM makes three novel contributions. First, video has a spatial layout; to exploit the spatial correlation we hardwire convolutions in the soft-Attention LSTM architecture. Second, motion not only informs us about the action content, but also better guides the attention towards the relevant spatio-temporal locations, so we introduce motion-based attention. Finally, we demonstrate how the attention from VideoLSTM can be exploited for action localization by relying on the action class label and temporal attention smoothing. Experiments on UCF101, HMDB51 and THUMOS13 reveal the benefit of the video-specific adaptations of VideoLSTM in isolation as well as when integrated in a combined architecture. It compares favorably against other LSTM architectures for action classification and especially action localization.
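To illustrate the two architectural ideas described in the summary (convolutional gates in an attention LSTM and motion-informed spatial attention), the following is a minimal sketch in PyTorch, not the authors' code. Tensor shapes, layer names and the way the motion feature map m_t (e.g. derived from optical flow) enters the attention are illustrative assumptions made here for clarity only.

# Minimal sketch (assumed shapes/names, not the published implementation) of one
# convolutional soft-attention LSTM step with motion-based attention.
# x_t: appearance feature map (B, C, H, W); m_t: motion feature map of the same
# shape; h_prev/c_prev: recurrent state maps (B, hidden, H, W).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttentionLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution jointly produces the input, forget, output and candidate gates,
        # so the recurrence preserves the spatial layout of the video features.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)
        # Attention scores computed from appearance, motion and the previous hidden state.
        self.attn = nn.Conv2d(2 * in_channels + hidden_channels, 1,
                              kernel_size, padding=padding)

    def forward(self, x_t, m_t, h_prev, c_prev):
        # Motion-informed soft attention: softmax over all H*W spatial locations.
        scores = self.attn(torch.cat([x_t, m_t, h_prev], dim=1))
        b, _, hgt, wdt = scores.shape
        alpha = F.softmax(scores.view(b, -1), dim=1).view(b, 1, hgt, wdt)
        x_att = alpha * x_t  # attended appearance features keep their spatial layout

        gates = self.gates(torch.cat([x_att, h_prev], dim=1))
        i, f, o, g = gates.chunk(4, dim=1)
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_t = torch.sigmoid(o) * torch.tanh(c_t)
        # alpha is the per-frame attention map that, smoothed over time, could be
        # thresholded for class-label-only action localization as the abstract describes.
        return h_t, c_t, alpha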
ISSN: 1077-3142; 1090-235X
DOI: 10.1016/j.cviu.2017.10.011