Human action recognition using two-stream attention based LSTM networks

Bibliographic Details
Published in Applied Soft Computing, Vol. 86, p. 105820
Main Authors: Dai, Cheng; Liu, Xingang; Lai, Jinfeng
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.01.2020
More Information
Summary: It is well known that different frames play different roles in feature learning for video-based human action recognition. However, most existing deep learning models assign the same weight to different visual and temporal cues during parameter training, which severely limits how well discriminative features can be identified. To address this problem, this paper utilizes the visual attention mechanism and proposes an end-to-end two-stream attention-based LSTM network. The network selectively focuses on the effective features of the original input images and pays different levels of attention to the outputs of each deep feature map. Moreover, considering the correlation between the two deep feature streams, a deep feature correlation layer is proposed to adjust the network parameters based on the correlation judgement. Finally, we evaluate our approach on three different datasets, and the experimental results show that our proposal achieves state-of-the-art performance in common scenarios.
Highlights:
•We propose a two-stream attention-based LSTM architecture for action recognition in videos, which effectively addresses the problem of ignored visual attention.
•Because the aggregation process can lose information when the correlation between the two streams is not considered, we propose a correlation network layer that can identify the information loss at each time step over the entire video.
•Extensive experiments on three datasets show that our proposal achieves state-of-the-art performance in common scenarios.
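To make the summarized architecture more concrete, the following is a minimal, hypothetical PyTorch sketch of a two-stream attention-based LSTM with a simple correlation-weighted fusion step. Every module name, feature dimension, and the cosine-similarity weighting used here is an illustrative assumption inferred only from the summary above, not the authors' implementation.

# Hypothetical sketch of a two-stream attention-based LSTM for action recognition.
# All names, dimensions, and the correlation weighting are assumptions made for
# illustration; they do not reproduce the paper's actual network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTMStream(nn.Module):
    """One stream: soft attention over per-frame feature maps, then an LSTM."""
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)   # scores each spatial location
        self.lstm_cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, feats):
        # feats: (batch, time, locations, feat_dim), e.g. CNN feature maps per frame
        b, t, l, _ = feats.shape
        h = feats.new_zeros(b, self.hidden_dim)
        c = feats.new_zeros(b, self.hidden_dim)
        outputs = []
        for step in range(t):
            f = feats[:, step]                                             # (b, l, d)
            # attention scores conditioned on the previous hidden state
            h_exp = h.unsqueeze(1).expand(b, l, self.hidden_dim)
            scores = self.attn(torch.cat([f, h_exp], dim=-1)).squeeze(-1)  # (b, l)
            alpha = F.softmax(scores, dim=-1)                # attention weights
            context = (alpha.unsqueeze(-1) * f).sum(dim=1)   # weighted feature (b, d)
            h, c = self.lstm_cell(context, (h, c))
            outputs.append(h)
        return torch.stack(outputs, dim=1)                   # (b, t, hidden_dim)

class TwoStreamAttentionLSTM(nn.Module):
    """Fuses an RGB stream and a flow stream with a correlation-based weighting."""
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=101):
        super().__init__()
        self.rgb_stream = AttentionLSTMStream(feat_dim, hidden_dim)
        self.flow_stream = AttentionLSTMStream(feat_dim, hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, rgb_feats, flow_feats):
        h_rgb = self.rgb_stream(rgb_feats)                   # (b, t, hidden)
        h_flow = self.flow_stream(flow_feats)
        # Per-time-step cosine similarity between the two streams, used here as a
        # stand-in for the paper's correlation layer when weighting time steps.
        corr = F.cosine_similarity(h_rgb, h_flow, dim=-1)    # (b, t)
        w = F.softmax(corr, dim=-1).unsqueeze(-1)            # (b, t, 1)
        pooled = torch.cat([(w * h_rgb).sum(1), (w * h_flow).sum(1)], dim=-1)
        return self.classifier(pooled)

if __name__ == "__main__":
    model = TwoStreamAttentionLSTM()
    rgb = torch.randn(2, 8, 49, 512)     # 2 clips, 8 frames, 7x7 feature maps
    flow = torch.randn(2, 8, 49, 512)
    print(model(rgb, flow).shape)        # torch.Size([2, 101])

In this sketch, each stream applies soft attention over the spatial locations of per-frame CNN feature maps before the LSTM update, and the per-time-step similarity between the two streams' hidden states weights the temporal pooling; the real correlation layer and training details are described only in the full paper.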
ISSN: 1568-4946, 1872-9681
DOI: 10.1016/j.asoc.2019.105820