Human action recognition using two-stream attention based LSTM networks
Published in | Applied Soft Computing, Vol. 86, p. 105820 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | Elsevier B.V., 01.01.2020 |
Summary: | It is well known that different frames play different roles in feature learning for video-based human action recognition. However, most existing deep learning models assign the same weight to different visual and temporal cues during parameter training, which severely limits feature discrimination. To address this problem, this paper exploits the visual attention mechanism and proposes an end-to-end two-stream attention-based LSTM network. The network selectively focuses on the effective features of the original input images and pays different levels of attention to the output of each deep feature map. Moreover, considering the correlation between the two deep feature streams, a deep feature correlation layer is proposed to adjust the network parameters based on a correlation judgement. Finally, we evaluate our approach on three different datasets, and the experimental results show that our proposal achieves state-of-the-art performance in common scenarios. |
Highlights: |
• We propose a two-stream attention-based LSTM architecture for action recognition in videos, which effectively resolves the problem of ignored visual attention.
• Considering that the aggregation process can lose information because it lacks correlation information between the two streams, we propose a correlation network layer that identifies the information loss at each time stamp over the entire video.
• Experiments on three datasets show that our proposal achieves state-of-the-art performance in common scenarios. |
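To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' code): each stream applies soft attention over per-frame CNN feature-map locations before an LSTM cell, and a stand-in correlation term (per-timestep cosine similarity between the two hidden-state sequences) couples the streams. All layer sizes, the cosine-similarity choice, and the concatenation-based fusion are assumptions for illustration only.

```python
# Hypothetical sketch of a two-stream attention LSTM; details are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTMStream(nn.Module):
    """One stream: soft attention over per-frame CNN feature-map locations, then an LSTM cell."""
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)   # scores each spatial location
        self.lstm_cell = nn.LSTMCell(feat_dim, hidden_dim)

    def forward(self, feats):
        # feats: (batch, time, locations, feat_dim), e.g. 7x7 = 49 locations of a CNN feature map
        b, t, l, d = feats.shape
        h = feats.new_zeros(b, self.lstm_cell.hidden_size)
        c = feats.new_zeros(b, self.lstm_cell.hidden_size)
        outputs = []
        for step in range(t):
            x = feats[:, step]                                      # (b, l, d)
            # attention weights conditioned on the previous hidden state
            h_exp = h.unsqueeze(1).expand(-1, l, -1)                # (b, l, hidden)
            scores = self.attn(torch.cat([x, h_exp], dim=-1))       # (b, l, 1)
            alpha = F.softmax(scores, dim=1)
            context = (alpha * x).sum(dim=1)                        # attended frame feature (b, d)
            h, c = self.lstm_cell(context, (h, c))
            outputs.append(h)
        return torch.stack(outputs, dim=1)                          # (b, t, hidden)

class TwoStreamAttentionLSTM(nn.Module):
    """RGB stream plus flow stream; a simple correlation score couples the two."""
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=101):
        super().__init__()
        self.rgb_stream = AttentionLSTMStream(feat_dim, hidden_dim)
        self.flow_stream = AttentionLSTMStream(feat_dim, hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, rgb_feats, flow_feats):
        h_rgb = self.rgb_stream(rgb_feats)                          # (b, t, hidden)
        h_flow = self.flow_stream(flow_feats)
        # stand-in for the paper's correlation layer: per-timestep cosine similarity
        corr = F.cosine_similarity(h_rgb, h_flow, dim=-1).mean()
        fused = torch.cat([h_rgb.mean(dim=1), h_flow.mean(dim=1)], dim=-1)
        return self.classifier(fused), corr
```

In training, the returned correlation score could be added to the classification loss as an auxiliary term so that inconsistent streams are penalized; the paper's actual correlation layer and its parameter-adjustment rule may differ from this sketch.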
ISSN: | 1568-4946, 1872-9681 |
DOI: | 10.1016/j.asoc.2019.105820 |