Human action recognition using multi-stream attention-based deep networks with heterogeneous data from overlapping sub-actions

Bibliographic Details
Published in: Neural Computing & Applications, Vol. 36, No. 18, pp. 10681-10697
Main Authors: Rashmi M; Ram Mohana Reddy Guddeti
Format: Journal Article
Language: English
Published: London: Springer London; Springer Nature B.V., 01.06.2024

Summary: Vision-based Human Action Recognition is difficult owing to variations in the same action performed by different people, temporal variations within actions, and differences in viewing angle. Researchers have recently adopted multi-modal visual data fusion strategies to address the limitations of single-modality methodologies. Because the success of most existing techniques relies on the feature representation of the data modality under consideration, many researchers strive to produce more discriminative features. A human action consists of several sub-actions whose durations vary between individuals. This paper proposes a multifarious learning framework employing action data in depth and skeleton formats. Firstly, a novel action representation named Multiple Sub-action Enhanced Depth Motion Map (MS-EDMM), which integrates depth features from overlapping sub-actions, is proposed. Secondly, an efficient method is introduced for extracting spatio-temporal features from skeleton data by dividing the skeleton sequence into sub-actions and summarizing skeleton joint information for five distinct human body regions. Next, a multi-stream deep learning model with an attention-guided CNN and a residual LSTM is proposed for classification, followed by several score fusion operations to reap the benefits of streams trained on multiple data types. The proposed method outperformed an existing method that utilized skeleton and depth data by 1.62%, achieving an accuracy of 89.76% on the single-view UTD-MHAD dataset. Furthermore, it demonstrated encouraging performance on the multi-view NTU RGB+D dataset, with an accuracy of 89.75% in cross-view and 83.8% in cross-subject evaluations.
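To make the pipeline in the summary concrete, below is a minimal Python sketch of two of the ideas it describes: building a multi-channel depth-motion representation from overlapping sub-action windows (in the spirit of MS-EDMM) and late score fusion across streams. The function names, window counts, and use of NumPy are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

def depth_motion_map(frames):
    """Collapse a (T, H, W) depth clip into one 2-D map of accumulated
    absolute inter-frame differences (a basic depth motion map)."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diffs.sum(axis=0)

def overlapping_windows(num_frames, num_windows=3, overlap=0.5):
    """Yield (start, end) frame indices of overlapping sub-action windows."""
    # Window length chosen so num_windows windows with the given overlap
    # tile the sequence; integer rounding keeps the indices valid.
    win = int(num_frames / (num_windows - (num_windows - 1) * overlap))
    step = max(1, int(win * (1 - overlap)))
    for start in range(0, num_frames - win + 1, step):
        yield start, start + win

def multi_subaction_dmm(frames, num_windows=3, overlap=0.5):
    """Stack one depth motion map per overlapping sub-action into a
    multi-channel image, usable as input to a 2-D CNN stream."""
    maps = [depth_motion_map(frames[s:e])
            for s, e in overlapping_windows(len(frames), num_windows, overlap)]
    return np.stack(maps, axis=-1)  # shape: (H, W, num_windows)

def fuse_scores(stream_scores, weights=None):
    """Late score fusion: weighted average of per-stream class probabilities."""
    scores = np.stack(stream_scores)  # (num_streams, num_classes)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    return np.average(scores, axis=0, weights=weights)

# Toy usage: 64 depth frames; UTD-MHAD has 27 action classes.
depth_clip = np.random.rand(64, 240, 320)
cnn_input = multi_subaction_dmm(depth_clip)   # (240, 320, 3)
fused = fuse_scores([np.random.dirichlet(np.ones(27)) for _ in range(3)])
predicted_class = int(fused.argmax())
```

Because the windows overlap, frames near sub-action boundaries contribute to more than one channel, which is one way motion spanning those boundaries can be preserved; in practice the fusion weights could be tuned or learned per stream rather than fixed to a uniform average as here.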
ISSN: 0941-0643
eISSN: 1433-3058
DOI: 10.1007/s00521-024-09630-0