Attention-based network for effective action recognition from multi-view video
Published in: Procedia Computer Science, Vol. 192, pp. 971-980
Format: Journal Article
Language: English
Publisher: Elsevier B.V., 2021
ISSN: 1877-0509
DOI: 10.1016/j.procs.2021.08.100
Summary: A human action recognition system is affected by many challenges, such as background clutter, partial occlusion, lighting, viewpoint, and execution rate. Using complementary information from different views can mitigate the problems caused by view changes and occlusion. However, how to effectively integrate the information from multi-view images remains an open question. In this paper, we propose an effective approach for multi-view human action recognition. The approach is based on an attention mechanism that passes discriminative features between views. It is designed as a multi-branch network in which each branch is responsible for extracting a view-specific feature. Furthermore, we build a cross-view attention module that enhances action recognition by transferring knowledge between views (branches). Experiments on three datasets show that the proposed solution works effectively in different scenarios. Our models achieve the best results on two datasets (NUMA and MicaHandGesture) for both cross-subject and cross-view evaluations. On the NUMA dataset, the accuracy of our best models reaches 99.56% and 92.74% in the cross-subject and cross-view evaluation scenarios, respectively. On the MicaHandGesture dataset, the accuracies are 99.06% and 91.71% in the two scenarios, respectively. These results surpass previous works such as Multi-Branch TSN with GRU [5] (93.81% in cross-subject evaluation and 84.4% in cross-view evaluation on NUMA) and DA-Net [31] (92.1% in cross-subject evaluation (video-level) and 84.2% in cross-view evaluation on the NUMA dataset). We also obtained very promising results on the large-scale NTU RGB+D dataset.
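The abstract describes a multi-branch network in which each branch extracts a view-specific feature and a cross-view attention module transfers knowledge between branches. The sketch below shows one way such an architecture could be realized for two views; it is a minimal illustration, not the authors' implementation. The ResNet-18 backbones, the scaled dot-product attention, and all class and parameter names (`CrossViewAttention`, `MultiViewActionNet`, `feat_dim`) are assumptions made for the example.

```python
# Minimal sketch of a two-branch action recognition network with a cross-view
# attention module, assuming per-view CNN backbones and dot-product attention
# that lets each branch attend to the other branch's frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class CrossViewAttention(nn.Module):
    """Re-weights one view's features using attention computed against another view."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_self, x_other):
        # x_self, x_other: (batch, time, dim) frame-level features from two views
        q = self.query(x_self)
        k = self.key(x_other)
        v = self.value(x_other)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (batch, time, time)
        # Residual connection: keep the view-specific feature, add attended context
        return x_self + attn @ v


class MultiViewActionNet(nn.Module):
    """Two view-specific branches whose features are exchanged via cross-view attention."""

    def __init__(self, num_classes, feat_dim=512):
        super().__init__()

        def backbone():
            # ResNet-18 without the final fc layer as a generic frame feature extractor
            return nn.Sequential(*list(models.resnet18(weights=None).children())[:-1])

        self.branch_a = backbone()
        self.branch_b = backbone()
        self.cross_ab = CrossViewAttention(feat_dim)
        self.cross_ba = CrossViewAttention(feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def extract(self, branch, clip):
        # clip: (batch, time, 3, H, W) -> frame-level features (batch, time, feat_dim)
        b, t, c, h, w = clip.shape
        feats = branch(clip.reshape(b * t, c, h, w)).flatten(1)
        return feats.reshape(b, t, -1)

    def forward(self, view_a, view_b):
        fa = self.extract(self.branch_a, view_a)
        fb = self.extract(self.branch_b, view_b)
        fa = self.cross_ab(fa, fb)  # view A attends to view B
        fb = self.cross_ba(fb, fa)  # view B attends to the updated view A
        # Temporal average pooling, then fuse both views for classification
        fused = torch.cat([fa.mean(dim=1), fb.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = MultiViewActionNet(num_classes=10)
    a = torch.randn(2, 8, 3, 224, 224)  # two clips of 8 frames from view A
    b = torch.randn(2, 8, 3, 224, 224)  # synchronized clips from view B
    print(model(a, b).shape)  # torch.Size([2, 10])
```

In this sketch each view keeps its own backbone, so view-specific features are preserved, and the cross-view attention adds context from the other branch through a residual connection before the fused representation is classified; how the actual paper fuses branches and parameterizes the attention may differ.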