Attention-based network for effective action recognition from multi-view video
Published in: Procedia Computer Science, Vol. 192, pp. 971-980
Format: Journal Article
Language: English
Publisher: Elsevier B.V., 2021
ISSN: 1877-0509
DOI: 10.1016/j.procs.2021.08.100
Summary: A human action recognition system is affected by many challenges, such as background clutter, partial occlusion, lighting, viewpoint, and execution rate. Using complementary information from different views can mitigate the problems caused by view changes and occlusion. However, how to effectively integrate the information from multi-view images remains an open question. In this paper, we propose an effective approach for multi-view human action recognition. The approach is based on an attention mechanism that passes discriminative features between views. It is designed as a multi-branch network in which each branch is responsible for extracting a view-specific feature. Furthermore, we build a cross-view attention module that enhances action recognition by transferring knowledge between views (branches). Experiments on three datasets show that the proposed solution works effectively in different scenarios. Our models achieve the best results on two datasets (NUMA and MicaHandGesture) for both cross-subject and cross-view evaluations. On the NUMA dataset, the accuracy of our best models reaches 99.56% and 92.74% in the cross-subject and cross-view evaluation scenarios, respectively. On the MicaHandGesture dataset, the accuracies are 99.06% and 91.71% in the two scenarios, respectively. These results surpass previous works such as Multi-Branch TSN with GRU [5] (93.81% in cross-subject evaluation and 84.4% in cross-view evaluation on NUMA) and DA-Net [31] (92.1% in cross-subject evaluation (video-level) and 84.2% in cross-view evaluation on the NUMA dataset). We also obtained very promising results on the large-scale NTU RGB+D dataset.
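The abstract describes a multi-branch network in which each branch extracts a view-specific feature and a cross-view attention module transfers knowledge between branches. The sketch below shows one way such an architecture could be realized for two views; it is a minimal illustration, not the authors' implementation. The ResNet-18 backbones, the scaled dot-product attention, and all class and parameter names (`CrossViewAttention`, `MultiViewActionNet`, `feat_dim`) are assumptions made for the example.

```python
# Minimal sketch of a two-branch action recognition network with a cross-view
# attention module, assuming per-view CNN backbones and dot-product attention
# that lets each branch attend to the other branch's frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class CrossViewAttention(nn.Module):
    """Re-weights one view's features using attention computed against another view."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_self, x_other):
        # x_self, x_other: (batch, time, dim) frame-level features from two views
        q = self.query(x_self)
        k = self.key(x_other)
        v = self.value(x_other)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (batch, time, time)
        # Residual connection: keep the view-specific feature, add attended context
        return x_self + attn @ v


class MultiViewActionNet(nn.Module):
    """Two view-specific branches whose features are exchanged via cross-view attention."""

    def __init__(self, num_classes, feat_dim=512):
        super().__init__()

        def backbone():
            # ResNet-18 without the final fc layer as a generic frame feature extractor
            return nn.Sequential(*list(models.resnet18(weights=None).children())[:-1])

        self.branch_a = backbone()
        self.branch_b = backbone()
        self.cross_ab = CrossViewAttention(feat_dim)
        self.cross_ba = CrossViewAttention(feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def extract(self, branch, clip):
        # clip: (batch, time, 3, H, W) -> frame-level features (batch, time, feat_dim)
        b, t, c, h, w = clip.shape
        feats = branch(clip.reshape(b * t, c, h, w)).flatten(1)
        return feats.reshape(b, t, -1)

    def forward(self, view_a, view_b):
        fa = self.extract(self.branch_a, view_a)
        fb = self.extract(self.branch_b, view_b)
        fa = self.cross_ab(fa, fb)  # view A attends to view B
        fb = self.cross_ba(fb, fa)  # view B attends to the updated view A
        # Temporal average pooling, then fuse both views for classification
        fused = torch.cat([fa.mean(dim=1), fb.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = MultiViewActionNet(num_classes=10)
    a = torch.randn(2, 8, 3, 224, 224)  # two clips of 8 frames from view A
    b = torch.randn(2, 8, 3, 224, 224)  # synchronized clips from view B
    print(model(a, b).shape)  # torch.Size([2, 10])
```

In this sketch each view keeps its own backbone, so view-specific features are preserved, and the cross-view attention adds context from the other branch through a residual connection before the fused representation is classified; how the actual paper fuses branches and parameterizes the attention may differ.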