Consistent constraint-based video-level learning for action recognition

Bibliographic Details
Published in: EURASIP Journal on Image and Video Processing, Vol. 2020, no. 1, pp. 1-14
Main Authors: Shi, Qinghongya; Zhang, Hong-Bo; Ren, Hao-Tian; Du, Ji-Xiang; Lei, Qing
Format: Journal Article
Language: English
Published: Cham: Springer International Publishing, 31.08.2020 (Springer Nature B.V.; SpringerOpen)

Summary: This paper proposes a new neural network learning method to improve action recognition performance in video. Most human action recognition methods use a clip-level training strategy, which divides a video into multiple clips and trains the feature learning network by minimizing a clip classification loss; the video category is then predicted by voting over clips from the same video. To obtain more effective action features, a new video-level feature learning method is proposed to train a 3D CNN and boost action recognition performance. Unlike clip-level training, which uses individual clips as input, the video-level learning network takes the entire video as input. A consistent constraint loss is defined to minimize the distance between clips of the same video in the voting space, and a video-level loss function is defined to compute the video classification error. The experimental results show that the proposed video-level training is a more effective action feature learning approach than clip-level training, and it achieves state-of-the-art performance on the UCF101 and HMDB51 datasets without using models pre-trained on other large-scale datasets. The code and final model are available at https://github.com/hqu-cst-mmc/VLL .
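The abstract describes two loss terms: a video-level classification loss computed on the aggregated (voted) prediction over all clips of a video, and a consistent constraint loss that pulls the per-clip predictions of the same video together in the voting space. The sketch below is one plausible PyTorch formulation of these two terms; the function name, the weighting factor lambda_cc, and the squared-distance consistency measure are assumptions for illustration only, not the authors' released implementation (see the linked repository for the actual code).

import torch
import torch.nn.functional as F

def video_level_losses(clip_logits, video_label, lambda_cc=0.1):
    """Hypothetical sketch of the video-level objective.

    clip_logits: (num_clips, num_classes) logits for all clips of ONE video.
    video_label: long tensor holding the video's class index.
    Returns: video-level classification loss + weighted consistent constraint loss.
    """
    # Voting space: per-clip class probabilities.
    clip_probs = F.softmax(clip_logits, dim=1)           # (C, K)
    video_probs = clip_probs.mean(dim=0, keepdim=True)   # (1, K) aggregated video vote

    # Video-level classification loss on the aggregated prediction.
    video_cls_loss = F.nll_loss(torch.log(video_probs + 1e-8),
                                video_label.view(1))

    # Consistent constraint: minimize the distance between clips of the same
    # video in the voting space (here, squared distance to the video-level vote).
    consistency_loss = ((clip_probs - video_probs) ** 2).sum(dim=1).mean()

    return video_cls_loss + lambda_cc * consistency_loss

# Example usage with any 3D CNN backbone mapping clips to class logits:
#   clips  = sample_clips(video)            # (num_clips, 3, T, H, W), hypothetical helper
#   logits = backbone(clips)                # (num_clips, num_classes)
#   loss   = video_level_losses(logits, label)
#   loss.backward()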
ISSN: 1687-5281; 1687-5176
DOI: 10.1186/s13640-020-00519-1