Lightweight Violence Detection Model Based on 2D CNN with Bi-Directional Motion Attention

With the widespread deployment of surveillance cameras, automatic violence detection has attracted extensive attention from industry and academia. Though researchers have made great progress in video-based violence detection, it is still a challenging task to realize accurate violence detection in r...

Full description

Saved in:

Bibliographic Details
Published in	Applied sciences Vol. 14; no. 11; p. 4895
Main Authors	Wang, Jingwen, Zhao, Daqi, Li, Haoming, Wang, Deqiang
Format	Journal Article
Language	English
Published	Basel MDPI AG 01.06.2024
Subjects	Accuracy Algorithms attention mechanism Behavior Cameras Classification Datasets lightweight Neural networks Performance evaluation Support vector machines Surveillance temporal modeling Violence violence detection
Online Access	Get full text

Cover

Loading…

More Information
Summary:	With the widespread deployment of surveillance cameras, automatic violence detection has attracted extensive attention from industry and academia. Though researchers have made great progress in video-based violence detection, it is still a challenging task to realize accurate violence detection in real time, especially with limited computing resources. In this paper, we propose a lightweight 2D CNN-based violence detection scheme, which takes advantage of frame-grouping to reduce data redundancy greatly and, meanwhile, enable short-term temporal modeling. In particular, a lightweight 2D CNN, named improved EfficientNet-B0, is constructed by integrating our proposed bi-directional long-term motion attention (Bi-LTMA) module and a temporal shift module (TSM) into the original EfficientNet-B0. The Bi-LTMA takes both spatial and channel dimensions into consideration and captures motion features in both forward and backward directions. The TSM is adopted to realize temporal feature interaction. Moreover, an auxiliary classifier is designed and employed to improve the classification capability and generalization performance of the proposed model. Experiment results demonstrate that the computational cost of the proposed model is 1.21 GFLOPS. Moreover, the proposed scheme achieves accuracies of 100%, 98.5%, 91.67%, and 90.25% on the Movie Fight dataset, the Hockey Fight dataset, the Surveillance Camera dataset, and the RWF-2000 dataset, respectively.
Bibliography:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ISSN:	2076-3417 2076-3417
DOI:	10.3390/app14114895