PAX-Net: Multimodal Campus Violence Detection: An AI-Powered Multi-Modal Model for Identifying and Preventing School Violence

Bibliographic Details
Published in: 2024 International Conference on Artificial Intelligence and Digital Libraries (AIDL), pp. 70–74
Main Authors: Shen, Cheng; Zhou, Zhurong
Format: Conference Proceeding
Language: English
Published: IEEE, 13.12.2024

Summary: In recent years, campus violence has become increasingly serious, threatening campus safety and negatively affecting students' mental health and academic performance. Campus violence includes physical, psychological, and sexual violence, with physical violence often accompanied by bullying and corporal punishment. Timely detection and early warning of physical violence on campus have therefore become important research topics. Traditional detection methods rely on images, skeleton points, skeleton sequences, and optical flow, but single-source information struggles in complex scenarios, and crowded environments often cause skeleton points to be lost, reducing detection accuracy. To address these issues, this paper proposes a multimodal campus violence detection model (HMFFN) comprising a frontend multi-feature extraction module and a backend multi-feature fusion decision module. The frontend extracts skeleton sequences, human sequences, and panoramic images, combining inverted residual blocks, multi-scale atrous attention, and YOLO-Pose to improve detection accuracy, while DeepSORT tracks human subjects to maintain consistent spatiotemporal information. The backend fuses features with a quasi-three-branch network, employing AD-SkeletNet, Conv3D, and DenseNet to process skeleton sequences, human sequences, and panoramic images, respectively, followed by a pyramid network that fuses the features for detection. Experiments were conducted on the COCO dataset and self-constructed campus violence datasets (CVS-Video, SG-VDD, PV-HIS). The improved human pose estimation model showed 2.2% and 4.9% increases in AP50 and AP90 on object detection benchmarks, and 2.7% and 3.6% increases for keypoints. The multimodal fusion model outperformed the single-feature model by approximately 22.2% in accuracy, precision, recall, and F1-score, and the sequence-only model by about 7.7%. The results demonstrate that the proposed model effectively addresses the existing challenges and achieves promising detection and early-warning performance.
DOI: 10.1109/AIDL66202.2024.00022
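
Illustrative sketch: the abstract describes the backend as a quasi-three-branch network in which AD-SkeletNet, Conv3D, and DenseNet process skeleton sequences, human sequences, and panoramic images before a pyramid network fuses their features. The minimal PyTorch sketch below shows only the general shape of such a three-branch fusion classifier; the branch modules (a GRU standing in for AD-SkeletNet, small 3D and 2D CNNs standing in for the Conv3D and DenseNet branches), the feature sizes, and the concatenation head standing in for the pyramid fusion network are all assumptions made for illustration, not the authors' implementation.

# Minimal sketch of a three-branch multimodal fusion classifier in PyTorch.
# Branch designs, feature sizes, and the fusion head are illustrative
# assumptions; the paper's AD-SkeletNet and pyramid-fusion details are not
# specified in the abstract.
import torch
import torch.nn as nn


class SkeletonBranch(nn.Module):
    """Stand-in for AD-SkeletNet: encodes a sequence of 2D keypoints."""
    def __init__(self, num_joints=17, hidden=128):
        super().__init__()
        # Each frame is flattened to (num_joints * 2) coordinates.
        self.gru = nn.GRU(num_joints * 2, hidden, batch_first=True)

    def forward(self, x):          # x: (B, T, J, 2)
        b, t, j, c = x.shape
        _, h = self.gru(x.reshape(b, t, j * c))
        return h[-1]               # (B, hidden)


class HumanSequenceBranch(nn.Module):
    """3D CNN over cropped person clips (the Conv3D branch)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, hidden),
        )

    def forward(self, x):          # x: (B, 3, T, H, W)
        return self.net(x)


class PanoramaBranch(nn.Module):
    """2D CNN over the full-scene image (DenseNet stand-in)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, hidden),
        )

    def forward(self, x):          # x: (B, 3, H, W)
        return self.net(x)


class ThreeBranchFusion(nn.Module):
    """Concatenates branch features and classifies (fusion head assumed)."""
    def __init__(self, hidden=128, num_classes=2):
        super().__init__()
        self.skeleton = SkeletonBranch(hidden=hidden)
        self.human = HumanSequenceBranch(hidden=hidden)
        self.panorama = PanoramaBranch(hidden=hidden)
        self.head = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, skel, clip, pano):
        f = torch.cat([self.skeleton(skel), self.human(clip),
                       self.panorama(pano)], dim=1)
        return self.head(f)


if __name__ == "__main__":
    model = ThreeBranchFusion()
    skel = torch.randn(2, 16, 17, 2)        # 16 frames of 17 keypoints
    clip = torch.randn(2, 3, 16, 112, 112)  # cropped person clip
    pano = torch.randn(2, 3, 224, 224)      # panoramic frame
    print(model(skel, clip, pano).shape)    # torch.Size([2, 2])

Late feature concatenation is only one possible reading of "multi-feature fusion decision"; the paper's pyramid network presumably fuses features at multiple scales rather than with a single linear head as sketched here.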