PAX-Net: Multimodal Campus Violence Detection: An AI-Powered Multi-Modal Model for Identifying and Preventing School Violence
Published in: 2024 International Conference on Artificial Intelligence and Digital Libraries (AIDL), pp. 70–74
Main Authors:
Format: Conference Proceeding
Language: English
Published: IEEE, 13.12.2024
Summary: In recent years, campus violence has become increasingly serious, threatening campus safety and harming students' mental health and academic performance. Campus violence includes physical, psychological, and sexual violence, with physical violence often accompanied by bullying and corporal punishment. Timely detection and early warning of physical violence on campus have therefore become important research topics. Traditional detection methods rely on images, skeleton points, skeleton sequences, and optical flow, but single-source information struggles with complex scenarios, and crowded environments often cause skeleton points to be lost, reducing detection accuracy. To address these issues, this paper proposes a multimodal campus violence detection model (HMFFN) comprising a frontend multi-feature extraction module and a backend multi-feature fusion decision module. The frontend extracts skeleton sequences, human sequences, and panoramic images, combining inverted residual blocks, multi-scale atrous attention, and YOLO-Pose to improve detection accuracy, while DeepSORT tracks human subjects to keep spatiotemporal information consistent. The backend fuses features with a quasi-three-branch network, employing AD-SkeletNet, Conv3D, and DenseNet to process skeleton sequences, human sequences, and panoramic images, respectively, followed by a pyramid network that fuses the features for detection. Experiments were conducted on the COCO dataset and self-constructed campus violence datasets (CVS-Video, SG-VDD, PV-HIS). The improved human pose estimation model showed 2.2% and 4.9% increases in AP50 and AP90 on object detection benchmarks, and 2.7% and 3.6% increases for keypoints. The multimodal fusion model outperformed the single-feature model by approximately 22.2% in accuracy, precision, recall, and F1-score, and the sequence-only model by about 7.7%. The results demonstrate that the proposed model effectively addresses the existing challenges and achieves promising detection and early-warning performance.
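To make the frontend pipeline in the summary concrete, the sketch below shows per-frame pose detection followed by identity tracking, so each person yields a consistent skeleton sequence over time. The paper's improved detector (inverted residual blocks, multi-scale atrous attention) is not public, so a stock Ultralytics pose checkpoint and the deep-sort-realtime tracker stand in for YOLO-Pose and DeepSORT; the input file name, sequence structure, and centre-distance matching heuristic are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): pose detection + DeepSORT tracking
# to build per-identity skeleton sequences from a video stream.
import cv2
import numpy as np
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

pose_model = YOLO("yolov8n-pose.pt")  # stand-in for the improved YOLO-Pose
tracker = DeepSort(max_age=30)        # keep identities across short occlusions

skeleton_seqs = {}  # track_id -> list of (17, 2) COCO keypoint arrays

cap = cv2.VideoCapture("campus_clip.mp4")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    res = pose_model(frame, verbose=False)[0]
    boxes = res.boxes.xyxy.cpu().numpy()    # (N, 4) person boxes, xyxy
    confs = res.boxes.conf.cpu().numpy()    # (N,) detection confidences
    kpts = res.keypoints.xy.cpu().numpy()   # (N, 17, 2) keypoint coordinates
    # DeepSORT expects ([left, top, width, height], confidence, class) tuples.
    dets = [([x1, y1, x2 - x1, y2 - y1], c, "person")
            for (x1, y1, x2, y2), c in zip(boxes, confs)]
    tracks = tracker.update_tracks(dets, frame=frame)
    if len(boxes) == 0:
        continue
    centres = (boxes[:, :2] + boxes[:, 2:]) / 2  # detection box centres
    for track in tracks:
        if not track.is_confirmed():
            continue
        # Attach each track to the nearest detection so its keypoints inherit
        # the track identity (a simple heuristic, not the paper's association).
        l, t, r, b = track.to_ltrb()
        i = int(np.argmin(np.linalg.norm(
            centres - np.array([(l + r) / 2, (t + b) / 2]), axis=1)))
        skeleton_seqs.setdefault(track.track_id, []).append(kpts[i])
cap.release()
```

Cropped human sequences for the second modality can be built the same way, slicing each frame with `track.to_ltrb()` per confirmed track.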
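The backend's quasi-three-branch fusion can likewise be sketched in PyTorch. AD-SkeletNet is the paper's own module and is not public, so a plain Conv1D stack stands in for it here; a small Conv3D stack processes cropped human clips, a torchvision DenseNet-121 handles panoramic frames, and the paper's pyramid fusion is approximated by concatenation plus an MLP. All layer sizes, names, and input shapes are assumptions.

```python
# Illustrative three-branch fusion sketch; not the paper's exact architecture.
import torch
import torch.nn as nn
from torchvision.models import densenet121

class ThreeBranchFusion(nn.Module):
    def __init__(self, num_classes=2, num_joints=17):
        super().__init__()
        # Branch 1: skeleton sequences (Conv1D stand-in for AD-SkeletNet).
        # Input: (B, T, num_joints * 2) flattened (x, y) keypoints per frame.
        self.skeleton = nn.Sequential(
            nn.Conv1d(num_joints * 2, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),                 # -> (B, 64)
        )
        # Branch 2: cropped human sequences via Conv3D.
        # Input: (B, 3, T, H, W) RGB clips of tracked persons.
        self.human = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),                 # -> (B, 32)
        )
        # Branch 3: panoramic frames via DenseNet-121.
        backbone = densenet121(weights=None)
        backbone.classifier = nn.Identity()  # keep the 1024-d feature vector
        self.panoramic = backbone
        # Fusion head: concat + MLP approximates the paper's pyramid fusion.
        self.head = nn.Sequential(
            nn.Linear(64 + 32 + 1024, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, skel, clip, frame):
        f1 = self.skeleton(skel.transpose(1, 2))  # (B, J*2, T) for Conv1d
        f2 = self.human(clip)
        f3 = self.panoramic(frame)
        return self.head(torch.cat([f1, f2, f3], dim=1))
```

For example, inputs shaped `(B, 32, 34)`, `(B, 3, 32, 112, 112)`, and `(B, 3, 224, 224)` would yield `(B, 2)` violence/non-violence logits.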
DOI: 10.1109/AIDL66202.2024.00022