Semantic multimodal violence detection based on local-to-global embedding

Bibliographic Details
Published in	Neurocomputing (Amsterdam) Vol. 514; pp. 148-161
Main Authors	Pu, Yujiang, Wu, Xiaoyu, Wang, Shengjin, Huang, Yuming, Liu, Zihao, Gu, Chaonan
Format	Journal Article
Language	English
Published	Elsevier B.V. 01.12.2022
Subjects	Deep learning; Multimodal fusion; Semantic embedding; Violence detection
Summary:	Automatic violence detection has received continuous attention due to its broad application prospects. However, most previous work favors building a generalized pipeline while ignoring the complexity and diversity of violent scenes. In most cases, people judge violence by a variety of sub-concepts, such as blood, fighting, screams, and explosions, which may show certain co-occurrence trends. Therefore, we argue that parsing abstract violence into specific semantics helps to obtain the essential representation of violence. In this paper, we propose a semantic multimodal violence detection framework based on local-to-global embedding. The local semantic detection branch is designed to capture fine-grained violent elements in the video via a set of local semantic detectors, which are generated from a variety of external word embeddings. We also introduce a global semantic alignment branch to mitigate the intra-class variance of violence, in which violent video embeddings are guided to form a compact cluster while keeping a semantic gap from non-violent embeddings. Furthermore, we construct a multimodal cross-fusion network (MCN) for multimodal feature fusion, which consists of a cross-adaptive module and a cross-perceptual module. The former aims to eliminate inter-modal heterogeneity, while the latter suppresses task-irrelevant redundancies to obtain robust video representations. Extensive experiments demonstrate the effectiveness of the proposed method, which has a superior generalization capacity and achieves competitive performance on five violence datasets.
ISSN:	0925-2312 1872-8286
DOI:	10.1016/j.neucom.2022.09.090
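
Illustrative sketch:	The summary describes the global semantic alignment branch only in words: violent video embeddings form a compact cluster and keep a gap from non-violent ones. The Python sketch below shows one common way such an objective can be written, a contrastive-style loss with a pull term and a margin-based push term. It is not taken from the paper; the function names (`centroid`, `alignment_loss`), the use of Euclidean distance, the cluster centroid as anchor, and the `margin` parameter are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def alignment_loss(violent, non_violent, margin=1.0):
    """Hypothetical alignment objective (an assumption, not the paper's loss).

    pull: mean squared distance of violent embeddings to their centroid,
          which encourages a compact violent cluster.
    push: mean squared hinge on (margin - distance) for non-violent
          embeddings, which penalizes any that fall within `margin`
          of the violent centroid.
    """
    center = centroid(violent)
    pull = sum(euclidean(v, center) ** 2 for v in violent) / len(violent)
    push = sum(
        max(0.0, margin - euclidean(n, center)) ** 2 for n in non_violent
    ) / len(non_violent)
    return pull + push


# Toy 2-D embeddings: the violent cluster is tight, and one non-violent
# embedding sits inside the margin around the violent centroid.
violent = [[1.0, 0.0], [1.0, 0.0]]
far_non_violent = [[1.0, 3.0]]
near_non_violent = [[1.0, 0.5]]

loss_far = alignment_loss(violent, far_non_violent)
loss_near = alignment_loss(violent, near_non_violent)
```

In this toy setup the loss is zero when the non-violent embedding lies outside the margin and becomes positive once it moves inside, which is the "semantic gap" behavior the summary describes. A trainable version would compute the same quantities on learned embeddings in a deep learning framework.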