Semantic multimodal violence detection based on local-to-global embedding
Published in | Neurocomputing (Amsterdam) Vol. 514; pp. 148-161 |
---|---|
Format | Journal Article |
Language | English |
Published | Elsevier B.V., 01.12.2022 |
Summary: | Automatic violence detection has received continuous attention due to its broad application prospects. However, most previous work prefers building a generalized pipeline while ignoring the complexity and diversity of violent scenes. In most cases, people judge violence by a variety of sub-concepts, such as blood, fighting, screams, explosions, etc., which may show certain co-occurrence trends. Therefore, we argue that parsing abstract violence into specific semantics helps to obtain the essential representation of violence. In this paper, we propose a semantic multimodal violence detection framework based on local-to-global embedding. The local semantic detection is designed to capture fine-grained violent elements in the video via a set of local semantic detectors, which is generated from a variety of external word embeddings. Also, we introduce a global semantic alignment branch to mitigate the intra-class variance of violence, in which violent video embeddings are guided to form a compact cluster while keeping a semantic gap with non-violent embeddings. Furthermore, we construct a multimodal cross-fusion network (MCN) for multimodal feature fusion, which consists of a cross-adaptive module and a cross-perceptual module. The former aims to eliminate inter-modal heterogeneity, while the latter suppresses task-irrelevant redundancies to obtain robust video representations. Extensive experiments demonstrate the effectiveness of the proposed method, which has a superior generalization capacity and achieves competitive performance on five violence datasets. |
---|---|
ISSN: | 0925-2312, 1872-8286 |
DOI: | 10.1016/j.neucom.2022.09.090 |
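
The summary describes a global semantic alignment branch in which violent video embeddings are guided to form a compact cluster while keeping a semantic gap with non-violent embeddings. The record does not give the paper's loss function, so the following is only a minimal illustrative sketch of that general idea, assuming a centroid-based compactness term plus a hinge-style margin term. The function names, the Euclidean distance, and the `margin` parameter are all hypothetical and are not taken from the paper.

```python
import math


def centroid(vectors):
    """Mean vector of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def alignment_loss(violent, non_violent, margin=1.0):
    """Illustrative loss (an assumption, not the paper's formulation).

    compact: mean squared distance of violent embeddings to their centroid,
             which shrinks as the violent cluster tightens.
    gap:     mean squared hinge penalty for non-violent embeddings that lie
             closer than `margin` to the violent centroid.
    """
    c = centroid(violent)
    compact = sum(distance(v, c) ** 2 for v in violent) / len(violent)
    gap = sum(
        max(0.0, margin - distance(n, c)) ** 2 for n in non_violent
    ) / len(non_violent)
    return compact + gap


if __name__ == "__main__":
    violent = [[0.0, 0.0], [2.0, 0.0]]
    near = [[1.0, 0.5]]   # inside the margin around the violent centroid
    far = [[1.0, 5.0]]    # well outside the margin
    print(alignment_loss(violent, near))
    print(alignment_loss(violent, far))
```

In this sketch a non-violent embedding that sits inside the margin raises the loss, while one far from the violent centroid contributes nothing, which mirrors the "compact cluster plus semantic gap" behaviour the summary describes.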