Delving into CLIP latent space for Video Anomaly Recognition

Bibliographic Details
Published in Computer vision and image understanding Vol. 249; p. 104163
Main Authors Zanella, Luca; Liberatori, Benedetta; Menapace, Willi; Poiesi, Fabio; Wang, Yiming; Ricci, Elisa
Format Journal Article
Language English
Published Elsevier Inc 01.12.2024
ISSN 1077-3142
DOI 10.1016/j.cviu.2024.104163

Summary: We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Vision and Language Models (VLMs), such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also leverage a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods considering three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies. Project website and code are available at https://lucazanella.github.io/AnomalyCLIP/.

• Vision and Language Model (VLM) for video anomaly detection and recognition.
• VLM feature space transformation using normality prototype for direction learning.
• A Selector model using transformed VLM space for robust abnormal segment selection.
• A Temporal model capturing short-term frame relations and long-term dependencies.
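The geometric idea in the summary (re-centre the feature space on a normality prototype, then project frames onto text-driven directions and read off the magnitude) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the features are synthetic rather than CLIP outputs, the prototype is assumed to be the mean of normal-frame features, the directions are fixed instead of learned, and the name `project_onto_directions` is hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def project_onto_directions(frame_feats, normal_prototype, text_feats):
    """Hypothetical helper, not taken from the AnomalyCLIP code.

    frame_feats:      (T, D) per-frame features
    normal_prototype: (D,)   assumed here to be the mean normal feature
    text_feats:       (C, D) one text feature per anomaly class
    Returns a (T, C) matrix of signed projection magnitudes.
    """
    centered = frame_feats - normal_prototype            # re-centre on normality
    directions = l2_normalize(text_feats - normal_prototype)
    return centered @ directions.T


# Synthetic stand-ins for real features (assumption: no CLIP model is loaded).
rng = np.random.default_rng(0)
D, C = 16, 3
normal_frames = rng.normal(size=(200, D)) * 0.1 + 1.0
prototype = normal_frames.mean(axis=0)
text_feats = prototype + 2.0 * np.eye(C, D)              # toy class directions

# One frame displaced from the prototype along the class-1 direction.
anomalous = prototype + 3.0 * l2_normalize(text_feats[1] - prototype)
frames = np.vstack([normal_frames[:4], anomalous])

scores = project_onto_directions(frames, prototype, text_feats)
pred_class = int(np.argmax(scores[-1]))
```

In this toy setup the normal frames stay close to the origin of the re-centred space, so their projections are small, while the displaced frame has a large magnitude only along the direction of its own class.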