A Novel Unsupervised Video Anomaly Detection Framework Based on Optical Flow Reconstruction and Erased Frame Prediction

Bibliographic Details
Published in	Sensors (Basel, Switzerland) Vol. 23; no. 10; p. 4828
Main Authors	Huang, Heqing, Zhao, Bing, Gao, Fei, Chen, Penghui, Wang, Jun, Hussain, Amir
Format	Journal Article
Language	English
Published	Switzerland: MDPI AG, 17.05.2023
Subjects	Anomalies; Artificial intelligence; Buildings; Computational linguistics; Datasets; Deep learning; Incomplete event; Language processing; Machine learning; Natural language interfaces; Natural language processing; Neural networks; Optical flow; Optical flow (image analysis); Optical memory (data storage); Reconstruction; Remodeling, restoration, etc; Semantics; Smart cities; Surveillance; Surveillance equipment; Unsupervised learning; Video anomaly detection
Online Access	Get full text

More Information
Summary:	Reconstruction-based and prediction-based approaches are widely used for video anomaly detection (VAD) in smart city surveillance applications. However, neither approach effectively exploits the rich contextual information in videos, which makes it difficult to perceive anomalous activities accurately. In this paper, we borrow the "Cloze Test" training strategy from natural language processing (NLP) and introduce a novel unsupervised learning framework that encodes both motion and appearance information at the object level. Specifically, to store the normal modes of video activity reconstructions, we first design an optical stream memory network with skip connections. Secondly, we build a space-time cube (STC) as the basic processing unit of the model and erase a patch in the STC to form the frame to be reconstructed. This yields a so-called "incomplete event" (IE) that the model must complete. On this basis, a conditional autoencoder is used to capture the high correspondence between optical flow and the STC. The model predicts the erased patches in IEs from the context of the preceding and following frames. Finally, we employ a generative adversarial network (GAN)-based training method to improve VAD performance. By distinguishing the predicted erased optical flow and the erased video frame, the proposed method, which helps reconstruct the original video in the IE, is shown to produce more reliable anomaly detection results. Comparative experiments on the benchmark UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets demonstrate AUROC scores of 97.7%, 89.7%, and 75.8%, respectively.
Bibliography:	These authors contributed equally to this work.
ISSN:	1424-8220
DOI:	10.3390/s23104828
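
The Summary describes erasing one patch from a space-time cube (STC), predicting it from the surrounding frames, and reporting AUROC. The following is a minimal toy sketch of that idea in plain Python, not the authors' implementation. All function names (`make_incomplete_event`, `predict_erased`, `mse`, `auroc`) are hypothetical, and the predictor is a simple neighbour average standing in for the paper's conditional autoencoder and GAN training, which the record does not specify in detail.

```python
from typing import List, Sequence, Tuple

Patch = List[List[float]]  # one H x W grayscale patch cropped around an object


def make_incomplete_event(
    stc: Sequence[Patch], erase_idx: int
) -> Tuple[List[Patch], Patch]:
    """Erase one patch from an STC; return (incomplete event, erased target)."""
    target = stc[erase_idx]
    blank = [[0.0] * len(row) for row in target]
    incomplete = [blank if t == erase_idx else p for t, p in enumerate(stc)]
    return incomplete, target


def predict_erased(incomplete: Sequence[Patch], erase_idx: int) -> Patch:
    """Stand-in predictor (an assumption, NOT the paper's model):
    average the patches just before and just after the erased one.
    Requires 0 < erase_idx < len(incomplete) - 1."""
    prev_p = incomplete[erase_idx - 1]
    next_p = incomplete[erase_idx + 1]
    return [
        [(a + b) / 2 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_p, next_p)
    ]


def mse(a: Patch, b: Patch) -> float:
    """Mean squared error between two patches, used here as the anomaly score."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a random anomalous sample (label 1)
    scores higher than a random normal sample (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

In this sketch, an STC whose patches change smoothly over time is predicted well from its neighbours and receives a low score, while an STC with an abrupt change in the erased position receives a high one. Scoring many STCs this way and passing the scores with ground-truth labels to `auroc` mirrors, in spirit only, how a frame-level AUROC such as those quoted in the Summary would be computed.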