Video anomaly detection with spatio-temporal dissociation
| Published in | Pattern Recognition, Vol. 122, p. 108213 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Elsevier Ltd, 01.02.2022 |
Summary:

• We propose a novel autoencoder architecture that dissociates the spatio-temporal representation and learns the regularity in both the spatial and motion feature spaces to detect anomalies in videos.
• We design an efficient motion autoencoder that takes consecutive video frames as input and the RGB difference as output to imitate the motion of optical flow. The proposed method is much faster than optical flow-based motion representation learning, running at an average of 32 fps.
• We exploit a variance attention module that automatically assigns an importance weight to the moving parts of video clips, which improves the performance of the motion autoencoder (a sketch of both motion-branch ingredients follows this list).
• To learn normality in both the spatial and motion feature spaces, we concatenate the representations extracted from the two streams at the same spatial location, and optimize the two streams and the deep K-means clustering jointly with an early-fusion strategy.
• We fuse the spatio-temporal information with its distance from the deep K-means clusters at the pixel level to calculate the anomaly score. Compared with our prior frame-level fusion scheme, experimental results show that the new architecture improves performance.
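As a rough, non-authoritative illustration of the two motion-branch ingredients above, the sketch below computes the RGB-difference target and a variance-based attention map in PyTorch. The function names, tensor shapes, channel pooling, and min-max normalization are all assumptions for illustration; the released repository is the authoritative reference.

```python
# Minimal sketch, NOT the authors' released code: RGB-difference motion target
# and a variance-based attention map, under an assumed (B, T, C, H, W) clip shape.
import torch

def rgb_difference_target(clip: torch.Tensor) -> torch.Tensor:
    """Frame-to-frame RGB differences used as the motion-branch target,
    a cheap stand-in for optical flow. clip: (B, T, C, H, W)."""
    return clip[:, 1:] - clip[:, :-1]                # (B, T-1, C, H, W)

def variance_attention(clip: torch.Tensor) -> torch.Tensor:
    """Per-pixel temporal variance rescaled to [0, 1]; large-movement
    regions get weights near 1 (normalization scheme is an assumption)."""
    var = clip.var(dim=1).mean(dim=1, keepdim=True)  # (B, 1, H, W)
    vmin = var.amin(dim=(2, 3), keepdim=True)
    vmax = var.amax(dim=(2, 3), keepdim=True)
    return (var - vmin) / (vmax - vmin + 1e-8)

# Usage: weight the motion reconstruction loss by the attention map.
clip = torch.rand(2, 4, 3, 128, 128)                 # four consecutive frames
target = rgb_difference_target(clip)                 # (2, 3, 3, 128, 128)
attn = variance_attention(clip)                      # (2, 1, 128, 128)
pred = torch.zeros_like(target)                      # stand-in for decoder output
loss = (attn.unsqueeze(1) * (pred - target) ** 2).mean()
```

Frame differencing costs a single tensor subtraction per clip, which is consistent with the reported speed advantage over optical flow-based motion representations.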
Anomaly detection in videos remains a challenging task due to the ambiguous definition of anomaly and the complexity of visual scenes in real video data. Unlike previous work that uses reconstruction or prediction as an auxiliary task to learn temporal regularity, in this work we explore a novel convolutional autoencoder architecture that dissociates the spatio-temporal representation to capture the spatial and the temporal information separately, since abnormal events usually differ from normality in appearance and/or motion behavior. Specifically, the spatial autoencoder models normality in the appearance feature space by learning to reconstruct the input of the first individual frame (FIF), while the temporal part takes the first four consecutive frames as input and the RGB difference as output to simulate the motion of optical flow in an efficient way. Abnormal events, which are irregular in appearance or in motion behavior, lead to a large reconstruction error. To improve detection performance on fast-moving outliers, we exploit a variance-based attention module and insert it into the motion autoencoder to highlight regions with large movement. In addition, we propose a deep K-means clustering strategy to force the spatial and motion encoders to extract a compact representation. Extensive experiments on publicly available datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance. The code is publicly released at https://github.com/ChangYunPeng/VideoAnomalyDetection.
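To make the scoring concrete, here is a hypothetical sketch of the deep K-means distance and the pixel-level score fusion described above. The fusion weight `lam`, the max-over-pixels frame score, and all tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of the deep K-means distance and pixel-level score fusion;
# shapes, the fusion weight, and the max-pooling are assumptions.
import torch

def kmeans_distance(features: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """features: (B, D, H, W) fused spatial+motion representation;
    centers: (K, D) learnable cluster centers.
    Returns the distance to the nearest center at each location, (B, H, W)."""
    f = features.permute(0, 2, 3, 1)                      # (B, H, W, D)
    d = torch.cdist(f.reshape(-1, f.shape[-1]), centers)  # (B*H*W, K)
    return d.min(dim=-1).values.reshape(f.shape[:3])      # (B, H, W)

def anomaly_score(recon_err: torch.Tensor, dist: torch.Tensor,
                  lam: float = 0.5) -> torch.Tensor:
    """Fuse per-pixel reconstruction error with cluster distance and take
    the maximum over pixels as the frame-level anomaly score."""
    score_map = recon_err + lam * dist                    # (B, H, W)
    return score_map.flatten(1).max(dim=1).values         # (B,)

# During training, dist.mean() would be added to the reconstruction loss so the
# encoders and cluster centers are optimized jointly, keeping features compact.
feats = torch.rand(2, 64, 16, 16)       # fused two-stream features (assumed grid)
centers = torch.rand(10, 64)            # K = 10 assumed centers
recon_err = torch.rand(2, 16, 16)       # per-pixel error on the same grid (assumed)
print(anomaly_score(recon_err, kmeans_distance(feats, centers)))
```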
| ISSN | 0031-3203 (print); 1873-5142 (electronic) |
|---|---|
| DOI | 10.1016/j.patcog.2021.108213 |