TY - JOUR
T1 - Scene-Dependent Prediction in Latent Space for Video Anomaly Detection and Anticipation
AU - Cao, Congqi
AU - Zhang, Hanwen
AU - Lu, Yue
AU - Wang, Peng
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 1979-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Video anomaly detection (VAD) plays a crucial role in intelligent surveillance. However, an essential type of anomaly named scene-dependent anomaly is overlooked. Moreover, the task of video anomaly anticipation (VAA) also deserves attention. To fill these gaps, we build a comprehensive dataset named NWPU Campus, which is the largest semi-supervised VAD dataset and the first dataset for scene-dependent VAD and VAA. Meanwhile, we introduce a novel forward-backward framework for scene-dependent VAD and VAA, in which the forward network individually solves the VAD and jointly solves the VAA with the backward network. Particularly, we propose a scene-dependent generative model in latent space for the forward and backward networks. First, we propose a hierarchical variational auto-encoder to extract scene-generic features. Next, we design a score-based diffusion model in latent space to refine these features more compact for the task and generate scene-dependent features with a scene information auto-encoder, modeling the relationships between video events and scenes. Finally, we develop a temporal loss from key frames to constrain the motion consistency of video clips. Extensive experiments demonstrate that our method can handle both scene-dependent anomaly detection and anticipation well, achieving state-of-the-art performance on ShanghaiTech, CUHK Avenue, and the proposed NWPU Campus datasets.
AB - Video anomaly detection (VAD) plays a crucial role in intelligent surveillance. However, an essential type of anomaly named scene-dependent anomaly is overlooked. Moreover, the task of video anomaly anticipation (VAA) also deserves attention. To fill these gaps, we build a comprehensive dataset named NWPU Campus, which is the largest semi-supervised VAD dataset and the first dataset for scene-dependent VAD and VAA. Meanwhile, we introduce a novel forward-backward framework for scene-dependent VAD and VAA, in which the forward network individually solves the VAD and jointly solves the VAA with the backward network. Particularly, we propose a scene-dependent generative model in latent space for the forward and backward networks. First, we propose a hierarchical variational auto-encoder to extract scene-generic features. Next, we design a score-based diffusion model in latent space to refine these features more compact for the task and generate scene-dependent features with a scene information auto-encoder, modeling the relationships between video events and scenes. Finally, we develop a temporal loss from key frames to constrain the motion consistency of video clips. Extensive experiments demonstrate that our method can handle both scene-dependent anomaly detection and anticipation well, achieving state-of-the-art performance on ShanghaiTech, CUHK Avenue, and the proposed NWPU Campus datasets.
KW - Scene-dependent anomaly
KW - diffusion models
KW - prediction network
KW - video anomaly detection and anticipation
UR - http://www.scopus.com/inward/record.url?scp=85204453216&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2024.3461718
DO - 10.1109/TPAMI.2024.3461718
M3 - 文章
AN - SCOPUS:85204453216
SN - 0162-8828
VL - 47
SP - 224
EP - 239
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 1
ER -