Scene-Dependent Prediction in Latent Space for Video Anomaly Detection and Anticipation

Congqi Cao, Hanwen Zhang, Yue Lu, Peng Wang, Yanning Zhang

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Video anomaly detection (VAD) plays a crucial role in intelligent surveillance. However, an essential type of anomaly, the scene-dependent anomaly, has been overlooked. Moreover, the task of video anomaly anticipation (VAA) also deserves attention. To fill these gaps, we build a comprehensive dataset named NWPU Campus, which is the largest semi-supervised VAD dataset and the first dataset for scene-dependent VAD and VAA. We also introduce a novel forward-backward framework for scene-dependent VAD and VAA, in which the forward network solves VAD on its own and solves VAA jointly with the backward network. In particular, we propose a scene-dependent generative model in latent space for the forward and backward networks. First, we propose a hierarchical variational auto-encoder to extract scene-generic features. Next, we design a score-based diffusion model in latent space that refines these features into a more compact form for the task and, together with a scene information auto-encoder, generates scene-dependent features, modeling the relationships between video events and scenes. Finally, we develop a temporal loss from key frames to constrain the motion consistency of video clips. Extensive experiments demonstrate that our method handles both scene-dependent anomaly detection and anticipation well, achieving state-of-the-art performance on the ShanghaiTech, CUHK Avenue, and proposed NWPU Campus datasets.

Original language: English
Pages (from-to): 224-239
Number of pages: 16
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 47
Issue number: 1
State: Published - 2025

Keywords

  • Scene-dependent anomaly
  • diffusion models
  • prediction network
  • video anomaly detection and anticipation
