Weakly Supervised Video Anomaly Detection Via Contrastive Clustering

Research output: Contribution to journal › Conference article › peer-review

Abstract

Weakly supervised video anomaly detection (WS-VAD) is challenging because it relies on video-level binary annotations to make frame-level predictions. Existing methods often convert WS-VAD into a multiple instance learning (MIL) task, focusing on the isolated segments that contribute most to the classification while neglecting temporal context and fine-grained feature distinctions. In this paper, we propose a contrastive clustering strategy that enhances the representation of normal and abnormal features. Specifically, we treat the cluster center features and the features of their corresponding categories as positive sample pairs, while features from different categories are treated as negative samples. This approach enables the network to better explore the distinction between normal and abnormal features. Furthermore, we address the bias in pre-trained models, where pre-trained I3D features tend to overfit to normal videos and CLIP features exhibit a bias towards abnormal videos. To mitigate this, we introduce a simple early fusion method that combines the pre-trained features to reduce this bias and obtain more comprehensive spatio-temporal representations. Extensive experiments on the UCF-Crime and XD-Violence datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art performance.
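
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes early fusion means concatenating per-segment I3D and CLIP features along the channel axis, assumes feature sizes of 1024 and 512, assumes segment-level labels are available (in practice they would be pseudo-labels, since only video-level labels exist), and simplifies the negatives to the center of the other category. The function names `early_fusion` and `contrastive_cluster_loss` are hypothetical.

```python
import numpy as np

def early_fusion(i3d_feats, clip_feats):
    # Assumed form of early fusion: concatenate the two feature
    # streams per segment before any further processing.
    return np.concatenate([i3d_feats, clip_feats], axis=-1)

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def contrastive_cluster_loss(feats, labels, centers, temperature=0.1):
    # InfoNCE-style loss: each feature is pulled towards the center of
    # its own category (positive pair) and pushed away from the center
    # of the other category (negative, a simplification).
    f = l2_normalize(feats)
    c = l2_normalize(centers)
    logits = f @ c.T / temperature                      # shape (N, K)
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

# Toy usage with random stand-ins for pre-extracted features.
rng = np.random.default_rng(0)
i3d = rng.normal(size=(8, 1024))   # assumed I3D feature size
clip = rng.normal(size=(8, 512))   # assumed CLIP feature size
fused = early_fusion(i3d, clip)    # shape (8, 1536)

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = normal, 1 = abnormal
# Cluster centers taken as per-category means (an assumption).
centers = np.stack([fused[labels == k].mean(axis=0) for k in (0, 1)])

loss = contrastive_cluster_loss(fused, labels, centers)
flipped_loss = contrastive_cluster_loss(fused, 1 - labels, centers)
```

In this toy setup the loss is lower when features are paired with their own category center than when the labels are flipped, which is the behavior a contrastive clustering objective is meant to encourage.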

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 16 - Peace, Justice and Strong Institutions

Keywords

  • Contrastive clustering
  • Feature fusion
  • Representation learning
  • Video anomaly detection
