VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval

Peng Wu, Wanshun Su, Xiangteng He, Peng Wang, Yanning Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Video anomaly retrieval (VAR) aims to retrieve pertinent abnormal or normal videos from collections of untrimmed, long videos via cross-modal queries such as textual descriptions and synchronized audio. Cross-modal pre-training (CMP) models, by pre-training on large-scale cross-modal pairs, e.g., image-text pairs, learn rich associations between modalities, and this cross-modal association capability gives CMP an advantage in conventional retrieval tasks. Inspired by this, how to exploit the robust cross-modal association capabilities of CMP in VAR to locate the crucial visual components within untrimmed, long videos becomes a critical research problem. To this end, this paper proposes a VAR method based on CMP models, named VarCMP. First, a unified hierarchical alignment strategy is proposed to enforce semantic and spatial consistency between video and text, as well as semantic, temporal, and spatial consistency between video and audio. By considering cross-modal similarities at multiple granularities, it fully leverages the cross-modal association capabilities of CMP models, enabling VarCMP to achieve effective all-round information matching for both video-text and video-audio VAR tasks. Moreover, to further address the alignment of untrimmed, long videos, an anomaly-biased weighting is devised for the fine-grained alignment: it identifies key segments in untrimmed, long videos using anomaly priors and gives them more attention, thereby discarding irrelevant segment information and achieving more accurate matching with cross-modal queries. Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, surpassing the best competitors by 5.0% R@1 on the text-video dataset UCFCrime-AR and 5.3% R@1 on the audio-video dataset XDViolence-AR.
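As a rough illustration of the anomaly-biased weighting described in the abstract, the following minimal PyTorch sketch reweights per-segment query similarities by an anomaly prior before aggregating them into a video-level score. The function name, the softmax temperature tau, and the source of the anomaly scores are assumptions for illustration, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def anomaly_biased_similarity(seg_feats, query_feat, anomaly_scores, tau=0.1):
    # seg_feats:      (N, D) features of N segments from an untrimmed video
    # query_feat:     (D,)   feature of the cross-modal query (text or audio)
    # anomaly_scores: (N,)   anomaly prior per segment; higher = more anomalous
    seg_feats = F.normalize(seg_feats, dim=-1)
    query_feat = F.normalize(query_feat, dim=-1)
    # Fine-grained similarity between every segment and the query.
    sims = seg_feats @ query_feat                      # (N,)
    # Anomaly-biased weights: key segments get more attention, so
    # irrelevant segment information is effectively discarded.
    weights = F.softmax(anomaly_scores / tau, dim=-1)  # (N,)
    # Weighted aggregation gives the video-level matching score.
    return (weights * sims).sum()

# Example: score one untrimmed video against one text query.
seg_feats = torch.randn(32, 512)    # 32 segment embeddings
query_feat = torch.randn(512)       # e.g., a text encoder output
anomaly_scores = torch.rand(32)     # from any off-the-shelf anomaly scorer
score = anomaly_biased_similarity(seg_feats, query_feat, anomaly_scores)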

Original language: English
Title of host publication: Special Track on AI Alignment
Editors: Toby Walsh, Julie Shah, Zico Kolter
Publisher: Association for the Advancement of Artificial Intelligence
Pages: 8423-8431
Number of pages: 9
Edition: 8
ISBN (Electronic): 157735897X, 9781577358978
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 - 4 Mar 2025

Publication series

Name: Proceedings of the AAAI Conference on Artificial Intelligence
Number: 8
Volume: 39
ISSN (Print): 2159-5399
ISSN (Electronic): 2374-3468

Conference

Conference: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Country/Territory: United States
City: Philadelphia
Period: 25/02/25 - 4/03/25
