VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval

Peng Wu, Wanshun Su, Xiangteng He, Peng Wang, Yanning Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Video anomaly retrieval (VAR) aims to retrieve pertinent abnormal or normal videos from collections of untrimmed, long videos through cross-modal queries such as textual descriptions and synchronized audio. Cross-modal pre-training (CMP) models, by pre-training on large-scale cross-modal pairs, e.g., image and text, can learn rich associations between different modalities, and this cross-modal association capability gives CMP an advantage in conventional retrieval tasks. Inspired by this, how to exploit the robust cross-modal association capabilities of CMP in VAR to search for crucial visual components in these untrimmed, long videos becomes a critical research problem. Therefore, this paper proposes a VAR method based on CMP models, named VarCMP. First, a unified hierarchical alignment strategy is proposed to constrain the semantic and spatial consistency between video and text, as well as the semantic, temporal, and spatial consistency between video and audio. It fully leverages the cross-modal association capabilities of CMP models by considering cross-modal similarities at multiple granularities, enabling VarCMP to achieve effective all-round information matching for both video-text and video-audio VAR tasks. Moreover, to further address the problem of aligning untrimmed, long videos, an anomaly-biased weighting is devised in the fine-grained alignment, which identifies key segments in untrimmed long videos using anomaly priors and gives them more attention, thereby discarding irrelevant segment information and achieving more accurate matching with cross-modal queries. Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, achieving significant improvements over the best competitors by 5.0% and 5.3% R@1 on the text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets, respectively.
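The anomaly-biased weighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the segment embeddings, anomaly-prior scores, and softmax temperature below are hypothetical placeholders. The idea shown is that per-segment similarities to a cross-modal query are re-weighted by anomaly priors, so likely-anomalous segments dominate the video-level match while irrelevant background segments of a long untrimmed video are suppressed.

```python
import numpy as np

def anomaly_biased_similarity(seg_feats, query_feat, anomaly_scores, tau=1.0):
    """Aggregate segment-query similarities, weighting segments by anomaly priors.

    seg_feats:      (T, D) array of per-segment embeddings (placeholder features)
    query_feat:     (D,) embedding of the text/audio query
    anomaly_scores: (T,) anomaly-prior scores; higher = more likely anomalous
    tau:            softmax temperature for the anomaly weighting (assumed)
    """
    # Cosine similarity between each segment and the query
    seg = seg_feats / np.linalg.norm(seg_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    sims = seg @ q  # shape (T,)

    # Anomaly-biased weights: a softmax over anomaly scores emphasizes
    # likely-anomalous segments and discards irrelevant segment information
    w = np.exp(anomaly_scores / tau)
    w = w / w.sum()

    # Weighted aggregation yields the video-level matching score
    return float(np.dot(w, sims))
```

Under this sketch, a video whose anomalous segment matches the query scores higher than one where only background segments match, which is the intended effect of the anomaly prior.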

Original language: English
Title of host publication: Special Track on AI Alignment
Editors: Toby Walsh, Julie Shah, Zico Kolter
Publisher: Association for the Advancement of Artificial Intelligence
Pages: 8423-8431
Number of pages: 9
Edition: 8
ISBN (electronic): 157735897X, 9781577358978
DOI
Publication status: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 - 4 Mar 2025

Publication series

Name: Proceedings of the AAAI Conference on Artificial Intelligence
Number: 8
Volume: 39
ISSN (print): 2159-5399
ISSN (electronic): 2374-3468

Conference

Conference: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Country/Territory: United States
City: Philadelphia
Period: 25/02/25 - 4/03/25
