VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval

Peng Wu; Wanshun Su; Xiangteng He; Peng Wang; Yanning Zhang

doi:10.1609/aaai.v39i8.32909

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Video anomaly retrieval (VAR) aims to retrieve pertinent abnormal or normal videos from collections of long, untrimmed videos through cross-modal queries such as textual descriptions and synchronized audio. Cross-modal pre-training (CMP) models, by pre-training on large-scale cross-modal pairs, e.g., image and text, can learn rich associations between different modalities, and this cross-modal association capability gives CMP an advantage in conventional retrieval tasks. Inspired by this, how to utilize the robust cross-modal association capabilities of CMP in VAR to locate the crucial visual components in these long, untrimmed videos becomes a critical research problem. Therefore, this paper proposes a VAR method based on CMP models, named VarCMP. First, a unified hierarchical alignment strategy is proposed to constrain the semantic and spatial consistency between video and text, as well as the semantic, temporal, and spatial consistency between video and audio. It fully leverages the efficient cross-modal association capabilities of CMP models by considering cross-modal similarities at multiple granularities, enabling VarCMP to achieve effective all-round information matching for both video-text and video-audio VAR tasks. Moreover, to further address the alignment of long, untrimmed videos, an anomaly-biased weighting is devised in the fine-grained alignment; it uses anomaly priors to identify key segments in long, untrimmed videos and gives them more attention, thereby discarding irrelevant segment information and achieving more accurate matching with cross-modal queries. Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, with significant improvements over the best competitors on the text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets of 5.0% and 5.3% R@1, respectively.
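
The abstract does not give formulas, so the following is only a rough sketch of the two ideas it names: scoring a video against a query at a coarse and a fine granularity, with the fine-grained term weighted by an anomaly prior, and evaluating retrieval with R@1. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, mean pooling for the coarse term, softmax weighting with temperature `tau`, and the mixing factor `alpha`.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_similarity(segments, anomaly_scores, query, tau=0.1, alpha=0.5):
    """Hypothetical two-level video/query similarity.

    segments:       (T, D) embeddings of T video segments
    anomaly_scores: (T,)   prior anomaly score per segment (assumed given)
    query:          (D,)   embedding of the text or audio query
    """
    seg = l2_normalize(segments)
    q = l2_normalize(query)
    # Coarse level: compare the mean-pooled video embedding with the query.
    coarse = float(l2_normalize(seg.mean(axis=0)) @ q)
    # Fine level: per-segment cosine similarities, weighted so that segments
    # with a higher anomaly prior contribute more (assumed softmax weighting).
    weights = softmax(anomaly_scores / tau)
    fine = float(weights @ (seg @ q))
    return alpha * coarse + (1.0 - alpha) * fine

def recall_at_k(sim, k=1):
    """R@K for a (num_queries, num_videos) similarity matrix in which the
    ground-truth video for query i is assumed to sit at column i."""
    ranks = (-sim).argsort(axis=1)  # best match first
    targets = np.arange(sim.shape[0])[:, None]
    hits = (ranks[:, :k] == targets).any(axis=1)
    return float(hits.mean())
```

In this sketch, raising a segment's anomaly score shifts the fine-grained term toward that segment's similarity with the query, which is one simple way to down-weight irrelevant segments of a long video. `recall_at_k(sim, k=1)` then reports the fraction of queries whose ground-truth video is ranked first.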

Original language	English
Title of host publication	Special Track on AI Alignment
Editors	Toby Walsh, Julie Shah, Zico Kolter
Publisher	Association for the Advancement of Artificial Intelligence
Pages	8423-8431
Number of pages	9
Edition	8
ISBN (Electronic)	157735897X, 9781577358978
DOI	https://doi.org/10.1609/aaai.v39i8.32909
Publication status	Published - 11 Apr 2025
Event	39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States. Duration: 25 Feb 2025 → 4 Mar 2025

Publication series

Name	Proceedings of the AAAI Conference on Artificial Intelligence
Number	8
Volume	39
ISSN (Print)	2159-5399
ISSN (Electronic)	2374-3468

Conference

Conference	39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Country/Territory	United States
City	Philadelphia
Period	25/02/25 → 4/03/25

Cite this

Wu, P., Su, W., He, X., Wang, P., & Zhang, Y. (2025). VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Special Track on AI Alignment (8 ed., pp. 8423-8431). (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 39, No. 8). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v39i8.32909

@inproceedings{c9b1cec733874ffeac4fac6a130b0d82,
title = "VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval",
author = "Peng Wu and Wanshun Su and Xiangteng He and Peng Wang and Yanning Zhang",
note = "Publisher Copyright: Copyright {\textcopyright} 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 ; Conference date: 25-02-2025 Through 04-03-2025",
year = "2025",
month = apr,
day = "11",
doi = "10.1609/aaai.v39i8.32909",
language = "English",
series = "Proceedings of the AAAI Conference on Artificial Intelligence",
publisher = "Association for the Advancement of Artificial Intelligence",
number = "8",
pages = "8423--8431",
editor = "Toby Walsh and Julie Shah and Zico Kolter",
booktitle = "Special Track on AI Alignment",
edition = "8",
}

Wu, P, Su, W, He, X, Wang, P & Zhang, Y 2025, VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval. in T Walsh, J Shah & Z Kolter (eds), Special Track on AI Alignment. 8 edn, Proceedings of the AAAI Conference on Artificial Intelligence, no. 8, vol. 39, Association for the Advancement of Artificial Intelligence, pp. 8423-8431, 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025, Philadelphia, United States, 25/02/25. https://doi.org/10.1609/aaai.v39i8.32909

VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval. / Wu, Peng; Su, Wanshun; He, Xiangteng et al.
Special Track on AI Alignment. ed. / Toby Walsh; Julie Shah; Zico Kolter. 8. ed. Association for the Advancement of Artificial Intelligence, 2025. p. 8423-8431 (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 39, No. 8).

TY - GEN
T1 - VarCMP
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
AU - Wu, Peng
AU - Su, Wanshun
AU - He, Xiangteng
AU - Wang, Peng
AU - Zhang, Yanning
PY - 2025/4/11
Y1 - 2025/4/11
UR - http://www.scopus.com/inward/record.url?scp=105004324598&partnerID=8YFLogxK
U2 - 10.1609/aaai.v39i8.32909
DO - 10.1609/aaai.v39i8.32909
M3 - Conference contribution
AN - SCOPUS:105004324598
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 8423
EP - 8431
BT - Special Track on AI Alignment
A2 - Walsh, Toby
A2 - Shah, Julie
A2 - Kolter, Zico
PB - Association for the Advancement of Artificial Intelligence
Y2 - 25 February 2025 through 4 March 2025
ER -

Wu P, Su W, He X, Wang P, Zhang Y. VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval. In Walsh T, Shah J, Kolter Z, editors, Special Track on AI Alignment. 8 ed. Association for the Advancement of Artificial Intelligence. 2025. p. 8423-8431. (Proceedings of the AAAI Conference on Artificial Intelligence; 8). doi: 10.1609/aaai.v39i8.32909
