Audio–visual representation learning for anomaly events detection in crowds

Junyu Gao; Hao Yang; Maoguo Gong; Xuelong Li

doi:10.1016/j.neucom.2024.127489

Institute of Optoelectronics and Intelligence (光电与智能研究院)

Research output: Contribution to journal › Article › Peer-reviewed

11 citations (Scopus)

Abstract

In recent years, anomaly event detection in crowd scenes has attracted considerable research attention because of its importance to public safety. Existing methods usually rely on visual information alone to decide whether an abnormal event has occurred, because public places are generally equipped only with visual sensors. However, when an abnormal event occurs in a crowd, sound can be a discriminative cue that helps the crowd analysis system determine whether there is an abnormality. Compared with visual information, which is easily occluded, audio signals have a certain degree of penetration. This paper therefore attempts to exploit multi-modal learning to model the audio and visual signals simultaneously. Specifically, we design a two-branch network to model the two types of information. The first branch is a typical 3D CNN that extracts temporal appearance features from video clips. The second is an audio CNN that encodes the log Mel-spectrogram of the audio signal. Finally, the features from both branches are fused to produce a more accurate prediction. We conduct experiments on the SHADE dataset, a synthetic audio–visual dataset of surveillance scenes, and find that introducing audio signals effectively improves anomaly event detection performance and that the resulting method outperforms other state-of-the-art methods. Furthermore, we will release the code and the pre-trained models as soon as possible.
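The fusion step described in the abstract can be illustrated with a small sketch. The abstract does not say which fusion operator or classifier head the authors use, so everything below is an assumption made for illustration: the two branch outputs are treated as plain feature vectors, fusion is done by concatenation, and the prediction is a sigmoid over a linear layer. The function names `fuse_features` and `anomaly_score` are hypothetical and do not come from the paper or its code release.

```python
import math
from typing import List, Sequence


def fuse_features(visual: Sequence[float], audio: Sequence[float]) -> List[float]:
    """Fuse the two branch outputs by concatenation.

    Concatenation is an assumption; the abstract only says the
    features are fused, not how.
    """
    return list(visual) + list(audio)


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def anomaly_score(fused: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Map a fused feature vector to a score in [0, 1] with a linear layer.

    The linear-plus-sigmoid head is a hypothetical stand-in for the
    prediction layer, which the abstract does not specify.
    """
    if len(fused) != len(weights):
        raise ValueError("weights must have the same length as the fused features")
    logit = sum(f * w for f, w in zip(fused, weights)) + bias
    return _sigmoid(logit)


if __name__ == "__main__":
    # Toy stand-ins for the 3D CNN (visual) and audio CNN (log Mel) features.
    visual_feat = [0.2, 0.8]
    audio_feat = [0.5]
    fused = fuse_features(visual_feat, audio_feat)
    print(anomaly_score(fused, weights=[0.4, -0.1, 1.2], bias=-0.3))
```

In a real system the two input vectors would come from trained networks and the weights would be learned jointly; the sketch only shows how a fused representation lets audio evidence shift the final score.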

Original language	English
Article number	127489
Journal	Neurocomputing
Volume	582
DOI	https://doi.org/10.1016/j.neucom.2024.127489
Publication status	Published - 14 May 2024

Cite this

@article{b5fd78ae89f7491fb7093bf4b239b5e4,

title = "Audio–visual representation learning for anomaly events detection in crowds",

abstract = "In recent years, anomaly event detection in crowd scenes has attracted considerable research attention because of its importance to public safety. Existing methods usually rely on visual information alone to decide whether an abnormal event has occurred, because public places are generally equipped only with visual sensors. However, when an abnormal event occurs in a crowd, sound can be a discriminative cue that helps the crowd analysis system determine whether there is an abnormality. Compared with visual information, which is easily occluded, audio signals have a certain degree of penetration. This paper therefore attempts to exploit multi-modal learning to model the audio and visual signals simultaneously. Specifically, we design a two-branch network to model the two types of information. The first branch is a typical 3D CNN that extracts temporal appearance features from video clips. The second is an audio CNN that encodes the log Mel-spectrogram of the audio signal. Finally, the features from both branches are fused to produce a more accurate prediction. We conduct experiments on the SHADE dataset, a synthetic audio–visual dataset of surveillance scenes, and find that introducing audio signals effectively improves anomaly event detection performance and that the resulting method outperforms other state-of-the-art methods. Furthermore, we will release the code and the pre-trained models as soon as possible.",

keywords = "Anomaly events detection, Audio–visual representation learning, Crowd analysis, Multi-modal learning",

author = "Junyu Gao and Hao Yang and Maoguo Gong and Xuelong Li",

note = "Publisher Copyright: {\textcopyright} 2024 Elsevier B.V.",

year = "2024",

month = may,

day = "14",

doi = "10.1016/j.neucom.2024.127489",

language = "English",

volume = "582",

journal = "Neurocomputing",

issn = "0925-2312",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Audio–visual representation learning for anomaly events detection in crowds

AU - Gao, Junyu

AU - Yang, Hao

AU - Gong, Maoguo

AU - Li, Xuelong

PY - 2024/5/14

Y1 - 2024/5/14

N2 - In recent years, anomaly event detection in crowd scenes has attracted considerable research attention because of its importance to public safety. Existing methods usually rely on visual information alone to decide whether an abnormal event has occurred, because public places are generally equipped only with visual sensors. However, when an abnormal event occurs in a crowd, sound can be a discriminative cue that helps the crowd analysis system determine whether there is an abnormality. Compared with visual information, which is easily occluded, audio signals have a certain degree of penetration. This paper therefore attempts to exploit multi-modal learning to model the audio and visual signals simultaneously. Specifically, we design a two-branch network to model the two types of information. The first branch is a typical 3D CNN that extracts temporal appearance features from video clips. The second is an audio CNN that encodes the log Mel-spectrogram of the audio signal. Finally, the features from both branches are fused to produce a more accurate prediction. We conduct experiments on the SHADE dataset, a synthetic audio–visual dataset of surveillance scenes, and find that introducing audio signals effectively improves anomaly event detection performance and that the resulting method outperforms other state-of-the-art methods. Furthermore, we will release the code and the pre-trained models as soon as possible.

KW - Anomaly events detection

KW - Audio–visual representation learning

KW - Crowd analysis

KW - Multi-modal learning

UR - http://www.scopus.com/inward/record.url?scp=85188596248&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2024.127489

DO - 10.1016/j.neucom.2024.127489

M3 - Article

AN - SCOPUS:85188596248

SN - 0925-2312

VL - 582

JO - Neurocomputing

JF - Neurocomputing

M1 - 127489

ER -
