TY - JOUR
T1 - Audio–visual representation learning for anomaly events detection in crowds
AU - Gao, Junyu
AU - Yang, Hao
AU - Gong, Maoguo
AU - Li, Xuelong
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2024/5/14
Y1 - 2024/5/14
AB - In recent years, anomaly event detection in crowd scenes has attracted much research attention because of its importance to public safety. Existing methods usually exploit only visual information to determine whether an abnormal event has occurred, since public places are generally equipped with visual sensors alone. However, when an abnormal event occurs in a crowd, sound information can be discriminative and help the crowd analysis system decide whether there is an abnormality. Unlike visual information, which is easily occluded, audio signals have a certain degree of penetration. Thus, this paper exploits multi-modal learning to model audio and visual signals simultaneously. Specifically, we design a two-branch network to model the two types of information: the first branch is a typical 3D CNN that extracts temporal appearance features from video clips, and the second is an audio CNN that encodes the Log Mel-Spectrogram of the audio signal. Finally, fusing the two feature streams produces a more accurate prediction. We conduct experiments on the SHADE dataset, a synthetic audio–visual dataset of surveillance scenes, and find that introducing audio signals effectively improves anomaly event detection performance and outperforms other state-of-the-art methods. Furthermore, we will release the code and pre-trained models as soon as possible.
KW - Anomaly events detection
KW - Audio–visual representation learning
KW - Crowd analysis
KW - Multi-modal learning
UR - http://www.scopus.com/inward/record.url?scp=85188596248&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2024.127489
DO - 10.1016/j.neucom.2024.127489
M3 - Article
AN - SCOPUS:85188596248
SN - 0925-2312
VL - 582
JO - Neurocomputing
JF - Neurocomputing
M1 - 127489
ER -