TY - JOUR
T1 - Audio denoising and audio-visual localization for unmanned aerial vehicles
AU - Li, Zhaojian
AU - Li, Jianbo
AU - Huang, Dengdian
AU - Jiao, Jianbin
AU - Li, Jingzhu
AU - Zhao, Bin
N1 - Publisher Copyright: © 2026 Elsevier Ltd
PY - 2026/11
Y1 - 2026/11
AB - The unmanned aerial vehicle (UAV) is increasingly drawing attention for its broad potential in disaster response, public safety, and intelligent monitoring. However, noise interference from the UAV poses significant challenges to its audio-visual perception capabilities. To address these challenges, we introduce a novel task that integrates UAV audio-visual denoising and localization. To facilitate this research, we collect and construct a UAV audio-visual dataset in real-world environments. The dataset comprises audio and video captured by the UAV, along with synchronized ground-based audio, providing a high-quality audio-visual benchmark for this task. We propose a visually guided audio denoising (VGAD) model, which generates a noise suppression mask through visual guidance to effectively attenuate UAV noise. To alleviate the perceptual similarity bias caused by single-anchor modeling, we propose an audio-visual anchor interaction (AVAI) localization model composed of an audio anchor localization (AAL) module and a visual anchor localization (VAL) module. The two modules leverage unsupervised dual contrastive learning to comprehensively capture perceptual similarities between the audio and visual modalities, thereby enhancing cross-modal semantic consistency and improving audio-visual localization performance. Extensive experiments on UAV audio-visual denoising and localization demonstrate that the proposed models significantly suppress UAV noise and improve localization performance. This work is the first to extend audio-visual localization to UAV scenarios, facilitating the advancement of UAV multimodal perception.
KW - Audio denoising
KW - Audio-visual localization
KW - Multimodal perception
KW - Unmanned aerial vehicle
UR - https://www.scopus.com/pages/publications/105038680595
U2 - 10.1016/j.patcog.2026.113931
DO - 10.1016/j.patcog.2026.113931
M3 - Article
AN - SCOPUS:105038680595
SN - 0031-3203
VL - 179
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 113931
ER -