Abstract
Unmanned aerial vehicles (UAVs) are drawing increasing attention for their broad potential in disaster response, public safety, and intelligent monitoring. However, noise interference from the UAV poses significant challenges to its audio-visual perception capabilities. To address these challenges, we introduce a novel task that integrates UAV audio-visual denoising and localization. To facilitate this research, we collect and construct a UAV audio-visual dataset in real-world environments. The dataset comprises audio and video captured by the UAV, along with synchronized ground-based audio, providing a high-quality audio-visual benchmark for this task. We propose a visually guided audio denoising (VGAD) model, which uses visual guidance to generate a noise suppression mask that attenuates UAV noise. To alleviate the perceptual similarity bias caused by single-anchor modeling, we propose an audio-visual anchor interaction (AVAI) localization model composed of an audio anchor localization (AAL) module and a visual anchor localization (VAL) module. The two modules use unsupervised dual contrastive learning to capture perceptual similarities between the audio and visual modalities from both anchor directions, thereby enhancing cross-modal semantic consistency and improving audio-visual localization performance. Extensive experiments on UAV audio-visual denoising and localization demonstrate that the proposed models significantly suppress UAV noise and improve localization performance. This work is the first to extend audio-visual localization to UAV scenarios, facilitating the advancement of UAV multimodal perception.
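The abstract does not spell out the loss used by the AAL and VAL modules, so the following is only a minimal sketch of the general idea of contrastive learning with two anchor directions. It assumes a generic symmetric InfoNCE objective over paired audio and visual embeddings; the function names, the `temperature` value, and the equal weighting of the two terms are illustrative assumptions, not the paper's formulation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _info_nce(anchors, candidates, temperature):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [
            sum(a * c for a, c in zip(anchor, cand)) / temperature
            for cand in candidates
        ]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def dual_anchor_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Hypothetical two-direction loss: audio as anchor, then visual as anchor.

    Assumption: audio_emb[i] and visual_emb[i] come from the same clip.
    """
    audio = [_normalize(v) for v in audio_emb]
    visual = [_normalize(v) for v in visual_emb]
    audio_anchor_term = _info_nce(audio, visual, temperature)
    visual_anchor_term = _info_nce(visual, audio, temperature)
    # Equal weighting is an illustrative choice, not taken from the source.
    return 0.5 * (audio_anchor_term + visual_anchor_term)
```

In this sketch, correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized from both directions, which is the intuition behind avoiding a bias toward a single anchor modality.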
| Original language | English |
|---|---|
| Article number | 113931 |
| Journal | Pattern Recognition |
| Volume | 179 |
| DOIs | |
| State | Published - Nov 2026 |
Keywords
- Audio denoising
- Audio-visual localization
- Multimodal perception
- Unmanned aerial vehicle
Fingerprint
Dive into the research topics of 'Audio denoising and audio-visual localization for unmanned aerial vehicles'. Together they form a unique fingerprint.