
Less Means More: Single Stream Audio-Visual Sound Source Localization via Shared-Parameter Network

  • Northwestern Polytechnical University, Xi'an
  • Yichun University

Research output: Contribution to journal › Article › peer-review

Abstract

The two-stream network architecture has been widely adopted in audio-visual learning, especially for sound source localization. Processing the two modalities separately, most current approaches establish the audio-visual correlation by maximizing the cosine similarity between representations from the two streams. Unfortunately, the large number of inference parameters still limits this scheme from developing further, mainly because the parameters of the modality-specific networks cannot be reused. Inspired by the mechanism of model averaging, this study proposes an Iterative Multi-Modal Parameters Fusion (IMP-Fusion) strategy to fuse the network parameters during the training phase. By integrating audio and visual knowledge into a unified architecture, a single-stream network is obtained that handles both modalities within the same round. Extensive experiments on challenging benchmarks validate its superior performance, even with only half the inference parameters of other state-of-the-art works. As a plug-and-play mechanism, the proposed IMP-Fusion strategy is also promising for the design of future audio-visual networks.
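To make the model-averaging idea behind IMP-Fusion concrete, the following is a minimal, hypothetical sketch: the parameters of an audio stream and a visual stream are iteratively averaged into one shared set, which both streams then resume training from. All names, the fusion coefficient, and the dict-of-scalars parameter representation are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of iterative multi-modal parameter fusion via
# model averaging. Parameters are modeled as dicts of scalars for
# simplicity; a real network would use tensors per layer.

def fuse_parameters(audio_params, visual_params, alpha=0.5):
    """Per-key weighted average of the two streams' parameters,
    yielding a single shared parameter set."""
    assert audio_params.keys() == visual_params.keys()
    return {
        k: alpha * audio_params[k] + (1.0 - alpha) * visual_params[k]
        for k in audio_params
    }

def iterative_fusion(audio_params, visual_params, rounds=3, alpha=0.5):
    """Repeat the fusion over training rounds: after each fusion both
    streams restart from the shared weights, so they converge toward
    one shared-parameter (single-stream) network."""
    shared = fuse_parameters(audio_params, visual_params, alpha)
    for _ in range(rounds - 1):
        # In training, each stream would be updated on its own modality
        # here before the next fusion step; that update is omitted.
        audio_params = dict(shared)
        visual_params = dict(shared)
        shared = fuse_parameters(audio_params, visual_params, alpha)
    return shared

audio = {"w1": 1.0, "w2": -2.0}
visual = {"w1": 3.0, "w2": 0.0}
print(iterative_fusion(audio, visual))  # → {'w1': 2.0, 'w2': -1.0}
```

With equal weighting (alpha = 0.5) and no per-modality updates between rounds, the fused parameters are simply the elementwise mean, and later rounds leave them unchanged; the interesting behavior in practice comes from interleaving modality-specific training between fusion steps.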

Original language: English
Pages (from-to): 4350-4360
Number of pages: 11
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 33
State: Published - 2025

Keywords

  • Audio-visual
  • model averaging
  • network parameter fusion
  • sound source localization
