Less Means More: Single Stream Audio-Visual Sound Source Localization via Shared-Parameter Network

Tianyu Liu, Peng Zhang, Junwen Xiong, Chuanyue Li, Yue Huo, Wei Huang, Yufei Zha, Lei Xie, Yanning Zhang

Research output: Contribution to journal › Article › peer-review

Abstract

Two-stream network architectures have been widely adopted in audio-visual learning, especially for sound source localization. Processing each modality separately, most current approaches establish audio-visual correlation by maximizing the cosine similarity between the representations of the two streams. However, the large number of inference parameters still limits further development of this scheme, mainly because the parameters of the modality-specific networks cannot be reused. Inspired by the mechanism of model averaging, this study proposes an Iterative Multi-Modal Parameters Fusion (IMP-Fusion) strategy that fuses the network parameters during the training phase. By integrating audio and visual knowledge into a unified architecture, a single-stream network is proposed to handle both modalities in the same time-round. Extensive experiments on challenging benchmarks validate its superior performance, even with only half of the inference parameters of other state-of-the-art works. As a plug-and-play mechanism, the proposed IMP-Fusion strategy also promises to benefit the design of future audio-visual networks.
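The two ideas the abstract names, cosine similarity between audio and visual representations and model-averaging-style parameter fusion, can be illustrated with a minimal sketch. This is not the authors' IMP-Fusion procedure, whose details the abstract does not give. All function names are hypothetical. The plain weighted average is an assumption that stands in for the iterative fusion step, and it further assumes that both networks share identical parameter names and shapes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def localization_map(audio_vec, visual_grid):
    """Score every spatial cell of a visual feature grid against one audio vector.

    visual_grid is a list of rows, each row a list of feature vectors.
    Higher scores mark cells more likely to contain the sound source.
    """
    return [[cosine_similarity(audio_vec, cell) for cell in row]
            for row in visual_grid]


def average_parameters(params_a, params_b, weight=0.5):
    """Element-wise weighted average of two parameter dictionaries.

    ASSUMPTION: a simple stand-in for parameter fusion, not IMP-Fusion itself.
    Each dictionary maps a parameter name to a flat list of floats.
    """
    if params_a.keys() != params_b.keys():
        raise ValueError("parameter names must match")
    fused = {}
    for name in params_a:
        if len(params_a[name]) != len(params_b[name]):
            raise ValueError(f"shape mismatch for {name}")
        fused[name] = [weight * x + (1.0 - weight) * y
                       for x, y in zip(params_a[name], params_b[name])]
    return fused
```

In this toy setting, a single fused parameter set would serve both modalities at inference time, which is the intuition behind the halved parameter count reported in the abstract.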

Keywords

  • Audio-visual
  • Model averaging
  • Network parameter fusion
  • Sound source localization
