
Less Means More: Single Stream Audio-Visual Sound Source Localization via Shared-Parameter Network

  • Northwestern Polytechnical University Xian
  • Yichun University

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

Two-stream networks have been widely adopted in audio-visual learning, especially for sound source localization. Most current approaches process the two modalities separately and establish the audio-visual correlation by maximizing the cosine similarity between the representations from the two streams. However, the large number of inference parameters limits further development of this scheme, mainly because the parameters of the modality-specific networks cannot be reused. Inspired by the mechanism of model averaging, this study proposes an Iterative Multi-Modal Parameters Fusion (IMP-Fusion) strategy that fuses the network parameters during the training phase. By integrating audio and visual knowledge into a unified architecture, a single-stream network is obtained that handles both modalities in the same time-round. Extensive experiments on challenging benchmarks validate its superior performance, even with only half the inference parameters of other state-of-the-art works. As a plug-and-play mechanism, the proposed IMP-Fusion strategy is also promising for the design of future audio-visual networks.
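The two ideas the abstract builds on, cosine-similarity matching between audio and visual representations and model averaging of network parameters, can be illustrated with a minimal sketch. This is not the authors' IMP-Fusion procedure, whose details the abstract does not give. The function names, the flat-list parameter layout, and the fixed averaging weight are all assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def localization_map(audio_vec, visual_grid):
    """Score every spatial visual feature against one audio embedding.

    visual_grid is a hypothetical H x W grid of feature vectors; high
    scores mark locations whose features align with the audio.
    """
    return [
        [cosine_similarity(audio_vec, feat) for feat in row]
        for row in visual_grid
    ]


def average_parameters(params_a, params_b, weight=0.5):
    """Plain weighted model averaging of two parameter sets.

    Both inputs map a parameter name to a flat list of floats and must
    share the same names and lengths (assumed layout, not the paper's).
    """
    if params_a.keys() != params_b.keys():
        raise ValueError("parameter sets must share the same names")
    return {
        name: [
            weight * a + (1.0 - weight) * b
            for a, b in zip(params_a[name], params_b[name])
        ]
        for name in params_a
    }
```

In a two-stream setup, `localization_map` would be fed embeddings from two separate encoders. Averaging only makes sense when both networks have an identical architecture, which is the precondition for collapsing them into a single shared-parameter stream.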

Original language: English
Pages (from-to): 4350-4360
Number of pages: 11
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 33
DOI
Publication status: Published - 2025
