TY - JOUR
T1 - Less Means More
T2 - Single Stream Audio-Visual Sound Source Localization via Shared-Parameter Network
AU - Liu, Tianyu
AU - Zhang, Peng
AU - Xiong, Junwen
AU - Li, Chuanyue
AU - Huo, Yue
AU - Huang, Wei
AU - Zha, Yufei
AU - Xie, Lei
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - The two-stream network architecture has been widely adopted for audio-visual learning, especially sound source localization. Processing the modalities separately, most current approaches establish audio-visual correlation by maximizing the cosine similarity between representations from the two streams. Unfortunately, the large number of inference parameters still limits further development of this scheme, mainly because the parameters of modality-specific networks cannot be reused. Inspired by the mechanism of model averaging, this study proposes an Iterative Multi-Modal Parameters Fusion (IMP-Fusion) strategy to fuse network parameters during the training phase. By integrating audio and visual knowledge into a unified architecture, a single-stream network is proposed to handle both modalities simultaneously. Extensive experiments on challenging benchmarks validate superior performance, even with only half the inference parameters of other state-of-the-art works. As a plug-and-play mechanism, the proposed IMP-Fusion strategy is also promising for the design of future audio-visual networks.
AB - The two-stream network architecture has been widely adopted for audio-visual learning, especially sound source localization. Processing the modalities separately, most current approaches establish audio-visual correlation by maximizing the cosine similarity between representations from the two streams. Unfortunately, the large number of inference parameters still limits further development of this scheme, mainly because the parameters of modality-specific networks cannot be reused. Inspired by the mechanism of model averaging, this study proposes an Iterative Multi-Modal Parameters Fusion (IMP-Fusion) strategy to fuse network parameters during the training phase. By integrating audio and visual knowledge into a unified architecture, a single-stream network is proposed to handle both modalities simultaneously. Extensive experiments on challenging benchmarks validate superior performance, even with only half the inference parameters of other state-of-the-art works. As a plug-and-play mechanism, the proposed IMP-Fusion strategy is also promising for the design of future audio-visual networks.
KW - Audio-visual
KW - Model averaging
KW - Network parameter fusion
KW - Sound source localization
UR - https://www.scopus.com/pages/publications/105018727744
U2 - 10.1109/TASLPRO.2025.3619850
DO - 10.1109/TASLPRO.2025.3619850
M3 - Article
AN - SCOPUS:105018727744
SN - 1558-7916
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
ER -