Audio–visual correspondences based joint learning for instrumental playing source separation

Tianyu Liu, Peng Zhang, Siliang Wang, Wei Huang, Yufei Zha, Yanning Zhang

Research output: Contribution to journal › Article › peer-review

Abstract

Sound source separation is usually challenged by the identification of intermittently static sounding objects in realistic scenarios, especially when the sound generation is intrinsically related to variational motion, as in instrumental playing. Current solutions based on audio–visual learning have demonstrated a capability to enhance the sound source through modality interaction, but the training-sample deficiency caused by unreliable pair-wise annotation still limits further improvement of separation performance. Inspired by the sound-motion connection in realistic symphony scenarios, a two-stage audio–visual network is proposed in this work for instrumental audio separation. Along with coarse-grained to fine-grained separation, a Patch-level Channel Audio–Visual Interaction (PCAVI) module is introduced to elegantly associate features of the audio and motion modalities. By designing a regularization loss to compensate for the imbalance of the audio–visual feature distribution, a novel circulant learning scheme is proposed to achieve more accurate and reliable performance. Extensive experiments demonstrate that the proposed two-stage audio–visual instrumental audio separation network outperforms motion-free methods on the MUSIC dataset and achieves strong audio separation results in challenging real-world symphony scenarios.
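The abstract describes patch-level, channel-wise interaction between audio and motion features but gives no implementation details. The following is a minimal, hypothetical sketch of one common way such an interaction can be realized (visual patch features gating audio feature channels); the class name, tensor shapes, and gating scheme are illustrative assumptions and are not taken from the paper's PCAVI module.

# Hypothetical illustration only: NOT the paper's PCAVI module.
# Visual (motion) patch features are pooled and turned into per-channel
# gates that re-weight the audio feature map (cross-modal channel attention).
import torch
import torch.nn as nn


class PatchChannelInteraction(nn.Module):
    """Re-weights audio feature channels using motion (visual) patch features."""

    def __init__(self, audio_channels: int, visual_dim: int):
        super().__init__()
        # Project pooled visual patch features to per-channel gates for audio.
        self.gate = nn.Sequential(
            nn.Linear(visual_dim, audio_channels),
            nn.Sigmoid(),
        )

    def forward(self, audio_feat: torch.Tensor, visual_patches: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, C, T, F) spectrogram features
        # visual_patches: (B, N, D) motion features over N patches
        pooled = visual_patches.mean(dim=1)           # (B, D) average over patches
        gates = self.gate(pooled)                     # (B, C) channel-wise gates
        return audio_feat * gates[:, :, None, None]   # modulate audio channels


if __name__ == "__main__":
    module = PatchChannelInteraction(audio_channels=32, visual_dim=128)
    audio = torch.randn(2, 32, 64, 64)
    visual = torch.randn(2, 49, 128)
    print(module(audio, visual).shape)  # torch.Size([2, 32, 64, 64])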

Original language: English
Article number: 128997
Journal: Neurocomputing
Volume: 618
DOI
Publication status: Published - 14 Feb 2025
