Abstract
Sound source separation is usually challenged by the identification of intermittently silent sounding objects in realistic scenarios, especially when the sound generation is intrinsically related to variational motion, as in instrumental playing. Current solutions based on audio–visual learning have demonstrated the ability to enhance sound source separation through modality interaction, but the shortage of training samples caused by unreliable pair-wise annotation still limits further improvement of separation performance. Inspired by the sound–motion connection in realistic symphony scenarios, a two-stage audio–visual network is proposed in this work for instrumental audio separation. Alongside coarse-grained to fine-grained separation, a Patch-level Channel Audio–Visual Interaction (PCAVI) module is introduced to associate features across the audio and motion modalities. By designing a regularization loss to compensate for the imbalance of the audio–visual feature distribution, a novel circulant learning strategy is proposed to achieve more accurate and reliable performance. Extensive experiments demonstrate that the proposed two-stage audio–visual instrumental audio separation network outperforms motion-free methods on the MUSIC dataset, and achieves strong audio separation results in challenging real-world symphony scenarios.
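To make the patch-level channel interaction idea concrete, the following is a minimal sketch of how motion patch features might gate the channels of audio features. It is an illustrative assumption, not the authors' PCAVI implementation: the class name, layer choices, tensor shapes, and pooling over patches are all hypothetical.

```python
import torch
import torch.nn as nn


class PatchChannelAVInteraction(nn.Module):
    """Illustrative patch-level channel audio-visual interaction.

    Motion patch embeddings are mapped to per-channel weights that
    modulate the audio (spectrogram) feature channels. Shapes and
    layers are assumptions for illustration, not the paper's code.
    """

    def __init__(self, audio_channels: int, visual_dim: int):
        super().__init__()
        # Project each visual patch embedding to a per-channel score.
        self.proj = nn.Linear(visual_dim, audio_channels)

    def forward(self, audio_feat: torch.Tensor, patch_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, C, F, T) spectrogram features
        # patch_feat: (B, P, D) motion/visual patch embeddings
        scores = self.proj(patch_feat)           # (B, P, C)
        weights = torch.sigmoid(scores.mean(1))  # pool over patches -> (B, C)
        weights = weights.view(*weights.shape, 1, 1)
        # Channel-wise gating of the audio features by visual cues.
        return audio_feat * weights


if __name__ == "__main__":
    pcavi = PatchChannelAVInteraction(audio_channels=32, visual_dim=512)
    audio = torch.randn(2, 32, 256, 64)
    patches = torch.randn(2, 49, 512)
    out = pcavi(audio, patches)
    print(out.shape)  # torch.Size([2, 32, 256, 64])
```

Under this reading, the visual branch influences separation only through channel reweighting, which is one plausible form of the audio–motion association described in the abstract.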
| Original language | English |
| --- | --- |
| Article number | 128997 |
| Journal | Neurocomputing |
| Volume | 618 |
| DOI | |
| Publication status | Published - 14 Feb 2025 |