Abstract
Sound source separation in realistic scenarios is often hindered by the difficulty of identifying sounding objects that remain intermittently static, especially when sound generation is intrinsically tied to varying motion, as in instrumental playing. Current audio–visual learning methods can enhance sound sources through modality interaction, but the shortage of training samples caused by unreliable pair-wise annotation still limits further gains in separation performance. Inspired by the sound–motion connection in realistic symphony scenarios, this work proposes a two-stage audio–visual network for instrumental audio separation. Following a coarse-grained to fine-grained separation scheme, a Patch-level Channel Audio–Visual Interaction (PCAVI) module is introduced to elegantly associate audio and motion features. In addition, a novel circulant learning strategy, built on a regularization loss that compensates for the imbalance in audio–visual feature distributions, yields more accurate and reliable performance. Extensive experiments demonstrate that the proposed two-stage network outperforms motion-free methods on the MUSIC dataset and achieves strong audio separation results in challenging real-world symphony scenarios.
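The abstract does not specify the PCAVI equations; the following is only a minimal numpy sketch of one plausible patch-level channel interaction, in which time-pooled audio channels are gated by their affinity with visual motion-patch embeddings. All function and variable names here are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def pcavi_sketch(audio_feat, visual_patches):
    """Hypothetical patch-level channel audio-visual interaction.

    audio_feat:      (C, T) audio feature map (C channels, T frames)
    visual_patches:  (P, C) P motion-patch embeddings with channel dim C
    Returns the audio features modulated by per-channel visual gates.
    """
    # Time-pool the audio to get one context value per channel.
    audio_ctx = audio_feat.mean(axis=1)                    # (C,)
    # Affinity of each visual patch with each audio channel.
    affinity = visual_patches * audio_ctx                  # (P, C)
    # Pool over patches, squash to per-channel gates in (0, 1).
    gates = 1.0 / (1.0 + np.exp(-affinity.mean(axis=0)))   # (C,)
    # Re-weight audio channels by the visually derived gates.
    return audio_feat * gates[:, None]                     # (C, T)

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 32))   # 16 channels, 32 frames
V = rng.standard_normal((8, 16))    # 8 patches, 16-dim embeddings
out = pcavi_sketch(A, V)
print(out.shape)  # (16, 32)
```

The sigmoid gating keeps the modulation bounded, so the visual stream can only attenuate or pass audio channels rather than amplify them without limit; other designs (e.g. softmax attention over patches) would serve the same coarse purpose.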
| Field | Value |
|---|---|
| Original language | English |
| Article number | 128997 |
| Journal | Neurocomputing |
| Volume | 618 |
| DOIs | |
| State | Published - 14 Feb 2025 |
Keywords
- Audio–visual
- Circulant learning
- Sound source separation
- Sound-motion
Title
Audio–visual correspondences based joint learning for instrumental playing source separation