Abstract
Binaural audio is essential for delivering immersive spatial auditory experiences through headsets. However, due to the high cost and complexity of binaural recording, there has been growing research interest in binaural audio synthesis (BAS) from monaural inputs. In natural listening environments, humans typically perceive multiple concurrent sound sources, yet most existing BAS approaches render each source independently and therefore rely on perfect source separation, a condition that is rarely achievable in practice and whose failure often degrades perceptual quality. To address this limitation, this paper proposes MixBAS, a transformer-based, end-to-end, multi-source mono-to-binaural synthesis framework that eliminates the need for explicit source separation. We design an asymmetric transformer that spatializes a mono mixture, comprising both speech and non-speech components, into its binaural counterpart by incorporating a user-defined positional prompt for the non-speech source. When reproduced over headphones, the generated binaural audio lets listeners perceive a high-quality speech signal together with a non-speech source rendered at the user-specified spatial location. Experimental results demonstrate that MixBAS significantly outperforms existing separation-based BAS baselines in both objective metrics and perceptual quality.
| Original language | English |
|---|---|
| Pages (from-to) | 1840-1852 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Audio, Speech and Language Processing |
| Volume | 34 |
| State | Published - 2026 |
Keywords
- Mono-to-binaural audio synthesis
- asymmetric transformer
- audio spatialization
- binaural audio rendering
- multi-source binaural audio synthesis
Title
MixBAS: A Transformer-Based End-to-End Mixed Mono to Binaural Audio Synthesis Method