TY - JOUR
T1 - MixBAS
T2 - A Transformer-Based End-to-End Mixed Mono to Binaural Audio Synthesis Method
AU - Pan, Ningning
AU - Guo, Yuanxin
AU - Jin, Jilu
AU - Chen, Zhongpu
AU - Zhao, Yu
AU - Chen, Jingdong
AU - Benesty, Jacob
N1 - Publisher Copyright: © 2025 IEEE. All rights reserved.
PY - 2026
Y1 - 2026
AB - Binaural audio is essential for delivering immersive spatial auditory experiences through headsets. However, due to the high cost and complexity of binaural recording, there has been growing research interest in binaural audio synthesis (BAS) from monaural inputs. In natural listening environments, humans typically perceive multiple concurrent sound sources, yet most existing BAS approaches render each source independently, relying on perfect source signal separation, a condition rarely achievable in practice and often leading to perceptual quality degradation. To address this limitation, this paper proposes MixBAS, a transformer-based end-to-end multi-source mono-to-binaural synthesis framework that eliminates the need for explicit source separation. We design an asymmetric transformer that spatializes a mono mixture, which comprises both speech and non-speech components, into its binaural counterpart by incorporating a user-defined positional prompt for the non-speech source. When reproduced over headphones, the generated binaural audio enables listeners to perceive a high-quality speech signal along with a non-speech source rendered at a user-specified spatial location. Experimental results demonstrate that MixBAS significantly outperforms existing BAS baselines relying on source separation in both objective metrics and perceptual quality.
KW - Mono-to-binaural audio synthesis
KW - asymmetric transformer
KW - audio spatialization
KW - binaural audio rendering
KW - multi-source binaural audio synthesis
UR - https://www.scopus.com/pages/publications/105033708873
U2 - 10.1109/TASLPRO.2026.3675770
DO - 10.1109/TASLPRO.2026.3675770
M3 - Article
AN - SCOPUS:105033708873
SN - 1558-7916
VL - 34
SP - 1840
EP - 1852
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
ER -