TY - GEN
T1 - MBTFNet
T2 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
AU - Xu, Weiming
AU - Chen, Zhouxuan
AU - Tan, Zhili
AU - Lv, Shubo
AU - Han, Runduo
AU - Zhou, Wenjiang
AU - Zhao, Weifeng
AU - Xie, Lei
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - A typical neural speech enhancement (SE) approach mainly handles mixtures of speech and noise, which is not optimal for singing voice enhancement, where singing is often mixed with vocal-correlated accompaniment and differs substantially from speech. Music source separation (MSS) models treat vocals and the various accompaniment components equally, which may reduce performance compared with a model that considers only vocal enhancement. In this paper, we propose a novel multi-band temporal-frequency neural network (MBTFNet) for singing voice enhancement, which specifically removes background music, noise, and even backing vocals from singing recordings. MBTFNet combines inter- and intra-band modeling for better processing of full-band signals. Dual-path modeling along the temporal and frequency axes and temporal dilation blocks are introduced to expand the receptive field of the model. Specifically for removing backing vocals, we propose an implicit personalized enhancement (IPE) stage based on signal-to-noise ratio (SNR) estimation, which further improves the performance of MBTFNet. Experiments show that the proposed model significantly outperforms several state-of-the-art SE and MSS models.
KW - implicit personalized enhancement
KW - MBTFNet
KW - singing-voice enhancement
UR - http://www.scopus.com/inward/record.url?scp=85184664663&partnerID=8YFLogxK
U2 - 10.1109/ASRU57964.2023.10389670
DO - 10.1109/ASRU57964.2023.10389670
M3 - Conference contribution
AN - SCOPUS:85184664663
T3 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
BT - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 December 2023 through 20 December 2023
ER -