TY - GEN
T1 - VITS-Based Singing Voice Conversion Leveraging Whisper and Multi-Scale F0 Modeling
AU - Ning, Ziqian
AU - Jiang, Yuepeng
AU - Wang, Zhichao
AU - Zhang, Bin
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - This paper introduces the T23 team's system submitted to the Singing Voice Conversion Challenge 2023. Following the recognition-synthesis framework, our singing conversion model is based on VITS, incorporating four key modules: a prior encoder, a posterior encoder, a decoder, and a parallel bank of transposed convolutions (PBTC) module. We particularly leverage Whisper, a powerful pre-trained ASR model, to extract bottleneck features (BNF) as the input of the prior encoder. Before BNF extraction, we perform pitch perturbation to the source signal to remove speaker timbre, which effectively avoids the leakage of the source speaker timbre to the target. Moreover, the PBTC module extracts multi-scale F0 as the auxiliary input to the prior encoder, thereby capturing better pitch variations of singing. We design a three-stage training strategy to better adapt the base model to the target speaker with limited target speaker data. Official challenge results show that our system has superior performance in naturalness, ranking 1st and 2nd respectively in Task 1 and 2. Further ablation justifies the effectiveness of our system design.
KW - F0-modeling
KW - Singing voice conversion
KW - VITS
UR - http://www.scopus.com/inward/record.url?scp=85184666023&partnerID=8YFLogxK
U2 - 10.1109/ASRU57964.2023.10389740
DO - 10.1109/ASRU57964.2023.10389740
M3 - Conference contribution
AN - SCOPUS:85184666023
T3 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
BT - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Y2 - 16 December 2023 through 20 December 2023
ER -