VITS-Based Singing Voice Conversion Leveraging Whisper and Multi-Scale F0 Modeling

Ziqian Ning, Yuepeng Jiang, Zhichao Wang, Bin Zhang, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Citations (Scopus)

Abstract

This paper introduces the T23 team's system submitted to the Singing Voice Conversion Challenge 2023. Following the recognition-synthesis framework, our singing voice conversion model is based on VITS and incorporates four key modules: a prior encoder, a posterior encoder, a decoder, and a parallel bank of transposed convolutions (PBTC) module. In particular, we leverage Whisper, a powerful pre-trained ASR model, to extract bottleneck features (BNF) as the input to the prior encoder. Before BNF extraction, we apply pitch perturbation to the source signal to remove speaker timbre, which effectively prevents the source speaker's timbre from leaking into the target. Moreover, the PBTC module extracts multi-scale F0 as an auxiliary input to the prior encoder, thereby better capturing the pitch variations of singing. We design a three-stage training strategy to better adapt the base model to the target speaker when target speaker data is limited. Official challenge results show that our system achieves superior naturalness, ranking 1st in Task 1 and 2nd in Task 2. Further ablation studies justify the effectiveness of our system design.
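
To make the multi-scale F0 idea in the abstract concrete, here is a minimal sketch of a parallel bank of transposed convolutions in plain Python. It is an illustration under assumptions, not the authors' implementation. The abstract does not say how the parallel branches differ, so the sketch assumes they differ by dilation rate. It also uses a fixed smoothing kernel where a real model would learn the weights, and it takes a scalar per-frame F0 contour as input. The names `transposed_conv1d` and `multi_scale_f0` are hypothetical.

```python
def transposed_conv1d(x, kernel, dilation):
    """Stride-1 transposed convolution over a 1-D sequence.

    Each input frame scatters a scaled copy of the kernel into the output,
    with kernel taps spaced `dilation` frames apart. The result is
    centre-cropped so it has the same length as the input.
    """
    span = (len(kernel) - 1) * dilation
    full = [0.0] * (len(x) + span)
    for t, value in enumerate(x):
        for k, weight in enumerate(kernel):
            full[t + k * dilation] += value * weight
    start = span // 2
    return full[start:start + len(x)]


def multi_scale_f0(f0, kernel=(0.25, 0.5, 0.25), dilations=(1, 2, 4)):
    """Run the same F0 contour through parallel branches (assumed here to
    differ only in dilation) and stack the branch outputs per frame."""
    branches = [transposed_conv1d(f0, kernel, d) for d in dilations]
    return [list(frame) for frame in zip(*branches)]


# A single impulse shows the temporal reach of each branch.
impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
features = multi_scale_f0(impulse)
for frame_index, frame in enumerate(features):
    print(frame_index, frame)
```

Each output frame holds one value per branch. With the impulse input, the branch with dilation 1 spreads the pitch event over adjacent frames, while the branch with dilation 4 spreads it four frames to either side, so the stacked features describe the same contour at several time scales.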

Original language: English
Title of host publication: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350306897
DOI
Publication status: Published - 2023
Event: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 - Taipei, Taiwan
Duration: 16 Dec 2023 to 20 Dec 2023

Publication series

Name: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

Conference

Conference: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Country/Territory: Taiwan
City: Taipei
Period: 16/12/23 to 20/12/23
