TY - GEN
T1 - VITS-Based Singing Voice Conversion Leveraging Whisper and Multi-Scale F0 Modeling
AU - Ning, Ziqian
AU - Jiang, Yuepeng
AU - Wang, Zhichao
AU - Zhang, Bin
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - This paper introduces the T23 team's system submitted to the Singing Voice Conversion Challenge 2023. Following the recognition-synthesis framework, our singing conversion model is based on VITS, incorporating four key modules: a prior encoder, a posterior encoder, a decoder, and a parallel bank of transposed convolutions (PBTC) module. We particularly leverage Whisper, a powerful pre-trained ASR model, to extract bottleneck features (BNF) as the input of the prior encoder. Before BNF extraction, we perform pitch perturbation to the source signal to remove speaker timbre, which effectively avoids the leakage of the source speaker timbre to the target. Moreover, the PBTC module extracts multi-scale F0 as the auxiliary input to the prior encoder, thereby capturing better pitch variations of singing. We design a three-stage training strategy to better adapt the base model to the target speaker with limited target speaker data. Official challenge results show that our system has superior performance in naturalness, ranking 1st and 2nd respectively in Task 1 and 2. Further ablation justifies the effectiveness of our system design.
KW - F0-modeling
KW - Singing voice conversion
KW - VITS
UR - http://www.scopus.com/inward/record.url?scp=85184666023&partnerID=8YFLogxK
U2 - 10.1109/ASRU57964.2023.10389740
DO - 10.1109/ASRU57964.2023.10389740
M3 - Conference contribution
AN - SCOPUS:85184666023
T3 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
BT - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Y2 - 16 December 2023 through 20 December 2023
ER -