TY - GEN
T1 - VISINGER
T2 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
AU - Zhang, Yongmao
AU - Cong, Jian
AU - Xue, Heyang
AU - Xie, Lei
AU - Zhu, Pengcheng
AU - Bi, Mengxiao
N1 - Publisher Copyright:
© 2022 IEEE
PY - 2022
Y1 - 2022
N2 - In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates singing audio from lyrics and a musical score. Our approach is inspired by VITS [1], an end-to-end speech generation model that adopts a VAE-based posterior encoder augmented with a normalizing-flow-based prior encoder and an adversarial decoder. VISinger follows the main architecture of VITS but makes substantial improvements to the prior encoder according to the characteristics of singing. First, instead of using the phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to obtain the frame-level mean and variance of acoustic features, modeling the rich acoustic variation in singing. Second, we further introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms the FastSpeech+NeuralVocoder two-stage approach and the oracle VITS; an ablation study demonstrates the effectiveness of the different contributions.
AB - In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates singing audio from lyrics and a musical score. Our approach is inspired by VITS [1], an end-to-end speech generation model that adopts a VAE-based posterior encoder augmented with a normalizing-flow-based prior encoder and an adversarial decoder. VISinger follows the main architecture of VITS but makes substantial improvements to the prior encoder according to the characteristics of singing. First, instead of using the phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to obtain the frame-level mean and variance of acoustic features, modeling the rich acoustic variation in singing. Second, we further introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms the FastSpeech+NeuralVocoder two-stage approach and the oracle VITS; an ablation study demonstrates the effectiveness of the different contributions.
KW - Singing voice synthesis
KW - adversarial learning
KW - normalizing flows
KW - variational autoencoder
UR - http://www.scopus.com/inward/record.url?scp=85134008601&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9747664
DO - 10.1109/ICASSP43922.2022.9747664
M3 - Conference contribution
AN - SCOPUS:85134008601
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 7237
EP - 7241
BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 May 2022 through 27 May 2022
ER -