TY - GEN
T1 - VISINGER
T2 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
AU - Zhang, Yongmao
AU - Cong, Jian
AU - Xue, Heyang
AU - Xie, Lei
AU - Zhu, Pengcheng
AU - Bi, Mengxiao
N1 - Publisher Copyright:
© 2022 IEEE
PY - 2022
Y1 - 2022
N2 - In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates singing audio from lyrics and a musical score. Our approach is inspired by VITS [1], an end-to-end speech generation model that adopts a VAE-based posterior encoder augmented with a normalizing-flow-based prior encoder and an adversarial decoder. VISinger follows the main architecture of VITS but makes substantial improvements to the prior encoder according to the characteristics of singing. First, instead of using the phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to obtain the frame-level mean and variance of acoustic features, modeling the rich acoustic variation in singing. Second, we further introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms the FastSpeech+NeuralVocoder two-stage approach and the oracle VITS; an ablation study demonstrates the effectiveness of the different contributions.
AB - In this paper, we propose VISinger, a complete end-to-end high-quality singing voice synthesis (SVS) system that directly generates singing audio from lyrics and a musical score. Our approach is inspired by VITS [1], an end-to-end speech generation model that adopts a VAE-based posterior encoder augmented with a normalizing-flow-based prior encoder and an adversarial decoder. VISinger follows the main architecture of VITS but makes substantial improvements to the prior encoder according to the characteristics of singing. First, instead of using the phoneme-level mean and variance of acoustic features, we introduce a length regulator and a frame prior network to obtain the frame-level mean and variance of acoustic features, modeling the rich acoustic variation in singing. Second, we further introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, to improve the singing rhythm, we modify the duration predictor to specifically predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms the FastSpeech+NeuralVocoder two-stage approach and the oracle VITS; an ablation study demonstrates the effectiveness of the different contributions.
KW - Singing voice synthesis
KW - adversarial learning
KW - normalizing flows
KW - variational autoencoder
UR - http://www.scopus.com/inward/record.url?scp=85134008601&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9747664
DO - 10.1109/ICASSP43922.2022.9747664
M3 - Conference contribution
AN - SCOPUS:85134008601
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 7237
EP - 7241
BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 May 2022 through 27 May 2022
ER -