TY - JOUR
T1 - Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
AU - Lei, Yi
AU - Yang, Shan
AU - Cong, Jian
AU - Xie, Lei
AU - Su, Dan
N1 - Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
AB - The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting to new voices in the zero-shot scenario exist in both stages - acoustic modeling and vocoding - previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem at both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model for extracting the latent distribution p(z) of speech and reconstructing the waveform from it. A flow-based acoustic model then only needs to learn the same p(z) from text, which naturally avoids the mismatch between the acoustic model and the vocoder, resulting in high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, and thus we can further conduct high-quality zero-shot speech generation for new speakers. We particularly investigate two methods to construct the speaker space, namely a pre-trained speaker encoder and a jointly-trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments conducted on the LibriTTS and VCTK corpora.
KW - flow model
KW - speech synthesis
KW - variational auto-encoder
KW - voice conversion
KW - Zero-shot
UR - http://www.scopus.com/inward/record.url?scp=85140078056&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-684
DO - 10.21437/Interspeech.2022-684
M3 - Conference article
AN - SCOPUS:85140078056
SN - 2308-457X
VL - 2022-September
SP - 2563
EP - 2567
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 18 September 2022 through 22 September 2022
ER -