TY - GEN
T1 - Glow-WaveGAN
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
AU - Cong, Jian
AU - Yang, Shan
AU - Xie, Lei
AU - Su, Dan
N1 - Publisher Copyright:
© 2021 ISCA
PY - 2021
Y1 - 2021
N2 - The current two-stage TTS framework typically integrates an acoustic model with a vocoder - the acoustic model predicts a low-resolution intermediate representation such as a Mel-spectrogram, while the vocoder generates the waveform from the intermediate representation. Although the intermediate representation serves as a bridge, a critical mismatch still exists between the acoustic model and the vocoder, as they are commonly learned separately and work on different distributions of the representation, leading to inevitable artifacts in the synthesized speech. In this work, instead of using a pre-designed intermediate representation as in most previous studies, we propose to use a VAE combined with a GAN to learn a latent representation directly from speech, and then utilize a flow-based acoustic model to model the distribution of the latent representation from text. In this way, the mismatch problem is mitigated, as the two stages work on the same distribution. Results demonstrate that the flow-based acoustic model can exactly model the distribution of our learned speech representation, and that the proposed TTS framework, namely Glow-WaveGAN, can produce high-fidelity speech, outperforming the state-of-the-art GAN-based model.
AB - The current two-stage TTS framework typically integrates an acoustic model with a vocoder - the acoustic model predicts a low-resolution intermediate representation such as a Mel-spectrogram, while the vocoder generates the waveform from the intermediate representation. Although the intermediate representation serves as a bridge, a critical mismatch still exists between the acoustic model and the vocoder, as they are commonly learned separately and work on different distributions of the representation, leading to inevitable artifacts in the synthesized speech. In this work, instead of using a pre-designed intermediate representation as in most previous studies, we propose to use a VAE combined with a GAN to learn a latent representation directly from speech, and then utilize a flow-based acoustic model to model the distribution of the latent representation from text. In this way, the mismatch problem is mitigated, as the two stages work on the same distribution. Results demonstrate that the flow-based acoustic model can exactly model the distribution of our learned speech representation, and that the proposed TTS framework, namely Glow-WaveGAN, can produce high-fidelity speech, outperforming the state-of-the-art GAN-based model.
KW - Generative adversarial network
KW - Speech representations
KW - Speech synthesis
KW - Variational auto-encoder
UR - http://www.scopus.com/inward/record.url?scp=85118309081&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-414
DO - 10.21437/Interspeech.2021-414
M3 - Conference contribution
AN - SCOPUS:85118309081
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3241
EP - 3245
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -