A new GAN-based end-to-end TTS training algorithm

Haohan Guo; Frank K. Soong; Lei He; Lei Xie

doi:10.21437/Interspeech.2019-2176

School of Computer Science

Research output: Contribution to journal › Conference article › Peer-reviewed

20 citations (Scopus)

Abstract

End-to-end, autoregressive model-based TTS has shown significant performance improvements over the conventional ones. However, the autoregressive module training is affected by the exposure bias, or the mismatch between different distributions of real and predicted data. While real data is provided in training, in testing, predicted data is available only. By introducing both real and generated data sequences in training, we can alleviate the effects of the exposure bias. We propose to use Generative Adversarial Network (GAN) along with the idea of "Professor Forcing" in training. A discriminator in GAN is jointly trained to equalize the difference between real and the predicted data. In AB subjective listening test, the results show that the new approach is preferred over the standard transfer learning with a CMOS improvement of 0.1. Sentence level intelligibility tests also show significant improvement in a pathological test set. The GAN-trained new model is shown more stable than the baseline to produce better alignments for the Tacotron output.
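
The exposure bias described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy sequence, the one-step predictor `step`, and the helper names are all hypothetical, and the loss functions are the standard GAN objective, which may differ from the loss used in the paper. The sketch only shows that a model fed its own predictions (inference) drifts further from the real data than the same model fed real frames (training), and how a discriminator score could be turned into a training signal that penalizes that gap.

```python
import math


def rollout(step, real, teacher_forced):
    """Run a one-step predictor over a sequence.

    teacher_forced=True  -> each step sees the real previous frame (training).
    teacher_forced=False -> each step sees its own previous output (inference).
    """
    prev, outputs = real[0], []
    for t in range(1, len(real)):
        pred = step(prev)
        outputs.append(pred)
        prev = real[t] if teacher_forced else pred
    return outputs


def mean_abs_error(pred, target):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(pred, target)) / len(pred)


def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator loss (assumed here, not taken from the paper).

    d_real / d_fake are the discriminator's probabilities that the
    teacher-forced and free-running sequences, respectively, are "real".
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_adversarial_loss(d_fake):
    """Standard non-saturating generator loss: reward fooling the discriminator."""
    return -math.log(d_fake)


# Hypothetical toy data: a geometric decay standing in for acoustic frames.
real = [0.9 ** t for t in range(20)]


def step(prev):
    """Hypothetical imperfect model: its decay factor is 0.95 instead of 0.9."""
    return 0.95 * prev


tf_out = rollout(step, real, teacher_forced=True)
fr_out = rollout(step, real, teacher_forced=False)

tf_err = mean_abs_error(tf_out, real[1:])
fr_err = mean_abs_error(fr_out, real[1:])

print(f"teacher-forced error: {tf_err:.4f}")
print(f"free-running error:   {fr_err:.4f}")
```

In the teacher-forced rollout each one-step error is corrected by the next real frame, while in the free-running rollout the errors compound, so `fr_err` exceeds `tf_err`. A Professor Forcing style setup would add something like `generator_adversarial_loss` to the usual reconstruction loss so that the two modes become harder for the discriminator to tell apart.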

Original language	English
Pages (from-to)	1288-1292
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2019-September
DOI	https://doi.org/10.21437/Interspeech.2019-2176
Publication status	Published - 2019
Event	20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria. Duration: 15 Sep 2019 → 19 Sep 2019

Cite this

@article{042d93ac747646fea0fe816655bb9835,

title = "A new GAN-based end-to-end TTS training algorithm",

abstract = "End-to-end, autoregressive model-based TTS has shown significant performance improvements over the conventional ones. However, the autoregressive module training is affected by the exposure bias, or the mismatch between different distributions of real and predicted data. While real data is provided in training, in testing, predicted data is available only. By introducing both real and generated data sequences in training, we can alleviate the effects of the exposure bias. We propose to use Generative Adversarial Network (GAN) along with the idea of ``Professor Forcing'' in training. A discriminator in GAN is jointly trained to equalize the difference between real and the predicted data. In AB subjective listening test, the results show that the new approach is preferred over the standard transfer learning with a CMOS improvement of 0.1. Sentence level intelligibility tests also show significant improvement in a pathological test set. The GAN-trained new model is shown more stable than the baseline to produce better alignments for the Tacotron output.",

keywords = "Adversarial training, Auto-regressive model, End-to-end TTS synthesis, Generative adversarial model, Speech synthesis",

author = "Haohan Guo and Soong, {Frank K.} and Lei He and Lei Xie",

note = "Publisher Copyright: Copyright {\textcopyright} 2019 ISCA; 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 ; Conference date: 15-09-2019 Through 19-09-2019",

year = "2019",

doi = "10.21437/Interspeech.2019-2176",

language = "English",

volume = "2019-September",

pages = "1288--1292",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - A new GAN-based end-to-end TTS training algorithm

AU - Guo, Haohan

AU - Soong, Frank K.

AU - He, Lei

AU - Xie, Lei

PY - 2019

Y1 - 2019

N2 - End-to-end, autoregressive model-based TTS has shown significant performance improvements over the conventional ones. However, the autoregressive module training is affected by the exposure bias, or the mismatch between different distributions of real and predicted data. While real data is provided in training, in testing, predicted data is available only. By introducing both real and generated data sequences in training, we can alleviate the effects of the exposure bias. We propose to use Generative Adversarial Network (GAN) along with the idea of "Professor Forcing" in training. A discriminator in GAN is jointly trained to equalize the difference between real and the predicted data. In AB subjective listening test, the results show that the new approach is preferred over the standard transfer learning with a CMOS improvement of 0.1. Sentence level intelligibility tests also show significant improvement in a pathological test set. The GAN-trained new model is shown more stable than the baseline to produce better alignments for the Tacotron output.

AB - End-to-end, autoregressive model-based TTS has shown significant performance improvements over the conventional ones. However, the autoregressive module training is affected by the exposure bias, or the mismatch between different distributions of real and predicted data. While real data is provided in training, in testing, predicted data is available only. By introducing both real and generated data sequences in training, we can alleviate the effects of the exposure bias. We propose to use Generative Adversarial Network (GAN) along with the idea of "Professor Forcing" in training. A discriminator in GAN is jointly trained to equalize the difference between real and the predicted data. In AB subjective listening test, the results show that the new approach is preferred over the standard transfer learning with a CMOS improvement of 0.1. Sentence level intelligibility tests also show significant improvement in a pathological test set. The GAN-trained new model is shown more stable than the baseline to produce better alignments for the Tacotron output.

KW - Adversarial training

KW - Auto-regressive model

KW - End-to-end TTS synthesis

KW - Generative adversarial model

KW - Speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=85074725276&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2019-2176

DO - 10.21437/Interspeech.2019-2176

M3 - Conference article

AN - SCOPUS:85074725276

SN - 2308-457X

VL - 2019-September

SP - 1288

EP - 1292

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019

Y2 - 15 September 2019 through 19 September 2019

ER -
