Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework

Shan Yang; Lei Xie; Xiao Chen; Xiaoyan Lou; Xuan Zhu; Dongyan Huang; Haizhou Li

doi:10.1109/ASRU.2017.8269003

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

35 Scopus citations

Abstract

In this paper, we aim to improve the quality of synthesized speech in statistical parametric speech synthesis (SPSS) using a generative adversarial network (GAN). In particular, we propose a novel architecture that combines the traditional acoustic loss function with the GAN's discriminative loss under a multi-task learning (MTL) framework. The mean squared error (MSE) is usually used to estimate the parameters of deep neural networks, but it considers only the numerical difference between the raw audio and the synthesized one. To mitigate this problem, we introduce the GAN as a second task that determines whether the input is natural speech under specific conditions. In this MTL framework, the MSE optimization improves the stability of the GAN, while the GAN produces samples with a distribution closer to that of natural speech. Listening tests show that the multi-task architecture generates speech that is more natural, and better matches human perception, than conventional methods.
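As a rough illustration of the multi-task objective described in the abstract, the following Python sketch adds an adversarial term to the MSE loss for the generator. The function names, the non-saturating log form of the adversarial term, and the `adv_weight` parameter are assumptions made for illustration only; the abstract does not state how the two losses are formulated or weighted.

```python
import math


def mse_loss(predicted, target):
    """Mean squared error between two equal-length sequences of acoustic features."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def adversarial_loss(d_fake):
    """Generator-side adversarial loss, assumed here to be -mean(log D(G(x))).

    d_fake holds the discriminator's probabilities (in (0, 1]) that each
    generated sample is natural speech.
    """
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(max(d, eps)) for d in d_fake) / len(d_fake)


def multitask_generator_loss(predicted, target, d_fake, adv_weight=1.0):
    """Sum of the MSE task and the adversarial task.

    adv_weight is a hypothetical trade-off parameter, not taken from the paper.
    """
    return mse_loss(predicted, target) + adv_weight * adversarial_loss(d_fake)


if __name__ == "__main__":
    # Toy values: two predicted features, two discriminator outputs.
    loss = multitask_generator_loss([1.0, 2.0], [1.0, 4.0], [0.5, 0.5], adv_weight=0.5)
    print(f"combined generator loss: {loss:.4f}")
```

In this sketch the MSE term anchors the generator to the target features, while the adversarial term shrinks as the discriminator assigns higher probability to the generated samples.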

Original language	English
Title of host publication	2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	685-691
Number of pages	7
ISBN (Electronic)	9781509047888
DOIs	https://doi.org/10.1109/ASRU.2017.8269003
State	Published - 2 Jul 2017
Event	2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan; Duration: 16 Dec 2017 → 20 Dec 2017

Publication series

Name	2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Volume	2018-January

Conference

Conference	2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Country/Territory	Japan
City	Okinawa
Period	16/12/17 → 20/12/17

Keywords

conditional generative adversarial network
deep neural network
generative adversarial network
multi-task learning
Statistical parametric speech synthesis

Cite this

Yang, S., Xie, L., Chen, X., Lou, X., Zhu, X., Huang, D., & Li, H. (2017). Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings (pp. 685-691). (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings; Vol. 2018-January). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU.2017.8269003

Yang, Shan ; Xie, Lei ; Chen, Xiao et al. / Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework. 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 685-691 (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings).

@inproceedings{351187a9629744e5b44fa7a4e4eefdfa,

title = "Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework",

abstract = "In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The mean squared error (MSE) is usually used to estimate the parameters of deep neural networks, which only considers the numerical difference between the raw audio and the synthesized one. To mitigate this problem, we introduce the GAN as a second task to determine if the input is a natural speech with specific conditions. In this MTL framework, the MSE optimization improves the stability of GAN, and at the same time GAN produces samples with a distribution closer to natural speech. Listening tests show that the multi-task architecture can generate more natural speech that satisfies human perception than the conventional methods.",

keywords = "conditional generative adversarial network, deep neural network, generative adversarial network, multi-task learning, Statistical parametric speech synthesis",

author = "Shan Yang and Lei Xie and Xiao Chen and Xiaoyan Lou and Xuan Zhu and Dongyan Huang and Haizhou Li",

note = "Publisher Copyright: {\textcopyright} 2017 IEEE.; 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 ; Conference date: 16-12-2017 Through 20-12-2017",

year = "2017",

month = jul,

day = "2",

doi = "10.1109/ASRU.2017.8269003",

language = "English",

series = "2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "685--691",

booktitle = "2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings",

}

Yang, S, Xie, L, Chen, X, Lou, X, Zhu, X, Huang, D & Li, H 2017, Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework. in 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings. 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings, vol. 2018-January, Institute of Electrical and Electronics Engineers Inc., pp. 685-691, 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, 16/12/17. https://doi.org/10.1109/ASRU.2017.8269003

Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework. / Yang, Shan; Xie, Lei; Chen, Xiao et al.
2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2017. p. 685-691 (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings; Vol. 2018-January).

TY - GEN

T1 - Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework

AU - Yang, Shan

AU - Xie, Lei

AU - Chen, Xiao

AU - Lou, Xiaoyan

AU - Zhu, Xuan

AU - Huang, Dongyan

AU - Li, Haizhou

PY - 2017/7/2

Y1 - 2017/7/2

AB - In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The mean squared error (MSE) is usually used to estimate the parameters of deep neural networks, which only considers the numerical difference between the raw audio and the synthesized one. To mitigate this problem, we introduce the GAN as a second task to determine if the input is a natural speech with specific conditions. In this MTL framework, the MSE optimization improves the stability of GAN, and at the same time GAN produces samples with a distribution closer to natural speech. Listening tests show that the multi-task architecture can generate more natural speech that satisfies human perception than the conventional methods.

KW - conditional generative adversarial network

KW - deep neural network

KW - generative adversarial network

KW - multi-task learning

KW - Statistical parametric speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=85047518812&partnerID=8YFLogxK

U2 - 10.1109/ASRU.2017.8269003

DO - 10.1109/ASRU.2017.8269003

M3 - Conference contribution

AN - SCOPUS:85047518812

T3 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings

SP - 685

EP - 691

BT - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017

Y2 - 16 December 2017 through 20 December 2017

ER -

Yang S, Xie L, Chen X, Lou X, Zhu X, Huang D et al. Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2017. p. 685-691. (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings). doi: 10.1109/ASRU.2017.8269003
