Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework

Shan Yang, Lei Xie, Xiao Chen, Xiaoyan Lou, Xuan Zhu, Dongyan Huang, Haizhou Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

35 Citations (Scopus)

Abstract

In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The mean squared error (MSE) is usually used to estimate the parameters of deep neural networks, which only considers the numerical difference between the raw audio and the synthesized one. To mitigate this problem, we introduce the GAN as a second task to determine whether the input is natural speech with specific conditions. In this MTL framework, the MSE optimization improves the stability of the GAN, and at the same time the GAN produces samples with a distribution closer to that of natural speech. Listening tests show that the multi-task architecture can generate more natural speech that better satisfies human perception than the conventional methods.
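The multi-task objective described above can be sketched as the sum of a frame-level MSE acoustic loss and the generator's adversarial loss. This is a minimal NumPy illustration, not the paper's implementation; the interpolation weight `adv_weight` and the non-saturating form of the adversarial term are assumptions for illustration only.

```python
import numpy as np

def mse_loss(predicted, natural):
    """Conventional acoustic loss: mean squared error between
    predicted and natural acoustic feature matrices (T frames x D dims)."""
    return float(np.mean((predicted - natural) ** 2))

def adversarial_loss(disc_scores):
    """Generator-side GAN term, -log D(G(x)), where disc_scores are the
    discriminator's probabilities that generated frames are natural speech.
    (Non-saturating form assumed for illustration.)"""
    eps = 1e-12  # guard against log(0)
    return float(-np.mean(np.log(disc_scores + eps)))

def multitask_loss(predicted, natural, disc_scores, adv_weight=0.5):
    """Combined MTL objective: the MSE task stabilizes training while the
    GAN task pulls samples toward the natural-speech distribution.
    `adv_weight` is a hypothetical balancing hyperparameter."""
    return mse_loss(predicted, natural) + adv_weight * adversarial_loss(disc_scores)

# Toy example with random acoustic feature frames.
rng = np.random.default_rng(0)
natural = rng.standard_normal((10, 25))
predicted = natural + 0.1 * rng.standard_normal((10, 25))
disc_scores = rng.uniform(0.4, 0.9, size=10)  # discriminator outputs in (0, 1)
print(multitask_loss(predicted, natural, disc_scores))
```

Because the discriminator assigns probabilities below 1 to imperfect samples, the adversarial term is strictly positive, so the combined loss exceeds the plain MSE loss whenever `adv_weight > 0`.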

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 685-691
Number of pages: 7
ISBN (Electronic): 9781509047888
DOI
Publication status: Published - 2 Jul 2017
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration: 16 Dec 2017 → 20 Dec 2017

Publication series

Name: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Volume: 2018-January

Conference

Conference: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Country/Territory: Japan
City: Okinawa
Period: 16/12/17 → 20/12/17
