On the training of DNN-based average voice model for speech synthesis

Shan Yang, Zhizheng Wu, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

13 Citations (Scopus)

Abstract

Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks (DNNs) have significantly improved the performance of SPSS. However, current studies mainly focus on training speaker-dependent DNNs, which generally require a large amount of data from a single speaker. In this work, we perform a systematic analysis of the training of a multi-speaker average voice model (AVM), which is the foundation of the adaptability and controllability of a DNN-based speech synthesis system. Specifically, we employ the i-vector framework to factorise out speaker-specific information, which allows many different speakers to share all of the hidden layers. The resulting speaker identity vector is appended to the linguistic features at the DNN input. We systematically analyse how different implementations of i-vectors and of speaker normalisation affect the model.
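The input construction described in the abstract, where one network is shared by all speakers and each frame's linguistic feature vector has the speaker's i-vector appended, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `augment_input` and `speaker_mvn` are hypothetical, the i-vector is assumed to be already extracted, and "speaker normalisation" is assumed here to mean per-speaker mean-variance normalisation of the acoustic targets, which the abstract does not specify.

```python
from statistics import mean, pstdev
from typing import Dict, List

Frame = List[float]


def augment_input(linguistic: List[Frame], ivector: List[float]) -> List[Frame]:
    """Append the same speaker i-vector to every frame of linguistic features.

    The resulting vectors feed a single DNN whose hidden layers are shared
    by all speakers (hypothetical helper, for illustration only).
    """
    return [frame + ivector for frame in linguistic]


def speaker_mvn(acoustic: Dict[str, List[Frame]]) -> Dict[str, List[Frame]]:
    """Normalise each acoustic dimension to zero mean and unit variance per speaker.

    ASSUMPTION: per-speaker mean-variance normalisation stands in for the
    unspecified "speaker normalisation" of the abstract.
    """
    normalised: Dict[str, List[Frame]] = {}
    for speaker, frames in acoustic.items():
        dims = list(zip(*frames))                     # transpose to per-dimension columns
        mus = [mean(d) for d in dims]
        sigmas = [pstdev(d) or 1.0 for d in dims]     # guard against zero variance
        normalised[speaker] = [
            [(x - m) / s for x, m, s in zip(frame, mus, sigmas)] for frame in frames
        ]
    return normalised


if __name__ == "__main__":
    ling = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.25]]   # two frames of toy linguistic features
    ivec = [0.1, -0.2]                           # toy 2-dimensional i-vector
    print(augment_input(ling, ivec))
    print(speaker_mvn({"spkA": [[1.0, 10.0], [3.0, 10.0]]}))
```

Because the same i-vector is appended to every frame, the hidden layers can be shared across speakers while the input still identifies whose voice to produce; at synthesis time, normalised outputs would need to be mapped back using the target speaker's statistics.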

Original language: English
Host publication title: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9789881476821
DOI
Publication status: Published - 17 Jan 2017
Event: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 - Jeju, South Korea
Duration: 13 Dec 2016 to 16 Dec 2016

Publication series

Name: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

Conference

Conference: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Country/Territory: South Korea
City: Jeju
Period: 13/12/16 to 16/12/16
