On the training of DNN-based average voice model for speech synthesis

Shan Yang, Zhizheng Wu, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks (DNNs) have significantly improved the performance of SPSS. However, current studies mainly focus on the training of speaker-dependent DNNs, which generally requires a significant amount of data from a single speaker. In this work, we perform a systematic analysis of the training of a multi-speaker average voice model (AVM), which is the foundation of the adaptability and controllability of a DNN-based speech synthesis system. Specifically, we employ the i-vector framework to factorise speaker-specific information, which allows a variety of speakers to share all the hidden layers, while the speaker identity vector is augmented with the linguistic features at the DNN input. We systematically analyse the impact of the i-vector implementation and of speaker normalisation.
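The input augmentation described in the abstract can be sketched as below: the speaker i-vector is tiled across frames and concatenated with the frame-level linguistic features, so every speaker shares the same hidden layers while speaker identity enters only through the input. All dimensions, layer sizes, and the single tanh hidden layer are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's configuration):
LING_DIM = 300   # linguistic (text-derived) features per frame
IVEC_DIM = 32    # speaker i-vector dimensionality
HIDDEN = 256     # width of the hidden layer shared across speakers
OUT_DIM = 60     # acoustic output features per frame

def forward(ling_feats, ivector, params):
    """One forward pass of a shared-hidden-layer DNN.

    The speaker i-vector is tiled across frames and concatenated with
    the linguistic features, so all speakers share every hidden layer
    while the input carries speaker identity.
    """
    n_frames = ling_feats.shape[0]
    ivec_tiled = np.tile(ivector, (n_frames, 1))          # (n_frames, IVEC_DIM)
    x = np.concatenate([ling_feats, ivec_tiled], axis=1)  # augmented input
    h = np.tanh(x @ params["W1"] + params["b1"])          # shared hidden layer
    return h @ params["W2"] + params["b2"]                # acoustic output

params = {
    "W1": rng.standard_normal((LING_DIM + IVEC_DIM, HIDDEN)) * 0.01,
    "b1": np.zeros(HIDDEN),
    "W2": rng.standard_normal((HIDDEN, OUT_DIM)) * 0.01,
    "b2": np.zeros(OUT_DIM),
}

ling = rng.standard_normal((10, LING_DIM))  # 10 frames of linguistic features
ivec = rng.standard_normal(IVEC_DIM)        # one speaker's i-vector
acoustic = forward(ling, ivec, params)
print(acoustic.shape)  # (10, 60)
```

Because speaker identity is confined to the input vector, adapting to a new speaker only requires extracting that speaker's i-vector; the shared weights need not be retrained from scratch.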

Original language: English
Title of host publication: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9789881476821
DOIs
State: Published - 17 Jan 2017
Event: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 - Jeju, Korea, Republic of
Duration: 13 Dec 2016 - 16 Dec 2016

Publication series

Name: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

Conference

Conference: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Country/Territory: Korea, Republic of
City: Jeju
Period: 13/12/16 - 16/12/16
