On the training of DNN-based average voice model for speech synthesis

Shan Yang; Zhizheng Wu; Lei Xie

doi:10.1109/APSIPA.2016.7820818

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

13 Scopus citations

Abstract

Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks (DNNs) have significantly improved the performance of SPSS. However, current studies mainly focus on training speaker-dependent DNNs, which generally requires a significant amount of data from a single speaker. In this work, we perform a systematic analysis of the training of a multi-speaker average voice model (AVM), which is the foundation of adaptability and controllability in a DNN-based speech synthesis system. Specifically, we employ the i-vector framework to factorise the speaker-specific information, which allows a variety of speakers to share all the hidden layers. The linguistic features in the DNN input are augmented with the speaker identity vector. We systematically analyse the impact of the implementations of i-vectors and speaker normalisation.
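The input augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the toy feature dimensions, and the speaker labels are all hypothetical, and it only assumes that one fixed speaker vector is appended to every frame's linguistic feature vector before it enters the shared network.

```python
from typing import Dict, List


def augment_with_speaker_vector(
    linguistic_frames: List[List[float]],
    speaker_vector: List[float],
) -> List[List[float]]:
    """Append the same speaker vector to every frame's linguistic features."""
    return [frame + speaker_vector for frame in linguistic_frames]


# Hypothetical toy data: 2-dimensional speaker vectors and
# 3-dimensional linguistic features (real systems use far larger sizes).
speaker_vectors: Dict[str, List[float]] = {
    "spk_a": [0.1, -0.3],
    "spk_b": [0.7, 0.2],
}
frames = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.25],
]

# Every frame of an utterance from "spk_a" receives the same speaker vector,
# so the hidden layers can be shared across speakers.
dnn_input = augment_with_speaker_vector(frames, speaker_vectors["spk_a"])
print(len(dnn_input), len(dnn_input[0]))
```

Because the speaker identity enters only through the input, switching the vector (for example to `speaker_vectors["spk_b"]`) changes the conditioning without touching any network weights.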

Original language	English
Title of host publication	2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9789881476821
DOIs	https://doi.org/10.1109/APSIPA.2016.7820818
State	Published - 17 Jan 2017
Event	2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 - Jeju, Korea, Republic of; Duration: 13 Dec 2016 → 16 Dec 2016

Publication series

Name	2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

Conference

Conference	2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Country/Territory	Korea, Republic of
City	Jeju
Period	13/12/16 → 16/12/16

Access to Document

10.1109/APSIPA.2016.7820818

Cite this

Yang, S., Wu, Z., & Xie, L. (2017). On the training of DNN-based average voice model for speech synthesis. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016, Article 7820818 (2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/APSIPA.2016.7820818

Yang, Shan ; Wu, Zhizheng ; Xie, Lei. / On the training of DNN-based average voice model for speech synthesis. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016. Institute of Electrical and Electronics Engineers Inc., 2017. (2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016).

@inproceedings{fceceb9eaeec41b3baaaefb26e757991,

title = "On the training of DNN-based average voice model for speech synthesis",

abstract = "Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks (DNNs) have significantly improved the performance of SPSS. However, current studies are mainly focusing on the training of speaker-dependent DNNs, which generally requires a significant amount of data from a single speaker. In this work, we perform a systematic analysis of the training of multi-speaker average voice model (AVM), which is the foundation of adaptability and controllability of a DNN-based speech synthesis system. Specifically, we employ the i-vector framework to factorise the speaker specific information, which allows a variety of speakers to share all the hidden layers. And the speaker identity vector is augmented with linguistic features in the DNN input. We systematically analyse the impact of the implementations of i-vectors and speaker normalisation.",

author = "Shan Yang and Zhizheng Wu and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2016 Asia Pacific Signal and Information Processing Association.; 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 ; Conference date: 13-12-2016 Through 16-12-2016",

year = "2017",

month = jan,

day = "17",

doi = "10.1109/APSIPA.2016.7820818",

language = "English",

series = "2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016",

}

Yang, S, Wu, Z & Xie, L 2017, On the training of DNN-based average voice model for speech synthesis. in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016, 7820818, 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016, Institute of Electrical and Electronics Engineers Inc., 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016, Jeju, Korea, Republic of, 13/12/16. https://doi.org/10.1109/APSIPA.2016.7820818

On the training of DNN-based average voice model for speech synthesis. / Yang, Shan; Wu, Zhizheng; Xie, Lei.
2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016. Institute of Electrical and Electronics Engineers Inc., 2017. 7820818 (2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016).

TY - GEN

T1 - On the training of DNN-based average voice model for speech synthesis

AU - Yang, Shan

AU - Wu, Zhizheng

AU - Xie, Lei

PY - 2017/1/17

Y1 - 2017/1/17

N2 - Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks (DNNs) have significantly improved the performance of SPSS. However, current studies are mainly focusing on the training of speaker-dependent DNNs, which generally requires a significant amount of data from a single speaker. In this work, we perform a systematic analysis of the training of multi-speaker average voice model (AVM), which is the foundation of adaptability and controllability of a DNN-based speech synthesis system. Specifically, we employ the i-vector framework to factorise the speaker specific information, which allows a variety of speakers to share all the hidden layers. And the speaker identity vector is augmented with linguistic features in the DNN input. We systematically analyse the impact of the implementations of i-vectors and speaker normalisation.

AB - Adaptability and controllability are the major advantages of statistical parametric speech synthesis (SPSS) over unit-selection synthesis. Recently, deep neural networks (DNNs) have significantly improved the performance of SPSS. However, current studies are mainly focusing on the training of speaker-dependent DNNs, which generally requires a significant amount of data from a single speaker. In this work, we perform a systematic analysis of the training of multi-speaker average voice model (AVM), which is the foundation of adaptability and controllability of a DNN-based speech synthesis system. Specifically, we employ the i-vector framework to factorise the speaker specific information, which allows a variety of speakers to share all the hidden layers. And the speaker identity vector is augmented with linguistic features in the DNN input. We systematically analyse the impact of the implementations of i-vectors and speaker normalisation.

UR - http://www.scopus.com/inward/record.url?scp=85013762788&partnerID=8YFLogxK

U2 - 10.1109/APSIPA.2016.7820818

DO - 10.1109/APSIPA.2016.7820818

M3 - Conference contribution

AN - SCOPUS:85013762788

T3 - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

BT - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

Y2 - 13 December 2016 through 16 December 2016

ER -

Yang S, Wu Z, Xie L. On the training of DNN-based average voice model for speech synthesis. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016. Institute of Electrical and Electronics Engineers Inc. 2017. 7820818. (2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016). doi: 10.1109/APSIPA.2016.7820818
