On the impact of phoneme alignment in DNN-based speech synthesis

Mei Li, Zhizheng Wu, Lei Xie

Research output: Contribution to conference › Paper › peer-review

4 Scopus citations

Abstract

Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). However, in current implementations, training a DNN-based speech synthesis system requires the phonetic transcripts to be aligned with the corresponding speech frames to obtain the phonetic segmentation, a process called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs), since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on DNN-based speech synthesis systems. Specifically, we compare the performance of DNN-based speech synthesis systems that use manual alignment and HMM-based forced alignment derived from three types of labels: mono-phone, tri-phone, and full-context. Objective and subjective evaluations are conducted in terms of the naturalness of the synthesized speech to compare the different alignments.
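To illustrate the alignment step the abstract describes, here is a minimal sketch of how a phoneme-level segmentation (from manual labeling or HMM forced alignment) might be expanded into frame-level labels for DNN training. The segment times, phoneme set, and 5 ms frame shift below are illustrative assumptions, not data from the paper.

```python
FRAME_SHIFT = 0.005  # seconds per frame; 5 ms is a common choice in SPSS

def alignment_to_frame_labels(segments, frame_shift=FRAME_SHIFT):
    """Expand (phoneme, start_sec, end_sec) segments into one label per frame.

    Each segment contributes round(duration / frame_shift) frames, so the
    resulting sequence can be paired with frame-level acoustic features.
    """
    labels = []
    for phoneme, start, end in segments:
        n_frames = round((end - start) / frame_shift)
        labels.extend([phoneme] * n_frames)
    return labels

# Toy alignment for a 0.10 s utterance (hypothetical values)
segments = [("sil", 0.00, 0.03), ("k", 0.03, 0.05),
            ("ae", 0.05, 0.08), ("t", 0.08, 0.10)]
frame_labels = alignment_to_frame_labels(segments)
print(len(frame_labels))  # 20 frames at a 5 ms shift
```

Errors in the segment boundaries propagate directly into these frame labels, which is why the paper compares alignments obtained from mono-phone, tri-phone, and full-context HMMs against manual segmentation.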

Original language: English
Pages: 196-201
Number of pages: 6
State: Published - 2016
Event: 9th ISCA Speech Synthesis Workshop, SSW 2016, Sunnyvale, United States
Duration: 13 Sep 2016 – 15 Sep 2016

Conference

Conference: 9th ISCA Speech Synthesis Workshop, SSW 2016
Country/Territory: United States
City: Sunnyvale
Period: 13/09/16 – 15/09/16

Keywords

  • acoustic modeling
  • deep neural networks
  • phoneme alignment
  • speech synthesis
