TY - CONF
T1 - On the impact of phoneme alignment in DNN-based speech synthesis
AU - Li, Mei
AU - Wu, Zhizheng
AU - Xie, Lei
N1 - Publisher Copyright: © 2016, 9th ISCA Speech Synthesis Workshop, SSW 2016. All rights reserved.
PY - 2016
Y1 - 2016
AB - Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). However, in current implementations, training a DNN-based speech synthesis system requires phonetic transcripts to be aligned with the corresponding speech frames to obtain the phonetic segmentation, a step called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs), since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on DNN-based speech synthesis. Specifically, we compare the performance of different DNN-based speech synthesis systems, which use manual alignment and HMM-based forced alignment from three types of labels: HMM mono-phone, tri-phone and full-context. Objective and subjective evaluations are conducted in terms of the naturalness of synthesized speech to compare the performance of the different alignments.
KW - acoustic modeling
KW - deep neural networks
KW - phoneme alignment
KW - speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85040200109&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85040200109
SP - 196
EP - 201
T2 - 9th ISCA Speech Synthesis Workshop, SSW 2016
Y2 - 13 September 2016 through 15 September 2016
ER -