On the impact of phoneme alignment in DNN-based speech synthesis

Mei Li; Zhizheng Wu; Lei Xie

School of Computer Science

Research output: Contribution to conference › Paper › Peer-reviewed

4 Citations (Scopus)

Abstract

Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). However, in current implementations, when training a DNN-based speech synthesis system, phonetic transcripts are required to be aligned with the corresponding speech frames to obtain the phonetic segmentation, called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs) since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on the DNN-based speech synthesis system. Specifically, we compare the performances of different DNN-based speech synthesis systems, which use manual alignment and HMM-based forced alignment from three types of labels: HMM mono-phone, tri-phone and full-context. Objective and subjective evaluations are conducted in terms of the naturalness of synthesized speech to compare the performances of different alignments.

Original language	English
Pages	196-201
Number of pages	6
Publication status	Published - 2016
Event	9th ISCA Speech Synthesis Workshop, SSW 2016 - Sunnyvale, United States. Duration: 13 Sep 2016 → 15 Sep 2016

Conference

Conference	9th ISCA Speech Synthesis Workshop, SSW 2016
Country/Territory	United States
City	Sunnyvale
Period	13/09/16 → 15/09/16

Cite this

@conference{432b584bf4024a12a8a24d39ca5aacc4,

title = "On the impact of phoneme alignment in DNN-based speech synthesis",

abstract = "Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). However, in current implementations, when training a DNN-based speech synthesis system, phonetic transcripts are required to be aligned with the corresponding speech frames to obtain the phonetic segmentation, called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs) since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on the DNN-based speech synthesis system. Specifically, we compare the performances of different DNN-based speech synthesis systems, which use manual alignment and HMM-based forced alignment from three types of labels: HMM mono-phone, tri-phone and full-context. Objective and subjective evaluations are conducted in terms of the naturalness of synthesized speech to compare the performances of different alignments.",

keywords = "acoustic modeling, deep neural networks, phoneme alignment, Speech synthesis",

author = "Mei Li and Zhizheng Wu and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2016, 9th ISCA Speech Synthesis Workshop, SSW 2016. All rights reserved.; 9th ISCA Speech Synthesis Workshop, SSW 2016 ; Conference date: 13-09-2016 Through 15-09-2016",

year = "2016",

language = "English",

pages = "196--201",

}

TY - CONF

T1 - On the impact of phoneme alignment in DNN-based speech synthesis

AU - Li, Mei

AU - Wu, Zhizheng

AU - Xie, Lei

PY - 2016

Y1 - 2016

AB - Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). However, in current implementations, when training a DNN-based speech synthesis system, phonetic transcripts are required to be aligned with the corresponding speech frames to obtain the phonetic segmentation, called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs) since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on the DNN-based speech synthesis system. Specifically, we compare the performances of different DNN-based speech synthesis systems, which use manual alignment and HMM-based forced alignment from three types of labels: HMM mono-phone, tri-phone and full-context. Objective and subjective evaluations are conducted in terms of the naturalness of synthesized speech to compare the performances of different alignments.

KW - acoustic modeling

KW - deep neural networks

KW - phoneme alignment

KW - Speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=85040200109&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85040200109

SP - 196

EP - 201

T2 - 9th ISCA Speech Synthesis Workshop, SSW 2016

Y2 - 13 September 2016 through 15 September 2016

ER -
