Building a mixed-lingual neural TTS system with only monolingual data

Liumeng Xue; Wei Song; Guanghui Xu; Lei Xie; Zhizheng Wu

doi:10.21437/Interspeech.2019-3191

School of Computer Science

Research output: Contribution to journal › Conference article › peer-review

15 citations (Scopus)

Abstract

When deploying a Chinese neural Text-to-Speech (TTS) system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an average voice model which is built from multispeaker monolingual data, i.e., Mandarin and English data. On the basis of that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and study the choice of data for model training. We report the findings and discuss the challenges to build a mixed-lingual TTS system with only monolingual data.

Original language	English
Pages (from-to)	2060-2064
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2019-September
DOI	https://doi.org/10.21437/Interspeech.2019-3191
Publication status	Published - 2019
Event	20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria; Duration: 15 Sept 2019 → 19 Sept 2019

Cite this

@article{49610a67b5a94449943c65627f4a0aeb,

title = "Building a mixed-lingual neural TTS system with only monolingual data",

abstract = "When deploying a Chinese neural Text-to-Speech (TTS) system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an average voice model which is built from multispeaker monolingual data, i.e., Mandarin and English data. On the basis of that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and study the choice of data for model training. We report the findings and discuss the challenges to build a mixed-lingual TTS system with only monolingual data.",

keywords = "Encoder-decoder, Mixed-lingual, Speech synthesis",

author = "Liumeng Xue and Wei Song and Guanghui Xu and Lei Xie and Zhizheng Wu",

note = "Publisher Copyright: Copyright {\textcopyright} 2019 ISCA; 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 ; Conference date: 15-09-2019 Through 19-09-2019",

year = "2019",

doi = "10.21437/Interspeech.2019-3191",

language = "English",

volume = "2019-September",

pages = "2060--2064",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Building a mixed-lingual neural TTS system with only monolingual data

AU - Xue, Liumeng

AU - Song, Wei

AU - Xu, Guanghui

AU - Xie, Lei

AU - Wu, Zhizheng

PY - 2019

Y1 - 2019

N2 - When deploying a Chinese neural Text-to-Speech (TTS) system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an average voice model which is built from multispeaker monolingual data, i.e., Mandarin and English data. On the basis of that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and study the choice of data for model training. We report the findings and discuss the challenges to build a mixed-lingual TTS system with only monolingual data.

AB - When deploying a Chinese neural Text-to-Speech (TTS) system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an average voice model which is built from multispeaker monolingual data, i.e., Mandarin and English data. On the basis of that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and study the choice of data for model training. We report the findings and discuss the challenges to build a mixed-lingual TTS system with only monolingual data.

KW - Encoder-decoder

KW - Mixed-lingual

KW - Speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=85094765018&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2019-3191

DO - 10.21437/Interspeech.2019-3191

M3 - Conference article

AN - SCOPUS:85094765018

SN - 2308-457X

VL - 2019-September

SP - 2060

EP - 2064

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019

Y2 - 15 September 2019 through 19 September 2019

ER -
