Improving Mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias

Fengyu Yang; Shan Yang; Pengcheng Zhu; Pengju Yan; Lei Xie

doi:10.1109/ASRU46091.2019.9003949

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

15 Scopus citations

Abstract

Compared to conventional speech synthesis, end-to-end speech synthesis has achieved much better naturalness with a simpler system-building pipeline. An end-to-end framework can generate natural speech directly from characters for English. But for other languages such as Chinese, recent studies have indicated that extra engineered features, e.g., word boundaries and prosody boundaries, are still needed for model robustness and naturalness, which makes the front-end pipeline as complicated as the traditional approach. To maintain the naturalness of generated speech while discarding as much language-specific expertise as possible, we introduce a novel self-attention-based encoder with learnable Gaussian bias into Tacotron for Mandarin TTS. We evaluate different systems with and without complex prosody information, and the results show that the proposed approach can generate stable and natural speech with minimal language-dependent front-end modules.
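
The abstract does not state how the Gaussian bias enters the encoder, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes one common formulation in which a distance-dependent term, -(j - i)^2 / (2 * sigma^2), is added to the scaled dot-product attention logits before the softmax, so each position attends mostly to its neighbours. The function names, the use of the raw inputs as queries, keys, and values (no learned projections), and the treatment of `sigma` as a plain number rather than a trained parameter are all assumptions made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_bias(length, sigma):
    """Assumed bias form: bias[i][j] = -(j - i)^2 / (2 * sigma^2).

    The bias is zero on the diagonal and grows more negative with distance,
    so a small sigma yields a narrow, local attention window.
    """
    return [
        [-((j - i) ** 2) / (2.0 * sigma ** 2) for j in range(length)]
        for i in range(length)
    ]


def self_attention_weights(x, sigma):
    """Attention weights for a sequence x (list of equal-length vectors).

    Simplification: queries and keys are the inputs themselves.
    """
    n, d = len(x), len(x[0])
    bias = gaussian_bias(n, sigma)
    weights = []
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) + bias[i][j]
            for j in range(n)
        ]
        weights.append(softmax(logits))
    return weights


def attend(x, sigma):
    """Weighted sum of the inputs (used as values) under the biased attention."""
    w = self_attention_weights(x, sigma)
    n, d = len(x), len(x[0])
    return [
        [sum(w[i][j] * x[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Identical inputs make the content term constant, so only the bias shapes the weights.
    seq = [[1.0, 0.0] for _ in range(5)]
    for row in self_attention_weights(seq, sigma=1.0):
        print([round(v, 3) for v in row])
```

In a trainable model, `sigma` would typically be a parameter (possibly one per head or per position) updated by gradient descent; a very large `sigma` makes the bias negligible and recovers ordinary self-attention.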

Original language	English
Title of host publication	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	208-213
Number of pages	6
ISBN (Electronic)	9781728103068
DOIs	https://doi.org/10.1109/ASRU46091.2019.9003949
State	Published - Dec 2019
Event	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore; Duration: 15 Dec 2019 → 18 Dec 2019

Publication series

Name	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Country/Territory	Singapore
City	Singapore
Period	15/12/19 → 18/12/19

Keywords

end-to-end
Gaussian bias
self-attention
speech synthesis
Tacotron

Access to Document

10.1109/ASRU46091.2019.9003949

Cite this

Yang, F., Yang, S., Zhu, P., Yan, P., & Xie, L. (2019). Improving Mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings (pp. 208-213). Article 9003949 (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU46091.2019.9003949

Yang, Fengyu; Yang, Shan; Zhu, Pengcheng et al. / Improving Mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias. 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 208-213 (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings).

@inproceedings{09f1c3f0864d4fa9923a0486d1e10549,
title = "Improving {Mandarin} end-to-end speech synthesis by self-attention and learnable {Gaussian} bias",
abstract = "Compared to conventional speech synthesis, end-to-end speech synthesis has achieved much better naturalness with a simpler system-building pipeline. An end-to-end framework can generate natural speech directly from characters for English. But for other languages such as Chinese, recent studies have indicated that extra engineered features, e.g., word boundaries and prosody boundaries, are still needed for model robustness and naturalness, which makes the front-end pipeline as complicated as the traditional approach. To maintain the naturalness of generated speech while discarding as much language-specific expertise as possible, we introduce a novel self-attention-based encoder with learnable Gaussian bias into Tacotron for Mandarin TTS. We evaluate different systems with and without complex prosody information, and the results show that the proposed approach can generate stable and natural speech with minimal language-dependent front-end modules.",
keywords = "end-to-end, Gaussian bias, self-attention, speech synthesis, Tacotron",
author = "Fengyu Yang and Shan Yang and Pengcheng Zhu and Pengju Yan and Lei Xie",
note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 ; Conference date: 15-12-2019 Through 18-12-2019",
year = "2019",
month = dec,
doi = "10.1109/ASRU46091.2019.9003949",
language = "English",
series = "2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "208--213",
booktitle = "2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings",
}

Yang, F, Yang, S, Zhu, P, Yan, P & Xie, L 2019, Improving Mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias. in 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings, 9003949, 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 208-213, 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019, Singapore, Singapore, 15/12/19. https://doi.org/10.1109/ASRU46091.2019.9003949

Improving Mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias. / Yang, Fengyu; Yang, Shan; Zhu, Pengcheng et al.
2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. p. 208-213 9003949 (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings).

TY  - GEN
T1  - Improving Mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias
AU  - Yang, Fengyu
AU  - Yang, Shan
AU  - Zhu, Pengcheng
AU  - Yan, Pengju
AU  - Xie, Lei
PY  - 2019/12
Y1  - 2019/12
AB  - Compared to conventional speech synthesis, end-to-end speech synthesis has achieved much better naturalness with a simpler system-building pipeline. An end-to-end framework can generate natural speech directly from characters for English. But for other languages such as Chinese, recent studies have indicated that extra engineered features, e.g., word boundaries and prosody boundaries, are still needed for model robustness and naturalness, which makes the front-end pipeline as complicated as the traditional approach. To maintain the naturalness of generated speech while discarding as much language-specific expertise as possible, we introduce a novel self-attention-based encoder with learnable Gaussian bias into Tacotron for Mandarin TTS. We evaluate different systems with and without complex prosody information, and the results show that the proposed approach can generate stable and natural speech with minimal language-dependent front-end modules.
KW  - end-to-end
KW  - Gaussian bias
KW  - self-attention
KW  - speech synthesis
KW  - Tacotron
UR  - http://www.scopus.com/inward/record.url?scp=85081553624&partnerID=8YFLogxK
U2  - 10.1109/ASRU46091.2019.9003949
DO  - 10.1109/ASRU46091.2019.9003949
M3  - Conference contribution
AN  - SCOPUS:85081553624
T3  - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
SP  - 208
EP  - 213
BT  - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
PB  - Institute of Electrical and Electronics Engineers Inc.
T2  - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Y2  - 15 December 2019 through 18 December 2019
ER  -

Yang F, Yang S, Zhu P, Yan P, Xie L. Improving Mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2019. p. 208-213. 9003949. (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings). doi: 10.1109/ASRU46091.2019.9003949
