Exploiting deep sentential context for expressive end-to-end speech synthesis

Fengyu Yang, Shan Yang, Qinghua Wu, Yujun Wang, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Citations (Scopus)

Abstract

Attention-based seq2seq text-to-speech systems, especially those using self-attention networks (SAN), have achieved state-of-the-art performance. However, an expressive corpus with rich prosody remains challenging to model because 1) prosodic aspects, which span different sentential granularities and mainly determine acoustic expressiveness, are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from a text encoder, which easily collapses to an averaged expression for expressive content. In this paper, we propose a context extractor, built upon a SAN-based text encoder, to fully exploit the sentential context over an expressive corpus for seq2seq-based TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates it to learn a comprehensive sentence representation that enhances the expressiveness of the final generated speech. Specifically, we investigate two methods of context aggregation: 1) direct aggregation, which directly concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contributions of different SAN layers. Experiments on two expressive corpora show that our approach can produce more natural speech with much richer prosodic variations, and that weighted aggregation is superior in modeling expressivity.
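
The two aggregation strategies named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: each SAN layer's token outputs are mean-pooled into one sentence vector, the attention query is supplied by the caller (in a trained model it would be learned), and the learned projection matrices of standard multi-head attention are omitted. The function names `mean_pool`, `direct_aggregation`, and `weighted_aggregation` are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average one layer's per-token vectors into a sentence vector (assumed pooling)."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def direct_aggregation(layer_vectors):
    """Direct aggregation: concatenate the sentence vectors of all layers."""
    return [value for vector in layer_vectors for value in vector]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def weighted_aggregation(layer_vectors, query, num_heads):
    """Weighted aggregation: simplified multi-head attention over layers.

    Each head scores every layer with a scaled dot product against its slice
    of the query, then mixes the layers with the resulting softmax weights.
    Learned projections are left out (assumption, for brevity).
    """
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = dim // num_heads
    output, weights_per_head = [], []
    for head in range(num_heads):
        start, stop = head * head_dim, (head + 1) * head_dim
        query_slice = query[start:stop]
        layer_slices = [vector[start:stop] for vector in layer_vectors]
        scores = [
            sum(q * k for q, k in zip(query_slice, layer_slice)) / math.sqrt(head_dim)
            for layer_slice in layer_slices
        ]
        weights = softmax(scores)
        weights_per_head.append(weights)
        for i in range(head_dim):
            output.append(sum(w * s[i] for w, s in zip(weights, layer_slices)))
    return output, weights_per_head


# Toy input: 3 layers, each with 2 tokens of size 4 (made-up numbers).
layers = [
    [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 2.0]],
    [[0.5, 1.5, 1.0, 1.0], [1.5, 0.5, 1.0, 3.0]],
    [[2.0, 2.0, 0.0, 4.0], [0.0, 2.0, 2.0, 0.0]],
]
sentence_vectors = [mean_pool(layer) for layer in layers]
concatenated = direct_aggregation(sentence_vectors)
mixed, layer_weights = weighted_aggregation(sentence_vectors, [1.0, 0.0, 0.0, 1.0], 2)
```

Direct aggregation grows the representation with the number of layers, while the weighted variant keeps the original vector size and exposes per-head layer weights that can be inspected.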

Original language: English
Title of host publication: Interspeech 2020
Publisher: International Speech Communication Association
Pages: 3436-3440
Number of pages: 5
ISBN (Print): 9781713820697
DOI
Publication status: Published - 2020
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: 25 Oct 2020 to 29 Oct 2020

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2020-October
ISSN (Print): 2308-457X
ISSN (Electronic): 1990-9772

Conference

Conference: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory: China
City: Shanghai
Period: 25/10/20 to 29/10/20
