Exploiting deep sentential context for expressive end-to-end speech synthesis

Fengyu Yang; Shan Yang; Qinghua Wu; Yujun Wang; Lei Xie

doi:10.21437/Interspeech.2020-2423

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

6 Citations (Scopus)

Abstract

Attention-based seq2seq text-to-speech systems, especially those that use self-attention networks (SAN), have achieved state-of-the-art performance. However, an expressive corpus with rich prosody is still challenging to model because 1) prosodic aspects, which span different sentential granularities and mainly determine acoustic expressiveness, are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from a text encoder, which easily collapses to an averaged expression for expressive content. In this paper, we propose a context extractor, built upon a SAN-based text encoder, to fully exploit the sentential context over an expressive corpus for seq2seq-based TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates it to learn a comprehensive sentence representation that enhances the expressiveness of the final generated speech. Specifically, we investigate two methods of context aggregation: 1) direct aggregation, which directly concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contributions of different SAN layers. Experiments on two expressive corpora show that our approach can produce more natural speech with much richer prosodic variations, and that weighted aggregation is superior in modeling expressivity.
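The two aggregation methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mean pooling over tokens, the per-head query vectors (stand-ins for learned parameters), the use of feature slices instead of learned projections, and all function names are assumptions made for illustration only.

```python
import math
import random

def mean_pool(layer):
    """Average a layer's token vectors (T x D) into one D-dim vector (assumed pooling)."""
    n_tokens = len(layer)
    return [sum(col) / n_tokens for col in zip(*layer)]

def direct_aggregation(layers):
    """Direct aggregation: concatenate the pooled outputs of all layers (size L*D)."""
    return [x for layer in layers for x in mean_pool(layer)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_aggregation(layers, queries):
    """Weighted aggregation: multi-head attention over the layer axis.

    `queries` holds one hypothetical query vector per head, each of size
    D // num_heads. Returns the context vector (size D) and per-head weights.
    """
    pooled = [mean_pool(layer) for layer in layers]
    num_heads = len(queries)
    head_dim = len(pooled[0]) // num_heads
    context, all_weights = [], []
    for h, query in enumerate(queries):
        lo, hi = h * head_dim, (h + 1) * head_dim
        keys = [vec[lo:hi] for vec in pooled]  # this head's slice of each layer
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
                  for key in keys]
        weights = softmax(scores)  # one weight per layer, summing to 1
        all_weights.append(weights)
        # Weighted sum of the layer slices (keys double as values in this sketch).
        context.extend(sum(w * key[i] for w, key in zip(weights, keys))
                       for i in range(head_dim))
    return context, all_weights

random.seed(0)
L, T, D, H = 3, 4, 8, 2  # layers, tokens, feature size, heads (toy values)
layers = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
          for _ in range(L)]
queries = [[random.gauss(0, 1) for _ in range(D // H)] for _ in range(H)]

direct = direct_aggregation(layers)
context, weights = weighted_aggregation(layers, queries)
print(len(direct), len(context))  # 24 8
```

In this sketch, direct aggregation grows with the number of layers (L*D values), while weighted aggregation keeps a fixed size D and exposes per-head layer weights that could be inspected to see which layers contribute most.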

Original language	English
Title of host publication	Interspeech 2020
Publisher	International Speech Communication Association
Pages	3436-3440
Number of pages	5
ISBN (Print)	9781713820697
DOI	https://doi.org/10.21437/Interspeech.2020-2423
Publication status	Published - 2020
Event	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China. Duration: 25 Oct 2020 → 29 Oct 2020

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2020-October
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory	China
City	Shanghai
Period	25/10/20 → 29/10/20

Access to document

10.21437/Interspeech.2020-2423

Cite this

Yang, F., Yang, S., Wu, Q., Wang, Y., & Xie, L. (2020). Exploiting deep sentential context for expressive end-to-end speech synthesis. In Interspeech 2020 (pp. 3436-3440). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2020-2423

@inproceedings{eb76d3698349472982760f43ee48ad96,

title = "Exploiting deep sentential context for expressive end-to-end speech synthesis",

abstract = "Attention-based seq2seq text-to-speech systems, especially those that use self-attention networks (SAN), have achieved state-of-the-art performance. However, an expressive corpus with rich prosody is still challenging to model because 1) prosodic aspects, which span different sentential granularities and mainly determine acoustic expressiveness, are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from a text encoder, which easily collapses to an averaged expression for expressive content. In this paper, we propose a context extractor, built upon a SAN-based text encoder, to fully exploit the sentential context over an expressive corpus for seq2seq-based TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates it to learn a comprehensive sentence representation that enhances the expressiveness of the final generated speech. Specifically, we investigate two methods of context aggregation: 1) direct aggregation, which directly concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contributions of different SAN layers. Experiments on two expressive corpora show that our approach can produce more natural speech with much richer prosodic variations, and that weighted aggregation is superior in modeling expressivity.",

keywords = "Speech synthesis, Prosody, Self-attention network",

author = "Fengyu Yang and Shan Yang and Qinghua Wu and Yujun Wang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2020 ISCA; 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 ; Conference date: 25-10-2020 Through 29-10-2020",

year = "2020",

doi = "10.21437/Interspeech.2020-2423",

language = "English",

isbn = "9781713820697",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "3436--3440",

booktitle = "Interspeech 2020",

}

Yang, F, Yang, S, Wu, Q, Wang, Y & Xie, L 2020, Exploiting deep sentential context for expressive end-to-end speech synthesis. in Interspeech 2020. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, International Speech Communication Association, pp. 3436-3440, 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, Shanghai, China, 25/10/20. https://doi.org/10.21437/Interspeech.2020-2423

Exploiting deep sentential context for expressive end-to-end speech synthesis. / Yang, Fengyu; Yang, Shan; Wu, Qinghua et al.
Interspeech 2020. International Speech Communication Association, 2020. p. 3436-3440 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October).

TY - GEN

T1 - Exploiting deep sentential context for expressive end-to-end speech synthesis

AU - Yang, Fengyu

AU - Yang, Shan

AU - Wu, Qinghua

AU - Wang, Yujun

AU - Xie, Lei

PY - 2020

Y1 - 2020

N2 - Attention-based seq2seq text-to-speech systems, especially those that use self-attention networks (SAN), have achieved state-of-the-art performance. However, an expressive corpus with rich prosody is still challenging to model because 1) prosodic aspects, which span different sentential granularities and mainly determine acoustic expressiveness, are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from a text encoder, which easily collapses to an averaged expression for expressive content. In this paper, we propose a context extractor, built upon a SAN-based text encoder, to fully exploit the sentential context over an expressive corpus for seq2seq-based TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates it to learn a comprehensive sentence representation that enhances the expressiveness of the final generated speech. Specifically, we investigate two methods of context aggregation: 1) direct aggregation, which directly concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contributions of different SAN layers. Experiments on two expressive corpora show that our approach can produce more natural speech with much richer prosodic variations, and that weighted aggregation is superior in modeling expressivity.

KW - Speech synthesis

KW - Prosody

KW - Self-attention network

UR - http://www.scopus.com/inward/record.url?scp=85098231286&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2020-2423

DO - 10.21437/Interspeech.2020-2423

M3 - Conference contribution

AN - SCOPUS:85098231286

SN - 9781713820697

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 3436

EP - 3440

BT - Interspeech 2020

PB - International Speech Communication Association

T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020

Y2 - 25 October 2020 through 29 October 2020

ER -
