Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Dake Guo; Xinfa Zhu; Liumeng Xue; Yongmao Zhang; Wenjie Tian; Lei Xie

doi:10.21437/Interspeech.2024-1862

School of Computer Science

Research output: Contribution to journal › Conference article › Peer-reviewed

Abstract

Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.

Original language	English
Pages (from-to)	1790-1794
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOI	https://doi.org/10.21437/Interspeech.2024-1862
Publication status	Published - 2024
Event	25th Interspeech Conference 2024 - Kos Island, Greece. Duration: 1 Sep 2024 → 5 Sep 2024

Cite this

@article{47a85906e5cc4a1382604b9063a8107b,

title = "Text-aware and Context-aware Expressive Audiobook Speech Synthesis",

abstract = "Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.",

keywords = "audiobook speech synthesis, context-aware, style modeling, text-aware",

author = "Dake Guo and Xinfa Zhu and Liumeng Xue and Yongmao Zhang and Wenjie Tian and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2024 International Speech Communication Association. All rights reserved.; 25th Interspeech Conference 2024 ; Conference date: 01-09-2024 Through 05-09-2024",

year = "2024",

doi = "10.21437/Interspeech.2024-1862",

language = "English",

pages = "1790--1794",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Text-aware and Context-aware Expressive Audiobook Speech Synthesis

AU - Guo, Dake

AU - Zhu, Xinfa

AU - Xue, Liumeng

AU - Zhang, Yongmao

AU - Tian, Wenjie

AU - Xie, Lei

PY - 2024

Y1 - 2024

N2 - Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.

AB - Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.

KW - audiobook speech synthesis

KW - context-aware

KW - style modeling

KW - text-aware

UR - http://www.scopus.com/inward/record.url?scp=85214824102&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2024-1862

DO - 10.21437/Interspeech.2024-1862

M3 - Conference article

AN - SCOPUS:85214824102

SN - 2308-457X

SP - 1790

EP - 1794

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 25th Interspeech Conference 2024

Y2 - 1 September 2024 through 5 September 2024

ER -
