Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Yuepeng Jiang; Tao Li; Fengyu Yang; Lei Xie; Meng Meng; Yujun Wang

doi:10.21437/Interspeech.2024-2506

School of Computer Science

Research output: Contribution to journal › Conference article › peer-review

Abstract

Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness. The synthesized samples can be found at: https://rxy-j.github.io/HPMD-TTS/.

Original language	English
Pages (from-to)	2300-2304
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs	https://doi.org/10.21437/Interspeech.2024-2506
State	Published - 2024
Event	25th Interspeech Conference 2024 - Kos Island, Greece. Duration: 1 Sep 2024 → 5 Sep 2024

Keywords

denoising diffusion probabilistic model
prosody modeling
speech synthesis
zero-shot

Cite this

@article{f10f155ea4d34d21b5bb75268af9b428,

title = "Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling",

abstract = "Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness. The synthesized samples can be found at: https://rxy-j.github.io/HPMD-TTS/.",

keywords = "denoising diffusion probabilistic model, prosody modeling, speech synthesis, zero-shot",

author = "Yuepeng Jiang and Tao Li and Fengyu Yang and Lei Xie and Meng Meng and Yujun Wang",

note = "Publisher Copyright: {\textcopyright} 2024 International Speech Communication Association. All rights reserved.; 25th Interspeech Conference 2024 ; Conference date: 01-09-2024 Through 05-09-2024",

year = "2024",

doi = "10.21437/Interspeech.2024-2506",

language = "English",

pages = "2300--2304",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

AU - Jiang, Yuepeng

AU - Li, Tao

AU - Yang, Fengyu

AU - Xie, Lei

AU - Meng, Meng

AU - Wang, Yujun

PY - 2024

Y1 - 2024

N2 - Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness. The synthesized samples can be found at: https://rxy-j.github.io/HPMD-TTS/.

KW - denoising diffusion probabilistic model

KW - prosody modeling

KW - speech synthesis

KW - zero-shot

UR - http://www.scopus.com/inward/record.url?scp=85214834305&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2024-2506

DO - 10.21437/Interspeech.2024-2506

M3 - Conference article

AN - SCOPUS:85214834305

SN - 2308-457X

SP - 2300

EP - 2304

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 25th Interspeech Conference 2024

Y2 - 1 September 2024 through 5 September 2024

ER -
