TY - JOUR
T1 - Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
AU - Jiang, Yuepeng
AU - Li, Tao
AU - Yang, Fengyu
AU - Xie, Lei
AU - Meng, Meng
AU - Wang, Yujun
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness. The synthesized samples can be found at: https://rxy-j.github.io/HPMD-TTS/.
KW - denoising diffusion probabilistic model
KW - prosody modeling
KW - speech synthesis
KW - zero-shot
UR - http://www.scopus.com/inward/record.url?scp=85214834305&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-2506
DO - 10.21437/Interspeech.2024-2506
M3 - Conference article
AN - SCOPUS:85214834305
SN - 2308-457X
SP - 2300
EP - 2304
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -