Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis

Xiaochun An, Yuxuan Wang, Shan Yang, Zejun Ma, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

17 Citations (Scopus)

Abstract

Although Global Style Tokens (GSTs) are a recently-proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes without explicitly considering the factorization of multiple-level speaking styles. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns multiple-level disentangled representations to model and control different style granularities in synthesized speech. We make hierarchical evaluations conditioned on individual tokens from different GST layers. As the number of layers increases, we tend to observe a coarse to fine style decomposition. For example, the first GST layer learns a good representation of speaker IDs while finer speaking style or emotion variations can be found in higher-level layers. Meanwhile, the proposed model shows good performance of style transfer.
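The stacking described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact wiring, so the residual link (adding each layer's style embedding to the query passed to the next layer), the single-head dot-product attention, the tanh applied to the tokens, and all names and sizes (`gst_layer`, `hierarchical_gst`, `token_style`, `d`, `n_tokens`, `n_layers`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gst_layer(query, tokens):
    """Single-head attention of one query over one bank of style tokens.

    query: (d,) vector; tokens: (n_tokens, d) matrix.
    Returns the attention weights and the weighted sum of tanh(tokens).
    """
    keys = np.tanh(tokens)  # assumption: tanh-squashed tokens
    weights = softmax(keys @ query / np.sqrt(query.shape[-1]))
    return weights, weights @ keys

def hierarchical_gst(ref_embedding, token_banks):
    """Stack GST layers. Each layer's query is the previous query plus the
    previous layer's style embedding (the assumed residual connection)."""
    query = ref_embedding
    all_weights, all_styles = [], []
    for tokens in token_banks:
        weights, style = gst_layer(query, tokens)
        all_weights.append(weights)
        all_styles.append(style)
        query = query + style  # residual link between layers (assumption)
    return all_weights, all_styles

def token_style(token_banks, layer, index):
    # Condition on a single token from one layer, bypassing attention.
    return np.tanh(token_banks[layer][index])

# Hypothetical sizes: 3 layers, 4 tokens per layer, 8-dimensional embeddings.
rng = np.random.default_rng(0)
d, n_tokens, n_layers = 8, 4, 3
banks = [rng.normal(size=(n_tokens, d)) for _ in range(n_layers)]
ref = rng.normal(size=d)  # stands in for a reference-encoder output
weights, styles = hierarchical_gst(ref, banks)
single = token_style(banks, layer=0, index=2)
```

In this sketch, `styles[k]` plays the role of the style embedding produced by layer `k`, and `token_style` mirrors the evaluation in the abstract, where synthesis is conditioned on an individual token from a chosen layer. In a trained model the token banks would be learned jointly with Tacotron rather than sampled at random.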

Original language: English
Title of host publication: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 184-191
Number of pages: 8
ISBN (Electronic): 9781728103068
DOI
Publication status: Published - Dec 2019
Event: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore
Duration: 15 Dec 2019 to 18 Dec 2019

Publication series

Name: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Country/Territory: Singapore
City: Singapore
Period: 15/12/19 to 18/12/19
