TY - GEN
T1 - Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis
AU - An, Xiaochun
AU - Wang, Yuxuan
AU - Yang, Shan
AU - Ma, Zejun
AU - Xie, Lei
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/12
Y1 - 2019/12
N2 - Although Global Style Tokens (GSTs) are a recently proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes and do not explicitly consider the factorization of multiple-level speaking styles. In this work, we introduce a hierarchical GST architecture with residuals into Tacotron, which learns multiple-level disentangled representations to model and control different style granularities in synthesized speech. We perform hierarchical evaluations conditioned on individual tokens from different GST layers. As the number of layers increases, we tend to observe a coarse-to-fine style decomposition. For example, the first GST layer learns a good representation of speaker IDs, while finer speaking style or emotion variations can be found in higher-level layers. The proposed model also performs well on style transfer.
KW - disentangled representations
KW - hierarchical GST
KW - speaking style
KW - style transfer
UR - http://www.scopus.com/inward/record.url?scp=85081555052&partnerID=8YFLogxK
U2 - 10.1109/ASRU46091.2019.9003859
DO - 10.1109/ASRU46091.2019.9003859
M3 - Conference contribution
AN - SCOPUS:85081555052
T3 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
SP - 184
EP - 191
BT - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Y2 - 15 December 2019 through 18 December 2019
ER -