Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis

Xiaochun An; Yuxuan Wang; Shan Yang; Zejun Ma; Lei Xie

doi:10.1109/ASRU46091.2019.9003859


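School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Citations (Scopus)

Abstract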

Although Global Style Tokens (GSTs) are a recently proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes without explicitly considering the factorization of multiple-level speaking styles. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns multiple-level disentangled representations to model and control different style granularities in synthesized speech. We make hierarchical evaluations conditioned on individual tokens from different GST layers. As the number of layers increases, we tend to observe a coarse-to-fine style decomposition. For example, the first GST layer learns a good representation of speaker IDs, while finer speaking style or emotion variations can be found in higher-level layers. Meanwhile, the proposed model shows good style transfer performance.
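The abstract does not spell out how the GST layers and residuals are wired together, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes single-head dot-product attention over each token bank, and it assumes the residual means that each layer's query is the previous query plus the previous layer's style output. All names (`gst_layer`, `hierarchical_gst`, `condition_on_token`) and sizes are hypothetical, and the token banks are random rather than learned.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gst_layer(query, tokens):
    """Attend over one bank of style tokens.

    Returns the attention-weighted style embedding and the attention weights.
    The tanh on the tokens is an assumption borrowed from common GST code.
    """
    values = [[math.tanh(t) for t in token] for token in tokens]
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, value)) / scale for value in values]
    weights = softmax(scores)
    style = [
        sum(w * value[d] for w, value in zip(weights, values))
        for d in range(len(query))
    ]
    return style, weights


def hierarchical_gst(reference_embedding, token_banks):
    """Stack GST layers; the residual wiring here is an assumption."""
    query = list(reference_embedding)
    outputs = []
    for tokens in token_banks:
        style, weights = gst_layer(query, tokens)
        outputs.append((style, weights))
        # Assumed residual: the next layer sees the input plus this layer's output.
        query = [q + s for q, s in zip(query, style)]
    return outputs


def condition_on_token(token_banks, layer, index):
    """Use a single token from a chosen layer as the style embedding."""
    return [math.tanh(t) for t in token_banks[layer][index]]


random.seed(0)
DIM, NUM_TOKENS, NUM_LAYERS = 8, 4, 3  # illustrative sizes only
token_banks = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
    for _ in range(NUM_LAYERS)
]
reference_embedding = [random.gauss(0, 1) for _ in range(DIM)]

layer_outputs = hierarchical_gst(reference_embedding, token_banks)
for i, (style, weights) in enumerate(layer_outputs, start=1):
    print(f"layer {i} attention weights:", [round(w, 3) for w in weights])
```

In this sketch, `condition_on_token(token_banks, 0, 1)` mimics the kind of per-token, per-layer conditioning the abstract describes for its hierarchical evaluations; in a real system the resulting embedding would be passed to the Tacotron decoder, which is omitted here.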

Original language	English
Title of host publication	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	184-191
Number of pages	8
ISBN (Electronic)	9781728103068
DOI	https://doi.org/10.1109/ASRU46091.2019.9003859
Publication status	Published - Dec 2019
Event	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore; Duration: 15 Dec 2019 → 18 Dec 2019

Publication series

Name	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Country/Territory	Singapore
City	Singapore
Period	15/12/19 → 18/12/19


Cite this

An, X., Wang, Y., Yang, S., Ma, Z., & Xie, L. (2019). Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings (pp. 184-191). Article 9003859 (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU46091.2019.9003859

An, Xiaochun ; Wang, Yuxuan ; Yang, Shan et al. / Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis. 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 184-191 (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings).

@inproceedings{45aca5f9e6d741e3a394d8471d6527b4,

title = "Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis",

abstract = "Although Global Style Tokens (GSTs) are a recently-proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes without explicitly considering the factorization of multiple-level speaking styles. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns multiple-level disentangled representations to model and control different style granularities in synthesized speech. We make hierarchical evaluations conditioned on individual tokens from different GST layers. As the number of layers increases, we tend to observe a coarse to fine style decomposition. For example, the first GST layer learns a good representation of speaker IDs while finer speaking style or emotion variations can be found in higher-level layers. Meanwhile, the proposed model shows good performance of style transfer.",

keywords = "disentangled representations, hierarchical GST, Speaking style, style transfer",

author = "Xiaochun An and Yuxuan Wang and Shan Yang and Zejun Ma and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 ; Conference date: 15-12-2019 Through 18-12-2019",

year = "2019",

month = dec,

doi = "10.1109/ASRU46091.2019.9003859",

language = "English",

series = "2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "184--191",

booktitle = "2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings",

}

An, X, Wang, Y, Yang, S, Ma, Z & Xie, L 2019, Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis. in 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings., 9003859, 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 184-191, 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019, Singapore, Singapore, 15/12/19. https://doi.org/10.1109/ASRU46091.2019.9003859


TY - GEN

T1 - Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis

AU - An, Xiaochun

AU - Wang, Yuxuan

AU - Yang, Shan

AU - Ma, Zejun

AU - Xie, Lei

PY - 2019/12

Y1 - 2019/12

N2 - Although Global Style Tokens (GSTs) are a recently-proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes without explicitly considering the factorization of multiple-level speaking styles. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns multiple-level disentangled representations to model and control different style granularities in synthesized speech. We make hierarchical evaluations conditioned on individual tokens from different GST layers. As the number of layers increases, we tend to observe a coarse to fine style decomposition. For example, the first GST layer learns a good representation of speaker IDs while finer speaking style or emotion variations can be found in higher-level layers. Meanwhile, the proposed model shows good performance of style transfer.

AB - Although Global Style Tokens (GSTs) are a recently-proposed method to uncover expressive factors of variation in speaking style, they are a mixture of style attributes without explicitly considering the factorization of multiple-level speaking styles. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns multiple-level disentangled representations to model and control different style granularities in synthesized speech. We make hierarchical evaluations conditioned on individual tokens from different GST layers. As the number of layers increases, we tend to observe a coarse to fine style decomposition. For example, the first GST layer learns a good representation of speaker IDs while finer speaking style or emotion variations can be found in higher-level layers. Meanwhile, the proposed model shows good performance of style transfer.

KW - disentangled representations

KW - hierarchical GST

KW - Speaking style

KW - style transfer

UR - http://www.scopus.com/inward/record.url?scp=85081555052&partnerID=8YFLogxK

U2 - 10.1109/ASRU46091.2019.9003859

DO - 10.1109/ASRU46091.2019.9003859

M3 - Conference contribution

AN - SCOPUS:85081555052

T3 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

SP - 184

EP - 191

BT - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019

Y2 - 15 December 2019 through 18 December 2019

ER -

An X, Wang Y, Yang S, Ma Z, Xie L. Learning Hierarchical Representations for Expressive Speaking Style in End-To-End Speech Synthesis. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2019. p. 184-191. 9003859. (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings). doi: 10.1109/ASRU46091.2019.9003859
