TY - GEN
T1 - Controllable Emotion Transfer for End-to-End Speech Synthesis
AU - Li, Tao
AU - Yang, Shan
AU - Xue, Liumeng
AU - Xie, Lei
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/24
Y1 - 2021/1/24
N2 - Emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate or expressive enough and suffers from emotion category confusions. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers - one after the reference encoder, one after the decoder output - to enhance the emotion-discriminative ability of the emotion embedding and the predicted mel-spectrum. Second, we adopt a style loss to measure the difference between the generated and reference mel-spectrums. The emotion strength in the synthetic speech can be controlled by adjusting the value of the emotion embedding, as the emotion embedding can be viewed as the feature map of the mel-spectrum. Experiments on emotion transfer and strength control show that the synthetic speech of the proposed method is more accurate and expressive, with fewer emotion category confusions, and that the control of emotion strength is more salient to listeners.
AB - Emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate or expressive enough and suffers from emotion category confusions. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers - one after the reference encoder, one after the decoder output - to enhance the emotion-discriminative ability of the emotion embedding and the predicted mel-spectrum. Second, we adopt a style loss to measure the difference between the generated and reference mel-spectrums. The emotion strength in the synthetic speech can be controlled by adjusting the value of the emotion embedding, as the emotion embedding can be viewed as the feature map of the mel-spectrum. Experiments on emotion transfer and strength control show that the synthetic speech of the proposed method is more accurate and expressive, with fewer emotion category confusions, and that the control of emotion strength is more salient to listeners.
KW - emotion strength control
KW - emotion transfer
KW - speech synthesis
KW - style loss
UR - http://www.scopus.com/inward/record.url?scp=85102600932&partnerID=8YFLogxK
U2 - 10.1109/ISCSLP49672.2021.9362069
DO - 10.1109/ISCSLP49672.2021.9362069
M3 - Conference contribution
AN - SCOPUS:85102600932
T3 - 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
BT - 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
Y2 - 24 January 2021 through 27 January 2021
ER -