Controlling Emotion Strength with Relative Attribute for End-To-End Speech Synthesis

Xiaolian Zhu; Shan Yang; Geng Yang; Lei Xie

doi:10.1109/ASRU46091.2019.9003829

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

46 Scopus citations

Abstract

Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional speech synthesis models, and several approaches like global style tokens are proposed to explore the style controllability of the end-to-end model. Although the existing methods show good performance in style disentanglement and transfer, it is still unable to control the explicit emotion of generated speech. In this paper, we mainly focus on the subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector and a continuous simple scalar, respectively. The continuous strength controller is learned by a ranking function according to the relative attribute measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level subtle emotion strength. Experiments show that our method can effectively improve the controllability for an expressive end-to-end model.
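
The abstract says the strength controller is learned by a ranking function over relative attributes, but gives no formulation. As a rough illustration only, the sketch below learns a linear ranker with a pairwise hinge loss so that emotional samples score higher than neutral ones, then rescales the scores to [0, 1] as a strength value. The linear model, the hinge loss, the min-max scaling, the synthetic 4-dimensional features, and the names `learn_ranker` and `strength` are all assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def learn_ranker(emotional, neutral, epochs=200, lr=0.1, reg=1e-3):
    """Learn w so that w @ x_emotional exceeds w @ x_neutral by a margin of 1.

    Assumption: a linear ranker trained by subgradient descent on a
    pairwise hinge loss; the paper's actual ranking function may differ.
    """
    dim = emotional.shape[1]
    # Difference vectors for every (emotional, neutral) pair.
    diffs = (emotional[:, None, :] - neutral[None, :, :]).reshape(-1, dim)
    w = np.zeros(dim)
    for _ in range(epochs):
        margins = diffs @ w
        violated = diffs[margins < 1.0]  # pairs not yet ranked with margin 1
        # Subgradient of the mean hinge loss plus an L2 penalty.
        grad = reg * w - violated.sum(axis=0) / len(diffs)
        w -= lr * grad
    return w

def strength(w, features):
    """Map ranking scores to [0, 1] by min-max scaling (an assumption)."""
    scores = features @ w
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

# Synthetic stand-ins for acoustic features (hypothetical data).
rng = np.random.default_rng(0)
neutral = rng.normal(0.0, 1.0, size=(50, 4))
emotional = rng.normal(0.0, 1.0, size=(50, 4)) + np.array([3.0, 0.0, 0.0, 0.0])

w = learn_ranker(emotional, neutral)
s = strength(w, np.vstack([neutral, emotional]))
print("mean strength, neutral:  ", s[:50].mean())
print("mean strength, emotional:", s[50:].mean())
```

In a setup like this, the scalar produced by `strength` could serve as the continuous conditioning input alongside a discrete emotion-category vector, which is the division of roles the abstract describes.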

Original language	English
Title of host publication	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	192-199
Number of pages	8
ISBN (Electronic)	9781728103068
DOIs	https://doi.org/10.1109/ASRU46091.2019.9003829
State	Published - Dec 2019
Event	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore. Duration: 15 Dec 2019 → 18 Dec 2019

Publication series

Name	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Country/Territory	Singapore
City	Singapore
Period	15/12/19 → 18/12/19

Keywords

Emotion strength
end-to-end
relative attributes
speech synthesis
text-to-speech

Cite this

Zhu, X., Yang, S., Yang, G., & Xie, L. (2019). Controlling Emotion Strength with Relative Attribute for End-To-End Speech Synthesis. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings (pp. 192-199). Article 9003829 (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU46091.2019.9003829

Zhu, Xiaolian ; Yang, Shan ; Yang, Geng et al. / Controlling Emotion Strength with Relative Attribute for End-To-End Speech Synthesis. 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 192-199 (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings).

@inproceedings{11c5f5aea2bf46cc8404eb1c09fdd9b3,

title = "Controlling Emotion Strength with Relative Attribute for End-To-End Speech Synthesis",

abstract = "Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional speech synthesis models, and several approaches like global style tokens are proposed to explore the style controllability of the end-to-end model. Although the existing methods show good performance in style disentanglement and transfer, it is still unable to control the explicit emotion of generated speech. In this paper, we mainly focus on the subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector and a continuous simple scalar, respectively. The continuous strength controller is learned by a ranking function according to the relative attribute measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level subtle emotion strength. Experiments show that our method can effectively improve the controllability for an expressive end-to-end model.",

keywords = "Emotion strength, end-to-end, relative attributes, speech synthesis, text-to-speech",

author = "Xiaolian Zhu and Shan Yang and Geng Yang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 ; Conference date: 15-12-2019 Through 18-12-2019",

year = "2019",

month = dec,

doi = "10.1109/ASRU46091.2019.9003829",

language = "English",

series = "2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "192--199",

booktitle = "2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings",

}

Zhu, X, Yang, S, Yang, G & Xie, L 2019, Controlling Emotion Strength with Relative Attribute for End-To-End Speech Synthesis. in 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings., 9003829, 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 192-199, 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019, Singapore, Singapore, 15/12/19. https://doi.org/10.1109/ASRU46091.2019.9003829

Controlling Emotion Strength with Relative Attribute for End-To-End Speech Synthesis. / Zhu, Xiaolian; Yang, Shan; Yang, Geng et al.
2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. p. 192-199 9003829 (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings).

TY - GEN

T1 - Controlling Emotion Strength with Relative Attribute for End-To-End Speech Synthesis

AU - Zhu, Xiaolian

AU - Yang, Shan

AU - Yang, Geng

AU - Xie, Lei

PY - 2019/12

Y1 - 2019/12

N2 - Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional speech synthesis models, and several approaches like global style tokens are proposed to explore the style controllability of the end-to-end model. Although the existing methods show good performance in style disentanglement and transfer, it is still unable to control the explicit emotion of generated speech. In this paper, we mainly focus on the subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector and a continuous simple scalar, respectively. The continuous strength controller is learned by a ranking function according to the relative attribute measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level subtle emotion strength. Experiments show that our method can effectively improve the controllability for an expressive end-to-end model.

AB - Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional speech synthesis models, and several approaches like global style tokens are proposed to explore the style controllability of the end-to-end model. Although the existing methods show good performance in style disentanglement and transfer, it is still unable to control the explicit emotion of generated speech. In this paper, we mainly focus on the subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector and a continuous simple scalar, respectively. The continuous strength controller is learned by a ranking function according to the relative attribute measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level subtle emotion strength. Experiments show that our method can effectively improve the controllability for an expressive end-to-end model.

KW - Emotion strength

KW - end-to-end

KW - relative attributes

KW - speech synthesis

KW - text-to-speech

UR - http://www.scopus.com/inward/record.url?scp=85081596369&partnerID=8YFLogxK

U2 - 10.1109/ASRU46091.2019.9003829

DO - 10.1109/ASRU46091.2019.9003829

M3 - Conference contribution

AN - SCOPUS:85081596369

T3 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

SP - 192

EP - 199

BT - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019

Y2 - 15 December 2019 through 18 December 2019

ER -

Zhu X, Yang S, Yang G, Xie L. Controlling Emotion Strength with Relative Attribute for End-To-End Speech Synthesis. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2019. p. 192-199. 9003829. (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings). doi: 10.1109/ASRU46091.2019.9003829
