Abstract
Emotional voice conversion aims at converting speech from one emotional state to another. This paper proposes to model timbre and prosody features using a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. A continuous wavelet transform (CWT) representation of the fundamental frequency (F0) and the energy contour is used for prosody modeling. Specifically, we use CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence conversion method with the DBLSTM model, which captures both frame-wise and long-range relationships between the source and target voice. The converted speech signals are evaluated both objectively and subjectively, which confirms the effectiveness of the proposed method.
| Original language | English |
| ---|--- |
| Pages (from-to) | 2453-2457 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Volume | 08-12-September-2016 |
| DOIs | |
| State | Published - 2016 |
| Event | 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016, San Francisco, United States. Duration: 8 Sep 2016 → 12 Sep 2016 |
Keywords
- Long short-term memory
- Prosody
- Recurrent neural networks
- Voice conversion