Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion

Huaiping Ming, Dongyan Huang, Lei Xie, Jie Wu, Minghui Dong, Haizhou Li

Research output: Contribution to journal › Conference article › peer-review

75 Scopus citations

Abstract

Emotional voice conversion aims to convert speech from one emotional state to another. This paper proposes to model timbre and prosody features with a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. Continuous wavelet transform (CWT) representations of the fundamental frequency (F0) and energy contour are used for prosody modeling. Specifically, we use the CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence conversion method with the DBLSTM model, which captures both frame-wise and long-range relationships between the source and target voices. The converted speech signals are evaluated both objectively and subjectively, and the results confirm the effectiveness of the proposed method.
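The abstract describes two building blocks: a CWT decomposition of F0 and energy contours into a small number of temporal scales, and a deep bidirectional LSTM that maps source feature sequences to target feature sequences. The sketch below illustrates these ideas only; the wavelet choice, dyadic scale spacing, layer sizes, and feature dimensions are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch: multi-scale CWT prosody features + a deep bidirectional LSTM
# regressor, loosely following the pipeline described in the abstract.
import numpy as np
import pywt
import torch
import torch.nn as nn


def cwt_decompose(contour: np.ndarray, num_scales: int) -> np.ndarray:
    """Decompose a 1-D contour (e.g. an interpolated log-F0 track) into
    `num_scales` CWT scales using a Mexican-hat mother wavelet (assumption)."""
    scales = 2.0 ** np.arange(1, num_scales + 1)       # dyadic scale spacing (assumption)
    coeffs, _ = pywt.cwt(contour, scales, "mexh")      # coeffs: (num_scales, T)
    return coeffs.T                                    # (T, num_scales) per-frame features


class DBLSTMConverter(nn.Module):
    """Deep bidirectional LSTM mapping source frame features to target frame features."""

    def __init__(self, feat_dim: int, hidden: int = 256, layers: int = 3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)    # project back to feature dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, T, feat_dim)
        out, _ = self.blstm(x)
        return self.proj(out)


if __name__ == "__main__":
    T = 200
    f0_cwt = cwt_decompose(np.random.rand(T), num_scales=5)       # 5-scale F0 representation
    energy_cwt = cwt_decompose(np.random.rand(T), num_scales=10)  # 10-scale energy representation

    # Concatenate spectral features (40-dim placeholders here) with prosody features.
    spec = np.random.rand(T, 40)
    feats = np.concatenate([spec, f0_cwt, energy_cwt], axis=1).astype(np.float32)

    model = DBLSTMConverter(feat_dim=feats.shape[1])
    converted = model(torch.from_numpy(feats).unsqueeze(0))       # (1, T, 55)
    print(converted.shape)
```

In this sketch the spectral and prosody features are simply concatenated per frame and regressed in one pass, which mirrors the joint conversion of spectrum and prosody mentioned in the abstract; training would additionally require time-aligned source-target feature pairs.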

Original language: English
Pages (from-to): 2453-2457
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 08-12-September-2016
DOIs
State: Published - 2016
Event: 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 - San Francisco, United States
Duration: 8 Sep 2016 - 16 Sep 2016

Keywords

  • Long short-term memory
  • Prosody
  • Recurrent neural networks
  • Voice conversion
