Abstract
Emotional voice conversion aims at converting speech from one emotional state to another. This paper proposes to model timbre and prosody features using a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. A continuous wavelet transform (CWT) representation of the fundamental frequency (F0) and the energy contour is used for prosody modeling. Specifically, we use CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence conversion method with the DBLSTM model, which captures both frame-wise and long-range relationships between the source and target voice. The converted speech signals are evaluated both objectively and subjectively, which confirms the effectiveness of the proposed method.
| Original language | English |
| ---|--- |
| Pages (from-to) | 2453-2457 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Volume | 08-12-September-2016 |
| DOIs | |
| State | Published - 2016 |
| Event | 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016, San Francisco, United States. Duration: 8 Sep 2016 → 12 Sep 2016 |
Keywords
- Long short-term memory
- Prosody
- Recurrent neural networks
- Voice conversion