Abstract
This paper studies post-processing in deep bidirectional Long Short-Term Memory (DBLSTM) based voice conversion, where statistical parameters are optimized to generate speech with properties similar to the target speech. However, residual error always remains between the converted speech and the target speech. We reformulate this residual error problem as speech restoration, which aims to recover the target speech samples from the converted ones. Specifically, we propose a denoising recurrent neural network (DeRNN) that introduces regularization during training to shape the distribution of the converted data in latent space. We compare the proposed approach with global variance (GV), modulation spectrum (MS) and recurrent neural network (RNN) based postfilters, which serve a similar purpose. Subjective test results show that the proposed approach significantly outperforms these conventional approaches in terms of quality and similarity.
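As an illustration of the general idea of a denoising recurrent postfilter, the sketch below trains a small bidirectional LSTM to map converted (corrupted) spectral feature sequences back toward target features. This is not the authors' DeRNN implementation; the feature dimension, layer sizes, Gaussian corruption level, and training data are all placeholder assumptions.

```python
# Minimal sketch of a denoising recurrent postfilter (assumed setup, not the paper's code).
import torch
import torch.nn as nn

class DenoisingRNNPostfilter(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=128):
        super().__init__()
        # Bidirectional LSTM over the feature sequence, followed by a linear projection
        # back to the original feature dimension.
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x):
        h, _ = self.rnn(x)   # (batch, time, 2 * hidden_dim)
        return self.out(h)   # restored feature sequence

model = DenoisingRNNPostfilter()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder data: "converted" features play the role of the noisy input,
# "target" features are the restoration objective.
converted = torch.randn(8, 100, 40)
target = torch.randn(8, 100, 40)

for _ in range(10):
    # Additive Gaussian corruption acts as the denoising regularizer during training.
    noisy = converted + 0.05 * torch.randn_like(converted)
    loss = nn.functional.mse_loss(model(noisy), target)
    optim.zero_grad()
    loss.backward()
    optim.step()
```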
| Original language | English |
| --- | --- |
| Pages (from-to) | 3379-3383 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Volume | 2017-August |
| DOIs | |
| State | Published - 2017 |
| Event | 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden. Duration: 20 Aug 2017 → 24 Aug 2017 |
Keywords
- Denoising
- Gaussian noise
- Recurrent neural network
- Residual error
- Voice conversion