TY - JOUR
T1 - Denoising recurrent neural network for deep bidirectional LSTM based voice conversion
AU - Wu, Jie
AU - Huang, Dongyan
AU - Xie, Lei
AU - Li, Haizhou
N1 - Publisher Copyright:
Copyright © 2017 ISCA.
PY - 2017
Y1 - 2017
N2 - The paper studies post-processing in deep bidirectional Long Short-Term Memory (DBLSTM) based voice conversion, where the statistical parameters are optimized to generate speech that exhibits properties similar to the target speech. However, there always exists a residual error between the converted speech and the target speech. We reformulate the residual error problem as speech restoration, which aims to recover the target speech samples from the converted ones. Specifically, we propose a denoising recurrent neural network (DeRNN) that introduces regularization during training to shape the distribution of the converted data in latent space. We compare the proposed approach with global variance (GV), modulation spectrum (MS) and recurrent neural network (RNN) based postfilters, which serve a similar purpose. The subjective test results show that the proposed approach significantly outperforms these conventional approaches in terms of quality and similarity.
AB - The paper studies post-processing in deep bidirectional Long Short-Term Memory (DBLSTM) based voice conversion, where the statistical parameters are optimized to generate speech that exhibits properties similar to the target speech. However, there always exists a residual error between the converted speech and the target speech. We reformulate the residual error problem as speech restoration, which aims to recover the target speech samples from the converted ones. Specifically, we propose a denoising recurrent neural network (DeRNN) that introduces regularization during training to shape the distribution of the converted data in latent space. We compare the proposed approach with global variance (GV), modulation spectrum (MS) and recurrent neural network (RNN) based postfilters, which serve a similar purpose. The subjective test results show that the proposed approach significantly outperforms these conventional approaches in terms of quality and similarity.
KW - Denoising
KW - Gaussian noise
KW - Recurrent neural network
KW - Residual error
KW - Voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85039166058&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2017-694
DO - 10.21437/Interspeech.2017-694
M3 - Conference article
AN - SCOPUS:85039166058
SN - 2308-457X
VL - 2017-August
SP - 3379
EP - 3383
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017
Y2 - 20 August 2017 through 24 August 2017
ER -