TY - JOUR
T1 - Factorized WaveNet for voice conversion with limited data
AU - Du, Hongqiang
AU - Tian, Xiaohai
AU - Xie, Lei
AU - Li, Haizhou
N1 - Publisher Copyright: © 2021
PY - 2021/6
Y1 - 2021/6
AB - WaveNet was introduced for waveform generation. It produces high-quality output in text-to-speech synthesis, music generation, and voice conversion. However, it generally requires a large amount of training data, which limits its scope of applications, e.g. in voice conversion. In this paper, we propose a factorized WaveNet for limited-data tasks. Specifically, we apply singular value decomposition (SVD) to the dilated convolution layers of WaveNet to reduce the number of parameters. By doing so, we reduce the data requirement for WaveNet training while maintaining similar network performance. We use voice conversion as a case study to validate the proposed idea. Two sets of experiments are conducted, in which WaveNet is used as a vocoder and as an integrated converter–vocoder, respectively. Experiments on the CMU-ARCTIC and CSTR-VCTK corpora show that the factorized WaveNet consistently outperforms its original WaveNet counterpart when using the same amount of training data. We also apply SVD in the same way to the real-time neural vocoder Parallel WaveGAN for voice conversion, and observe a similar improvement.
KW - Parallel WaveGAN
KW - Singular value decomposition
KW - Voice conversion
KW - WaveNet
UR - https://www.scopus.com/pages/publications/85104428157
U2 - 10.1016/j.specom.2021.03.003
DO - 10.1016/j.specom.2021.03.003
M3 - Article
AN - SCOPUS:85104428157
SN - 0167-6393
VL - 130
SP - 45
EP - 54
JO - Speech Communication
JF - Speech Communication
ER -