TY - GEN
T1 - Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder
AU - Xue, Heyang
AU - Zhang, Xiao
AU - Wu, Jie
AU - Luan, Jian
AU - Wang, Yujun
AU - Xie, Lei
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/10/18
Y1 - 2021/10/18
AB - Generating high-quality singing voice usually depends on a sizable studio-level singing corpus, which is difficult and expensive to collect. In contrast, plenty of singing voice data can be found on the Internet. However, such found singing data may be mixed with accompaniments or contaminated by environmental noise due to recording conditions. In this paper, we propose a noise-robust singing voice synthesizer that incorporates a Gaussian Mixture Variational Autoencoder (GMVAE) as the noise encoder to handle different noise conditions, generating clean singing voice from lyrics for a target speaker. Specifically, the proposed synthesizer learns a multi-modal latent representation of various noise conditions in a continuous space, without using an auxiliary noise classifier for noise representation learning or clean reference audio during the inference stage. Experiments show that the proposed synthesizer can generate clean, high-quality singing voice for a target speaker, with a MOS close to that of singing voice reconstructed from the ground-truth mel-spectrogram with a Griffin-Lim vocoder. Experiments also show the robustness of our approach under complex noise conditions.
KW - Found data
KW - Gaussian mixture variational autoencoder
KW - Singing voice synthesis
UR - http://www.scopus.com/inward/record.url?scp=85122252592&partnerID=8YFLogxK
U2 - 10.1145/3461615.3491115
DO - 10.1145/3461615.3491115
M3 - Conference contribution
AN - SCOPUS:85122252592
T3 - ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction
SP - 131
EP - 136
BT - ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction
PB - Association for Computing Machinery, Inc
T2 - 23rd ACM International Conference on Multimodal Interaction, ICMI 2021
Y2 - 18 October 2021 through 22 October 2021
ER -