Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder

Heyang Xue; Xiao Zhang; Jie Wu; Jian Luan; Yujun Wang; Lei Xie

doi:10.1145/3461615.3491115

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Generating high-quality singing voice usually depends on a sizable studio-level singing corpus, which is difficult and expensive to collect. In contrast, plenty of singing voice data can be found on the Internet. However, such found data may be mixed with accompaniment or contaminated by environmental noise due to recording conditions. In this paper, we propose a noise-robust singing voice synthesizer that incorporates a Gaussian Mixture Variational Autoencoder (GMVAE) as the noise encoder to handle different noise conditions, generating clean singing voice from lyrics for a target speaker. Specifically, the proposed synthesizer learns a multi-modal latent noise representation of various noise conditions in a continuous space, requiring neither an auxiliary noise classifier for noise representation learning nor clean reference audio during the inference stage. Experiments show that the proposed synthesizer can generate clean, high-quality singing voice for the target speaker, with a MOS close to that of singing voice reconstructed from ground-truth mel-spectrograms with a Griffin-Lim vocoder. Experiments also show the robustness of our approach under complex noise conditions.

Original language	English
Title of host publication	ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction
Publisher	Association for Computing Machinery, Inc
Pages	131-136
Number of pages	6
ISBN (Electronic)	9781450384711
DOIs	https://doi.org/10.1145/3461615.3491115
State	Published - 18 Oct 2021
Event	23rd ACM International Conference on Multimodal Interaction, ICMI 2021 - Virtual, Online, Canada Duration: 18 Oct 2021 → 22 Oct 2021

Publication series

Name	ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction

Conference

Conference	23rd ACM International Conference on Multimodal Interaction, ICMI 2021
Country/Territory	Canada
City	Virtual, Online
Period	18/10/21 → 22/10/21

Keywords

Found data
Gaussian mixture variational autoencoder
Singing voice synthesis

Cite this

Xue, H., Zhang, X., Wu, J., Luan, J., Wang, Y., & Xie, L. (2021). Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder. In ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction (pp. 131-136). (ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction). Association for Computing Machinery, Inc. https://doi.org/10.1145/3461615.3491115

Xue, Heyang ; Zhang, Xiao ; Wu, Jie et al. / Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder. ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction. Association for Computing Machinery, Inc, 2021. pp. 131-136 (ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction).

@inproceedings{97d53361e5634df28d80a4349ce66f87,

title = "Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder",

abstract = "Generating high-quality singing voice usually depends on a sizable studio-level singing corpus which is difficult and expensive to collect. In contrast, there is plenty of singing voice data that can be found on the Internet. However, the found singing data may be mixed by accompaniments or contaminated by environmental noises due to recording conditions. In this paper, we propose a noise robust singing voice synthesizer which incorporates Gaussian Mixture Variational Autoencoder (GMVAE) as the noise encoder to handle different noise conditions, generating clean singing voice from lyrics for target speaker. Specifically, the proposed synthesizer learns a multi-modal latent noise representation of various noise conditions in a continuous space without the use of an auxiliary noise classifier for noise representation learning or clean reference audio during the inference stage. Experiments show that the proposed synthesizer can generate clean and high-quality singing voice for target speaker with MOS close to reconstructed singing voice from ground truth mel-spectrogram with Griffin-Lim vocoder. Experiments also show the robustness of our approach under complex noise conditions.",

keywords = "Found data, Gaussian mixture variational autoencoder, Singing voice synthesis",

author = "Heyang Xue and Xiao Zhang and Jie Wu and Jian Luan and Yujun Wang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2021 ACM.; 23rd ACM International Conference on Multimodal Interaction, ICMI 2021 ; Conference date: 18-10-2021 Through 22-10-2021",

year = "2021",

month = oct,

day = "18",

doi = "10.1145/3461615.3491115",

language = "English",

series = "ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction",

publisher = "Association for Computing Machinery, Inc",

pages = "131--136",

booktitle = "ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction",

}

Xue, H, Zhang, X, Wu, J, Luan, J, Wang, Y & Xie, L 2021, Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder. in ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction. ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction, Association for Computing Machinery, Inc, pp. 131-136, 23rd ACM International Conference on Multimodal Interaction, ICMI 2021, Virtual, Online, Canada, 18/10/21. https://doi.org/10.1145/3461615.3491115

Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder. / Xue, Heyang; Zhang, Xiao; Wu, Jie et al.
ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction. Association for Computing Machinery, Inc, 2021. p. 131-136 (ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction).

TY - GEN

T1 - Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder

AU - Xue, Heyang

AU - Zhang, Xiao

AU - Wu, Jie

AU - Luan, Jian

AU - Wang, Yujun

AU - Xie, Lei

PY - 2021/10/18

Y1 - 2021/10/18

N2 - Generating high-quality singing voice usually depends on a sizable studio-level singing corpus which is difficult and expensive to collect. In contrast, there is plenty of singing voice data that can be found on the Internet. However, the found singing data may be mixed by accompaniments or contaminated by environmental noises due to recording conditions. In this paper, we propose a noise robust singing voice synthesizer which incorporates Gaussian Mixture Variational Autoencoder (GMVAE) as the noise encoder to handle different noise conditions, generating clean singing voice from lyrics for target speaker. Specifically, the proposed synthesizer learns a multi-modal latent noise representation of various noise conditions in a continuous space without the use of an auxiliary noise classifier for noise representation learning or clean reference audio during the inference stage. Experiments show that the proposed synthesizer can generate clean and high-quality singing voice for target speaker with MOS close to reconstructed singing voice from ground truth mel-spectrogram with Griffin-Lim vocoder. Experiments also show the robustness of our approach under complex noise conditions.

AB - Generating high-quality singing voice usually depends on a sizable studio-level singing corpus which is difficult and expensive to collect. In contrast, there is plenty of singing voice data that can be found on the Internet. However, the found singing data may be mixed by accompaniments or contaminated by environmental noises due to recording conditions. In this paper, we propose a noise robust singing voice synthesizer which incorporates Gaussian Mixture Variational Autoencoder (GMVAE) as the noise encoder to handle different noise conditions, generating clean singing voice from lyrics for target speaker. Specifically, the proposed synthesizer learns a multi-modal latent noise representation of various noise conditions in a continuous space without the use of an auxiliary noise classifier for noise representation learning or clean reference audio during the inference stage. Experiments show that the proposed synthesizer can generate clean and high-quality singing voice for target speaker with MOS close to reconstructed singing voice from ground truth mel-spectrogram with Griffin-Lim vocoder. Experiments also show the robustness of our approach under complex noise conditions.

KW - Found data

KW - Gaussian mixture variational autoencoder

KW - Singing voice synthesis

UR - http://www.scopus.com/inward/record.url?scp=85122252592&partnerID=8YFLogxK

U2 - 10.1145/3461615.3491115

DO - 10.1145/3461615.3491115

M3 - Conference contribution

AN - SCOPUS:85122252592

T3 - ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction

SP - 131

EP - 136

BT - ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction

PB - Association for Computing Machinery, Inc

T2 - 23rd ACM International Conference on Multimodal Interaction, ICMI 2021

Y2 - 18 October 2021 through 22 October 2021

ER -

Xue H, Zhang X, Wu J, Luan J, Wang Y, Xie L. Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder. In ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction. Association for Computing Machinery, Inc. 2021. p. 131-136. (ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction). doi: 10.1145/3461615.3491115
