VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Yongmao Zhang, Heyang Xue, Hanzhao Li, Lei Xie, Tingwei Guo, Ruixiong Zhang, Caixia Gong

科研成果: 期刊稿件会议文章同行评审

12 引用 (Scopus)

摘要

End-to-end singing voice synthesis (SVS) model VISinger [1] can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: text-to-phase problem, the end-to-end model learns the meaningless mapping of text-to-phase; glitches problem, the harmonic components corresponding to the periodic signal of the voiced segment occurs a sudden change with audible artefacts; low sampling rate, the sampling rate of 24KHz does not meet the application needs of high-fidelity generation with the full-band rate (44.1KHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating the digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP) [2], we incorporate a DSP synthesizer into the decoder to solve the above issues. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer to generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract the latent representation without phase information and avoid the prior encoder modelling text-to-phase mapping. To avoid glitch artefacts, the HiFiGAN is modified to accept the waveforms generated by the DSP synthesizer as a condition to produce the singing voice. Moreover, with the improved waveform decoder, VISinger 2 manages to generate 44.1kHz singing audio with richer expression and better quality. Experiments on OpenCpop corpus [3] show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics. Our audio samples and source code are available.

源语言英语
页(从-至)4444-4448
页数5
期刊Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
2023-August
DOI
出版状态已出版 - 2023
活动24th International Speech Communication Association, Interspeech 2023 - Dublin, 爱尔兰
期限: 20 8月 202324 8月 2023

指纹

探究 'VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer' 的科研主题。它们共同构成独一无二的指纹。

引用此