VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Yongmao Zhang; Heyang Xue; Hanzhao Li; Lei Xie; Tingwei Guo; Ruixiong Zhang; Caixia Gong

doi:10.21437/Interspeech.2023-391

VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Yongmao Zhang, Heyang Xue, Hanzhao Li, Lei Xie, Tingwei Guo, Ruixiong Zhang, Caixia Gong

School of Computer Science

Research output: Contribution to journal › Conference article › peer-review

12 Scopus citations

Abstract

End-to-end singing voice synthesis (SVS) model VISinger [1] can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: text-to-phase problem, the end-to-end model learns the meaningless mapping of text-to-phase; glitches problem, the harmonic components corresponding to the periodic signal of the voiced segment occurs a sudden change with audible artefacts; low sampling rate, the sampling rate of 24KHz does not meet the application needs of high-fidelity generation with the full-band rate (44.1KHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating the digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP) [2], we incorporate a DSP synthesizer into the decoder to solve the above issues. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer to generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract the latent representation without phase information and avoid the prior encoder modelling text-to-phase mapping. To avoid glitch artefacts, the HiFiGAN is modified to accept the waveforms generated by the DSP synthesizer as a condition to produce the singing voice. Moreover, with the improved waveform decoder, VISinger 2 manages to generate 44.1kHz singing audio with richer expression and better quality. Experiments on OpenCpop corpus [3] show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics. Our audio samples and source code are available.

Original language	English
Pages (from-to)	4444-4448
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2023-August
DOIs	https://doi.org/10.21437/Interspeech.2023-391
State	Published - 2023
Event	24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland Duration: 20 Aug 2023 → 24 Aug 2023

Keywords

adversarial learning
Singing voice synthesis
variational autoencoder

Access to Document

10.21437/Interspeech.2023-391

Cite this

@article{ca621735b2b04ddbb02f67af5a2c473c,

title = "VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer",

abstract = "End-to-end singing voice synthesis (SVS) model VISinger [1] can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: text-to-phase problem, the end-to-end model learns the meaningless mapping of text-to-phase; glitches problem, the harmonic components corresponding to the periodic signal of the voiced segment occurs a sudden change with audible artefacts; low sampling rate, the sampling rate of 24KHz does not meet the application needs of high-fidelity generation with the full-band rate (44.1KHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating the digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP) [2], we incorporate a DSP synthesizer into the decoder to solve the above issues. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer to generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract the latent representation without phase information and avoid the prior encoder modelling text-to-phase mapping. To avoid glitch artefacts, the HiFiGAN is modified to accept the waveforms generated by the DSP synthesizer as a condition to produce the singing voice. Moreover, with the improved waveform decoder, VISinger 2 manages to generate 44.1kHz singing audio with richer expression and better quality. Experiments on OpenCpop corpus [3] show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics. Our audio samples and source code are available.",

keywords = "adversarial learning, Singing voice synthesis, variational autoencoder",

author = "Yongmao Zhang and Heyang Xue and Hanzhao Li and Lei Xie and Tingwei Guo and Ruixiong Zhang and Caixia Gong",

note = "Publisher Copyright: {\textcopyright} 2023 International Speech Communication Association. All rights reserved.; 24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",

year = "2023",

doi = "10.21437/Interspeech.2023-391",

language = "英语",

volume = "2023-August",

pages = "4444--4448",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer. / Zhang, Yongmao; Xue, Heyang; Li, Hanzhao et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, p. 4444-4448.

Research output: Contribution to journal › Conference article › peer-review

TY - JOUR

T1 - VISinger 2

T2 - 24th International Speech Communication Association, Interspeech 2023

AU - Zhang, Yongmao

AU - Xue, Heyang

AU - Li, Hanzhao

AU - Xie, Lei

AU - Guo, Tingwei

AU - Zhang, Ruixiong

AU - Gong, Caixia

PY - 2023

Y1 - 2023

N2 - End-to-end singing voice synthesis (SVS) model VISinger [1] can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: text-to-phase problem, the end-to-end model learns the meaningless mapping of text-to-phase; glitches problem, the harmonic components corresponding to the periodic signal of the voiced segment occurs a sudden change with audible artefacts; low sampling rate, the sampling rate of 24KHz does not meet the application needs of high-fidelity generation with the full-band rate (44.1KHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating the digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP) [2], we incorporate a DSP synthesizer into the decoder to solve the above issues. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer to generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract the latent representation without phase information and avoid the prior encoder modelling text-to-phase mapping. To avoid glitch artefacts, the HiFiGAN is modified to accept the waveforms generated by the DSP synthesizer as a condition to produce the singing voice. Moreover, with the improved waveform decoder, VISinger 2 manages to generate 44.1kHz singing audio with richer expression and better quality. Experiments on OpenCpop corpus [3] show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics. Our audio samples and source code are available.

AB - End-to-end singing voice synthesis (SVS) model VISinger [1] can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: text-to-phase problem, the end-to-end model learns the meaningless mapping of text-to-phase; glitches problem, the harmonic components corresponding to the periodic signal of the voiced segment occurs a sudden change with audible artefacts; low sampling rate, the sampling rate of 24KHz does not meet the application needs of high-fidelity generation with the full-band rate (44.1KHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating the digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP) [2], we incorporate a DSP synthesizer into the decoder to solve the above issues. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer to generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract the latent representation without phase information and avoid the prior encoder modelling text-to-phase mapping. To avoid glitch artefacts, the HiFiGAN is modified to accept the waveforms generated by the DSP synthesizer as a condition to produce the singing voice. Moreover, with the improved waveform decoder, VISinger 2 manages to generate 44.1kHz singing audio with richer expression and better quality. Experiments on OpenCpop corpus [3] show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics. Our audio samples and source code are available.

KW - adversarial learning

KW - Singing voice synthesis

KW - variational autoencoder

UR - http://www.scopus.com/inward/record.url?scp=85171595594&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2023-391

DO - 10.21437/Interspeech.2023-391

M3 - 会议文章

AN - SCOPUS:85171595594

SN - 2308-457X

VL - 2023-August

SP - 4444

EP - 4448

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Y2 - 20 August 2023 through 24 August 2023

ER -

VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this