PROMPTVC: FLEXIBLE STYLISTIC VOICE CONVERSION IN LATENT SPACE DRIVEN BY NATURAL LANGUAGE PROMPTS

Jixun Yao, Yuguang Yang, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, Jingjing Yin, Hongbin Zhou, Heng Lu, Lei Xie

Research output: Contribution to journal › Conference article › peer-review

8 Scopus citations

Abstract

Stylistic voice conversion aims to transform the style of source speech into a desired style according to real-world application demands. However, current stylistic voice conversion approaches rely on pre-defined labels or reference speech to control the conversion process, which either limits style diversity or yields style representations that are neither intuitive nor interpretable. In this study, we propose PromptVC, a novel stylistic voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, a style encoder extracts the style vector during training, and the latent diffusion model is then trained independently to sample the style vector from noise, conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete tokens and replace them with the corresponding K-Means center embeddings to serve as the linguistic content, which minimizes residual style information. Additionally, we deduplicate repeated discrete tokens and employ a differentiable duration predictor to re-predict the duration of each token, which adapts the duration of the same linguistic content to different styles. Subjective and objective evaluation results demonstrate the effectiveness of the proposed system.
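
The content path described in the abstract (discrete tokens, K-Means center lookup, deduplication, per-token durations) can be illustrated with a small sketch. This is not the authors' implementation. It assumes that deduplication means collapsing consecutive repeats of the same token, it uses a hypothetical toy codebook `centers` in place of real K-Means centers fitted on HuBERT features, and it uses the observed run lengths as durations where the paper uses a learned, differentiable duration predictor. The function names are illustrative only.

```python
from itertools import groupby


def deduplicate(tokens):
    """Collapse runs of consecutive identical tokens.

    Returns the deduplicated token sequence and the length of each run.
    Assumption: "deduplicate" means merging consecutive repeats only.
    """
    units, durations = [], []
    for token, run in groupby(tokens):
        units.append(token)
        durations.append(sum(1 for _ in run))
    return units, durations


def lookup_centers(units, centers):
    """Replace each token id with its cluster-center embedding."""
    return [centers[u] for u in units]


def expand(embeddings, durations):
    """Repeat each embedding by its duration to get a frame-level sequence.

    In this sketch the durations are the original run lengths; a learned
    duration predictor would supply style-dependent values instead.
    """
    frames = []
    for emb, d in zip(embeddings, durations):
        frames.extend([emb] * d)
    return frames


# Hypothetical frame-level token ids, standing in for HuBERT + K-Means output.
frame_tokens = [5, 5, 5, 2, 2, 7, 5, 5]

# Hypothetical 2-dimensional cluster centers keyed by token id.
centers = {2: [0.1, 0.2], 5: [0.3, 0.4], 7: [0.5, 0.6]}

units, durations = deduplicate(frame_tokens)
content = lookup_centers(units, centers)
frames = expand(content, durations)
```

Here `units` holds one entry per run of identical tokens and `durations` holds the matching run lengths, so expanding with the original durations recovers a sequence of the original frame count. Substituting different durations changes the timing of the same linguistic content, which is the role the abstract assigns to the duration predictor.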

Original language: English
Pages (from-to): 10571-10575
Number of pages: 5
Journal: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
State: Published - 2024
Event: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of
Duration: 14 Apr 2024 to 19 Apr 2024

Keywords

  • Voice conversion
  • latent diffusion
  • natural language prompts
