
DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

  • Ziqian Ning
  • Shuai Wang
  • Pengcheng Zhu
  • Zhichao Wang
  • Jixun Yao
  • Lei Xie
  • Mengxiao Bi
  • Northwestern Polytechnical University Xian
  • NetEase Games AI Lab
  • The Chinese University of Hong Kong, Shenzhen

Research output: Contribution to journal > Conference article > peer-review

3 Scopus citations

Abstract

Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180 ms. Nonetheless, its recognition-synthesis framework hinders end-to-end optimization, and the instability of the automatic speech recognition (ASR) model on short chunks makes it challenging to reduce latency further. To address these issues, we propose an end-to-end model, DualVC 3. With speaker-independent semantic tokens guiding the training of the content encoder, the dependency on ASR is removed and the model can operate on extremely small chunks, with cascading errors eliminated. A language model is trained on the content encoder output to produce pseudo context by iteratively predicting future frames, providing more contextual information for the decoder to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves performance comparable to DualVC 2 in subjective and objective metrics, with a latency of only 50 ms. We have made our audio samples publicly available.
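
The idea of producing pseudo context by iteratively predicting future frames can be illustrated with a toy sketch. This is not the DualVC 3 implementation: it assumes the content encoder output has been reduced to discrete integer tokens, and it substitutes a simple bigram counter for the trained language model. All function names (`train_bigram`, `predict_next`, `pseudo_context`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def train_bigram(sequences: Sequence[Sequence[int]]) -> Dict[int, Counter]:
    """Count next-token frequencies (toy stand-in for a trained language model)."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts: Dict[int, Counter], history: Sequence[int]) -> int:
    """Predict the most frequent successor of the last frame in `history`."""
    if not history:
        raise ValueError("history must contain at least one frame")
    last = history[-1]
    if last not in counts:
        # Assumption: with no statistics available, repeat the last frame.
        return last
    # Highest count wins; ties are broken by the smaller token id.
    return min(counts[last].items(), key=lambda kv: (-kv[1], kv[0]))[0]


def pseudo_context(
    counts: Dict[int, Counter], real_frames: Sequence[int], num_future: int
) -> List[int]:
    """Iteratively predict `num_future` frames, feeding each prediction back in."""
    extended = list(real_frames)
    for _ in range(num_future):
        extended.append(predict_next(counts, extended))
    # Only the predicted (pseudo) frames are returned.
    return extended[len(real_frames):]


if __name__ == "__main__":
    model = train_bigram([[1, 2, 3, 4], [1, 2, 3, 5], [2, 3, 4, 1]])
    chunk = [1, 2]  # frames actually received in the current streaming chunk
    future = pseudo_context(model, chunk, num_future=3)
    print(chunk + future)  # real frames followed by predicted pseudo context
```

In a streaming setting, a decoder would see the real chunk followed by the predicted frames, so it gets look-ahead context without waiting for future audio. In this sketch each prediction is fed back as input to the next step, which is what makes the procedure iterative.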

Original language: English
Pages (from-to): 197-201
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 → 5 Sep 2024

Keywords

  • end-to-end
  • knowledge distillation
  • language model
  • self-supervised learning
  • streaming voice conversion
