TY - JOUR
T1 - DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion
T2 - 25th Interspeech Conference 2024
AU - Ning, Ziqian
AU - Wang, Shuai
AU - Zhu, Pengcheng
AU - Wang, Zhichao
AU - Yao, Jixun
AU - Xie, Lei
AU - Bi, Mengxiao
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180 ms. Nonetheless, the recognition-synthesis framework hinders end-to-end optimization, and the instability of the automatic speech recognition (ASR) model with short chunks makes it challenging to further reduce latency. To address these issues, we propose an end-to-end model, DualVC 3. With speaker-independent semantic tokens to guide the training of the content encoder, the dependency on ASR is removed and the model can operate on extremely small chunks, with cascading errors eliminated. A language model is trained on the content encoder output to produce pseudo context by iteratively predicting future frames, providing more contextual information for the decoder to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves comparable performance to DualVC 2 in subjective and objective metrics, with a latency of only 50 ms. We have made our audio samples publicly available.
AB - Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180 ms. Nonetheless, the recognition-synthesis framework hinders end-to-end optimization, and the instability of the automatic speech recognition (ASR) model with short chunks makes it challenging to further reduce latency. To address these issues, we propose an end-to-end model, DualVC 3. With speaker-independent semantic tokens to guide the training of the content encoder, the dependency on ASR is removed and the model can operate on extremely small chunks, with cascading errors eliminated. A language model is trained on the content encoder output to produce pseudo context by iteratively predicting future frames, providing more contextual information for the decoder to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves comparable performance to DualVC 2 in subjective and objective metrics, with a latency of only 50 ms. We have made our audio samples publicly available.
KW - end-to-end
KW - knowledge distillation
KW - language model
KW - self-supervised learning
KW - streaming voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85208637733&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-1857
DO - 10.21437/Interspeech.2024-1857
M3 - Conference article
AN - SCOPUS:85208637733
SN - 2308-457X
SP - 197
EP - 201
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 1 September 2024 through 5 September 2024
ER -