DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding

Ziqian Ning; Yuepeng Jiang; Pengcheng Zhu; Jixun Yao; Shuai Wang; Lei Xie; Mengxiao Bi

doi:10.21437/Interspeech.2023-1157

DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding

Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Jixun Yao, Shuai Wang, Lei Xie, Mengxiao Bi

计算机学院

科研成果: 期刊稿件 › 会议文章 › 同行评审

9 引用（Scopus）

摘要

Voice conversion is an increasingly popular technology, and the growing number of real-time applications requires models with streaming conversion capabilities. Unlike typical (non-streaming) voice conversion, which can leverage the entire utterance as full context, streaming voice conversion faces significant challenges due to the missing future information, resulting in degraded intelligibility, speaker similarity, and sound quality. To address this challenge, we propose DualVC, a dual-mode neural voice conversion approach that supports both streaming and non-streaming modes using jointly trained separate network parameters. Furthermore, we propose intra-model knowledge distillation and hybrid predictive coding (HPC) to enhance the performance of streaming conversion. Additionally, we incorporate data augmentation to train a noise-robust autoregressive decoder, improving the model's performance on long-form speech conversion. Experimental results demonstrate that the proposed model outperforms the baseline models in the context of streaming voice conversion, while maintaining comparable performance to the non-streaming topline system that leverages the complete context, albeit with a latency of only 252.8 ms.

源语言	英语
页（从-至）	2063-2067
页数	5
期刊	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
卷	2023-August
DOI	https://doi.org/10.21437/Interspeech.2023-1157
出版状态	已出版 - 2023
活动	24th International Speech Communication Association, Interspeech 2023 - Dublin, 爱尔兰期限: 20 8月 2023 → 24 8月 2023

访问文件

10.21437/Interspeech.2023-1157

其它文件与链接

链接到 Scopus 的出版物

引用此

@article{98a3aed0adde49a0b15e0b73af60ee70,

title = "DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding",

abstract = "Voice conversion is an increasingly popular technology, and the growing number of real-time applications requires models with streaming conversion capabilities. Unlike typical (non-streaming) voice conversion, which can leverage the entire utterance as full context, streaming voice conversion faces significant challenges due to the missing future information, resulting in degraded intelligibility, speaker similarity, and sound quality. To address this challenge, we propose DualVC, a dual-mode neural voice conversion approach that supports both streaming and non-streaming modes using jointly trained separate network parameters. Furthermore, we propose intra-model knowledge distillation and hybrid predictive coding (HPC) to enhance the performance of streaming conversion. Additionally, we incorporate data augmentation to train a noise-robust autoregressive decoder, improving the model's performance on long-form speech conversion. Experimental results demonstrate that the proposed model outperforms the baseline models in the context of streaming voice conversion, while maintaining comparable performance to the non-streaming topline system that leverages the complete context, albeit with a latency of only 252.8 ms.",

keywords = "dual-mode convolution, knowledge distillation, unsupervised representation learning, voice conversion",

author = "Ziqian Ning and Yuepeng Jiang and Pengcheng Zhu and Jixun Yao and Shuai Wang and Lei Xie and Mengxiao Bi",

note = "Publisher Copyright: {\textcopyright} 2023 International Speech Communication Association. All rights reserved.; 24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",

year = "2023",

doi = "10.21437/Interspeech.2023-1157",

language = "英语",

volume = "2023-August",

pages = "2063--2067",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - DualVC

T2 - 24th International Speech Communication Association, Interspeech 2023

AU - Ning, Ziqian

AU - Jiang, Yuepeng

AU - Zhu, Pengcheng

AU - Yao, Jixun

AU - Wang, Shuai

AU - Xie, Lei

AU - Bi, Mengxiao

PY - 2023

Y1 - 2023

N2 - Voice conversion is an increasingly popular technology, and the growing number of real-time applications requires models with streaming conversion capabilities. Unlike typical (non-streaming) voice conversion, which can leverage the entire utterance as full context, streaming voice conversion faces significant challenges due to the missing future information, resulting in degraded intelligibility, speaker similarity, and sound quality. To address this challenge, we propose DualVC, a dual-mode neural voice conversion approach that supports both streaming and non-streaming modes using jointly trained separate network parameters. Furthermore, we propose intra-model knowledge distillation and hybrid predictive coding (HPC) to enhance the performance of streaming conversion. Additionally, we incorporate data augmentation to train a noise-robust autoregressive decoder, improving the model's performance on long-form speech conversion. Experimental results demonstrate that the proposed model outperforms the baseline models in the context of streaming voice conversion, while maintaining comparable performance to the non-streaming topline system that leverages the complete context, albeit with a latency of only 252.8 ms.

AB - Voice conversion is an increasingly popular technology, and the growing number of real-time applications requires models with streaming conversion capabilities. Unlike typical (non-streaming) voice conversion, which can leverage the entire utterance as full context, streaming voice conversion faces significant challenges due to the missing future information, resulting in degraded intelligibility, speaker similarity, and sound quality. To address this challenge, we propose DualVC, a dual-mode neural voice conversion approach that supports both streaming and non-streaming modes using jointly trained separate network parameters. Furthermore, we propose intra-model knowledge distillation and hybrid predictive coding (HPC) to enhance the performance of streaming conversion. Additionally, we incorporate data augmentation to train a noise-robust autoregressive decoder, improving the model's performance on long-form speech conversion. Experimental results demonstrate that the proposed model outperforms the baseline models in the context of streaming voice conversion, while maintaining comparable performance to the non-streaming topline system that leverages the complete context, albeit with a latency of only 252.8 ms.

KW - dual-mode convolution

KW - knowledge distillation

KW - unsupervised representation learning

KW - voice conversion

UR - http://www.scopus.com/inward/record.url?scp=85171591392&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2023-1157

DO - 10.21437/Interspeech.2023-1157

M3 - 会议文章

AN - SCOPUS:85171591392

SN - 2308-457X

VL - 2023-August

SP - 2063

EP - 2067

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Y2 - 20 August 2023 through 24 August 2023

ER -

DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding

摘要

访问文件

其它文件与链接

指纹

引用此