ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION

Zhichao Wang; Qicong Xie; Tao Li; Hongqiang Du; Lei Xie; Pengcheng Zhu; Mengxiao Bi

doi:10.1109/ICASSP43922.2022.9746405

ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION

Zhichao Wang, Qicong Xie, Tao Li, Hongqiang Du, Lei Xie, Pengcheng Zhu, Mengxiao Bi

计算机学院

科研成果: 书/报告/会议事项章节 › 会议稿件 › 同行评审

10 引用（Scopus）

摘要

One-shot style transfer is a challenging task, since training on one utterance makes model extremely easy to over-fit to training data and causes low speaker similarity and lack of expressiveness. In this paper, we build on the recognition-synthesis framework and propose a one-shot voice conversion approach for style transfer based on speaker adaptation. First, a speaker normalization module is adopted to remove speaker-related information in bottleneck features extracted by ASR. Second, we adopt weight regularization in the adaptation process to prevent over-fitting caused by using only one utterance from target speaker as training data. Finally, to comprehensively decouple the speech factors, i.e., content, speaker, style, and transfer source style to the target, a prosody module is used to extract prosody representation. Experiments show that our approach is superior to the state-of-the-art one-shot VC systems in terms of style and speaker similarity; additionally, our approach also maintains good speech quality.

源语言	英语
主期刊名	2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
出版商	Institute of Electrical and Electronics Engineers Inc.
页	6792-6796
页数	5
ISBN（电子版）	9781665405409
DOI	https://doi.org/10.1109/ICASSP43922.2022.9746405
出版状态	已出版 - 2022
活动	2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 - Hybrid, 新加坡期限: 22 5月 2022 → 27 5月 2022

出版系列

姓名	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
卷	2022-May
ISSN（印刷版）	1520-6149

会议

会议	2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
国家/地区	新加坡
市	Hybrid
时期	22/05/22 → 27/05/22

访问文件

10.1109/ICASSP43922.2022.9746405

其它文件与链接

链接到 Scopus 的出版物

引用此

Wang, Z., Xie, Q., Li, T., Du, H., Xie, L., Zhu, P., & Bi, M. (2022). ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION. 在 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings (页码 6792-6796). (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; 卷 2022-May). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICASSP43922.2022.9746405

Wang, Zhichao ; Xie, Qicong ; Li, Tao 等. / ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION. 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2022. 页码 6792-6796 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{9331d61f7b204311bba3feb91713a11e,

title = "ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION",

abstract = "One-shot style transfer is a challenging task, since training on one utterance makes model extremely easy to over-fit to training data and causes low speaker similarity and lack of expressiveness. In this paper, we build on the recognition-synthesis framework and propose a one-shot voice conversion approach for style transfer based on speaker adaptation. First, a speaker normalization module is adopted to remove speaker-related information in bottleneck features extracted by ASR. Second, we adopt weight regularization in the adaptation process to prevent over-fitting caused by using only one utterance from target speaker as training data. Finally, to comprehensively decouple the speech factors, i.e., content, speaker, style, and transfer source style to the target, a prosody module is used to extract prosody representation. Experiments show that our approach is superior to the state-of-the-art one-shot VC systems in terms of style and speaker similarity; additionally, our approach also maintains good speech quality.",

keywords = "adaptation, one-shot, over-fit, style transfer, voice conversion",

author = "Zhichao Wang and Qicong Xie and Tao Li and Hongqiang Du and Lei Xie and Pengcheng Zhu and Mengxiao Bi",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE; 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 ; Conference date: 22-05-2022 Through 27-05-2022",

year = "2022",

doi = "10.1109/ICASSP43922.2022.9746405",

language = "英语",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "6792--6796",

booktitle = "2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings",

}

Wang, Z, Xie, Q, Li, T, Du, H, Xie, L, Zhu, P & Bi, M 2022, ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION. 在 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 卷 2022-May, Institute of Electrical and Electronics Engineers Inc., 页码 6792-6796, 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Hybrid, 新加坡, 22/05/22. https://doi.org/10.1109/ICASSP43922.2022.9746405

ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION. / Wang, Zhichao; Xie, Qicong; Li, Tao 等.
2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2022. 页码 6792-6796 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; 卷 2022-May).

科研成果: 书/报告/会议事项章节 › 会议稿件 › 同行评审

TY - GEN

T1 - ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION

AU - Wang, Zhichao

AU - Xie, Qicong

AU - Li, Tao

AU - Du, Hongqiang

AU - Xie, Lei

AU - Zhu, Pengcheng

AU - Bi, Mengxiao

PY - 2022

Y1 - 2022

N2 - One-shot style transfer is a challenging task, since training on one utterance makes model extremely easy to over-fit to training data and causes low speaker similarity and lack of expressiveness. In this paper, we build on the recognition-synthesis framework and propose a one-shot voice conversion approach for style transfer based on speaker adaptation. First, a speaker normalization module is adopted to remove speaker-related information in bottleneck features extracted by ASR. Second, we adopt weight regularization in the adaptation process to prevent over-fitting caused by using only one utterance from target speaker as training data. Finally, to comprehensively decouple the speech factors, i.e., content, speaker, style, and transfer source style to the target, a prosody module is used to extract prosody representation. Experiments show that our approach is superior to the state-of-the-art one-shot VC systems in terms of style and speaker similarity; additionally, our approach also maintains good speech quality.

AB - One-shot style transfer is a challenging task, since training on one utterance makes model extremely easy to over-fit to training data and causes low speaker similarity and lack of expressiveness. In this paper, we build on the recognition-synthesis framework and propose a one-shot voice conversion approach for style transfer based on speaker adaptation. First, a speaker normalization module is adopted to remove speaker-related information in bottleneck features extracted by ASR. Second, we adopt weight regularization in the adaptation process to prevent over-fitting caused by using only one utterance from target speaker as training data. Finally, to comprehensively decouple the speech factors, i.e., content, speaker, style, and transfer source style to the target, a prosody module is used to extract prosody representation. Experiments show that our approach is superior to the state-of-the-art one-shot VC systems in terms of style and speaker similarity; additionally, our approach also maintains good speech quality.

KW - adaptation

KW - one-shot

KW - over-fit

KW - style transfer

KW - voice conversion

UR - http://www.scopus.com/inward/record.url?scp=85131254208&partnerID=8YFLogxK

U2 - 10.1109/ICASSP43922.2022.9746405

DO - 10.1109/ICASSP43922.2022.9746405

M3 - 会议稿件

AN - SCOPUS:85131254208

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 6792

EP - 6796

BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022

Y2 - 22 May 2022 through 27 May 2022

ER -

Wang Z, Xie Q, Li T, Du H, Xie L, Zhu P 等. ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION. 在 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2022. 页码 6792-6796. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP43922.2022.9746405

ONE-SHOT VOICE CONVERSION FOR STYLE TRANSFER BASED ON SPEAKER ADAPTATION

摘要

出版系列

会议

访问文件

其它文件与链接

指纹

引用此