TY - JOUR
T1 - Delivering Speaking Style in Low-Resource Voice Conversion with Multi-Factor Constraints
AU - Wang, Zhichao
AU - Wang, Xinsheng
AU - Xie, Lei
AU - Chen, Yuanzhe
AU - Tian, Qiao
AU - Wang, Yuping
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Conveying the linguistic content and maintaining the source speech's speaking style, such as intonation and emotion, is essential in voice conversion (VC). However, in a low-resource situation, where only limited utterances from the target speaker are accessible, it is difficult for existing VC methods to meet this requirement and capture the target speaker's timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, a speaker timbre constraint generated by a clustering method is newly proposed to guide target speaker timbre learning at different stages. Meanwhile, to prevent over-fitting to the target speaker's limited data, perceptual regularization constraints explicitly maintain model performance on specific aspects, including speaking style, linguistic content, and speech quality. Besides, a simulation mode is introduced to simulate the inference process to alleviate the mismatch between training and inference. Extensive experiments performed on highly expressive speech demonstrate the superiority of the proposed method in low-resource VC.
AB - Conveying the linguistic content and maintaining the source speech's speaking style, such as intonation and emotion, is essential in voice conversion (VC). However, in a low-resource situation, where only limited utterances from the target speaker are accessible, it is difficult for existing VC methods to meet this requirement and capture the target speaker's timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, a speaker timbre constraint generated by a clustering method is newly proposed to guide target speaker timbre learning at different stages. Meanwhile, to prevent over-fitting to the target speaker's limited data, perceptual regularization constraints explicitly maintain model performance on specific aspects, including speaking style, linguistic content, and speech quality. Besides, a simulation mode is introduced to simulate the inference process to alleviate the mismatch between training and inference. Extensive experiments performed on highly expressive speech demonstrate the superiority of the proposed method in low-resource VC.
KW - contrastive learning
KW - low resource
KW - speaker adaptation
KW - speaking style
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85180609968&partnerID=8YFLogxK
U2 - 10.1109/ICASSP49357.2023.10096471
DO - 10.1109/ICASSP49357.2023.10096471
M3 - Conference article
AN - SCOPUS:85180609968
SN - 1520-6149
JO - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
JF - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2 - 4 June 2023 through 10 June 2023
ER -