U-Style: Cascading U-Nets with Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning

Tao Li; Zhichao Wang; Xinfa Zhu; Jian Cong; Qiao Tian; Yuping Wang; Lei Xie

doi:10.1109/TASLP.2024.3453606

U-Style: Cascading U-Nets with Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning

Tao Li, Zhichao Wang, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie

计算机学院

科研成果: 期刊稿件 › 文章 › 同行评审

1 引用（Scopus）

摘要

Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.

源语言	英语
页（从-至）	4026-4035
页数	10
期刊	IEEE/ACM Transactions on Audio Speech and Language Processing
卷	32
DOI	https://doi.org/10.1109/TASLP.2024.3453606
出版状态	已出版 - 2024

访问文件

10.1109/TASLP.2024.3453606

其它文件与链接

链接到 Scopus 的出版物

引用此

@article{5100869a398543f3a6c3a6ae6ab0ebc5,

title = "U-Style: Cascading U-Nets with Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning",

abstract = "Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.",

keywords = "disentangling, speaker cloning, Speech synthesis, style cloning, U-net, zero-shot",

author = "Tao Li and Zhichao Wang and Xinfa Zhu and Jian Cong and Qiao Tian and Yuping Wang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2014 IEEE.",

year = "2024",

doi = "10.1109/TASLP.2024.3453606",

language = "英语",

volume = "32",

pages = "4026--4035",

journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",

issn = "2329-9290",

publisher = "IEEE Advancing Technology for Humanity",

}

TY - JOUR

T1 - U-Style

T2 - Cascading U-Nets with Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning

AU - Li, Tao

AU - Wang, Zhichao

AU - Zhu, Xinfa

AU - Cong, Jian

AU - Tian, Qiao

AU - Wang, Yuping

AU - Xie, Lei

PY - 2024

Y1 - 2024

N2 - Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.

AB - Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.

KW - disentangling

KW - speaker cloning

KW - Speech synthesis

KW - style cloning

KW - U-net

KW - zero-shot

UR - http://www.scopus.com/inward/record.url?scp=85203495878&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2024.3453606

DO - 10.1109/TASLP.2024.3453606

M3 - 文章

AN - SCOPUS:85203495878

SN - 2329-9290

VL - 32

SP - 4026

EP - 4035

JO - IEEE/ACM Transactions on Audio Speech and Language Processing

JF - IEEE/ACM Transactions on Audio Speech and Language Processing

ER -

U-Style: Cascading U-Nets with Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning

摘要

访问文件

其它文件与链接

指纹

引用此