Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

Xinfa Zhu; Yi Lei; Kun Song; Yongmao Zhang; Tao Li; Lei Xie

doi:10.1109/ICASSP49357.2023.10095776

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

Xinfa Zhu, Yi Lei, Kun Song, Yongmao Zhang, Tao Li, Lei Xie

School of Computer Science

Northwestern Polytechnical University Xian

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Scopus citations

Abstract

This paper aims to synthesize the target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and- emotion-to-wave (SE2Wave) module, bridging by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style and emotion) decoupling problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer the fine-grained expression from references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the good design of our model.

Original language	English
Title of host publication	ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9781728163277
DOIs	https://doi.org/10.1109/ICASSP49357.2023.10095776
State	Published - 2023
Event	48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece Duration: 4 Jun 2023 → 10 Jun 2023

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2023-June
ISSN (Print)	1520-6149

Conference

Conference	48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Country/Territory	Greece
City	Rhodes Island
Period	4/06/23 → 10/06/23

Keywords

emotion transfer
Expressive speech synthesis
multiple factors decoupling
style transfer
two-stage

Access to Document

10.1109/ICASSP49357.2023.10095776

Cite this

Zhu, X., Lei, Y., Song, K., Zhang, Y., Li, T., & Xie, L. (2023). Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICASSP49357.2023.10095776

Zhu, Xinfa ; Lei, Yi ; Song, Kun et al. / Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. Institute of Electrical and Electronics Engineers Inc., 2023. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{e0a6bfefbc5e49bfbd96fd032e59bdd0,

title = "Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling",

abstract = "This paper aims to synthesize the target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and- emotion-to-wave (SE2Wave) module, bridging by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style and emotion) decoupling problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer the fine-grained expression from references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the good design of our model.",

keywords = "emotion transfer, Expressive speech synthesis, multiple factors decoupling, style transfer, two-stage",

author = "Xinfa Zhu and Yi Lei and Kun Song and Yongmao Zhang and Tao Li and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 ; Conference date: 04-06-2023 Through 10-06-2023",

year = "2023",

doi = "10.1109/ICASSP49357.2023.10095776",

language = "英语",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings",

}

Zhu, X, Lei, Y, Song, K, Zhang, Y, Li, T & Xie, L 2023, Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling. in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2023-June, Institute of Electrical and Electronics Engineers Inc., 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, 4/06/23. https://doi.org/10.1109/ICASSP49357.2023.10095776

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling. / Zhu, Xinfa; Lei, Yi; Song, Kun et al.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. Institute of Electrical and Electronics Engineers Inc., 2023. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

TY - GEN

T1 - Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

AU - Zhu, Xinfa

AU - Lei, Yi

AU - Song, Kun

AU - Zhang, Yongmao

AU - Li, Tao

AU - Xie, Lei

PY - 2023

Y1 - 2023

N2 - This paper aims to synthesize the target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and- emotion-to-wave (SE2Wave) module, bridging by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style and emotion) decoupling problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer the fine-grained expression from references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the good design of our model.

AB - This paper aims to synthesize the target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and- emotion-to-wave (SE2Wave) module, bridging by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style and emotion) decoupling problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer the fine-grained expression from references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the good design of our model.

KW - emotion transfer

KW - Expressive speech synthesis

KW - multiple factors decoupling

KW - style transfer

KW - two-stage

UR - http://www.scopus.com/inward/record.url?scp=85168207415&partnerID=8YFLogxK

U2 - 10.1109/ICASSP49357.2023.10095776

DO - 10.1109/ICASSP49357.2023.10095776

M3 - 会议稿件

AN - SCOPUS:85168207415

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

BT - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Y2 - 4 June 2023 through 10 June 2023

ER -

Zhu X, Lei Y, Song K, Zhang Y, Li T, Xie L. Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings. Institute of Electrical and Electronics Engineers Inc. 2023. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP49357.2023.10095776

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

Abstract

Publication series

Conference

Keywords

Access to Document

Other files and links

Fingerprint

Cite this