TY - GEN
T1 - Boosting Multi-Speaker Expressive Speech Synthesis with Semi-Supervised Contrastive Learning
AU - Zhu, Xinfa
AU - Li, Yuke
AU - Lei, Yi
AU - Jiang, Ning
AU - Zhao, Guoqing
AU - Xie, Lei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning at different levels, i.e., the utterance and category levels, is leveraged to extract disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.
KW - contrastive learning
KW - emotion transfer
KW - expressive speech synthesis
KW - semi-supervised
KW - style transfer
UR - http://www.scopus.com/inward/record.url?scp=85206564173&partnerID=8YFLogxK
DO - 10.1109/ICME57554.2024.10688322
M3 - Conference contribution
AN - SCOPUS:85206564173
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
PB - IEEE Computer Society
T2 - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Y2 - 15 July 2024 through 19 July 2024
ER -