Boosting Multi-Speaker Expressive Speech Synthesis with Semi-Supervised Contrastive Learning

Xinfa Zhu, Yuke Li, Yi Lei, Ning Jiang, Guoqing Zhao, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.
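To make the two-level contrastive objective in the abstract more concrete, the sketch below shows one plausible formulation: an InfoNCE-style utterance-level loss over two views of each utterance, plus a supervised contrastive category-level loss applied only to style/emotion-labeled data. The function names, loss forms, and temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def utterance_level_nce(anchor, positive, temperature=0.1):
    """Utterance-level contrast (assumed InfoNCE form): two views of the same
    utterance are pulled together, other utterances in the batch pushed away."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def category_level_supcon(embeddings, labels, temperature=0.1):
    """Category-level contrast (assumed supervised-contrastive form): embeddings
    sharing the same style/emotion label are positives; labeled subset only."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                          # (B, B)
    mask_pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    mask_pos.fill_diagonal_(0)                             # exclude self-pairs
    logits_mask = torch.ones_like(sim).fill_diagonal_(0)
    exp_sim = torch.exp(sim) * logits_mask
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-8)
    denom = mask_pos.sum(dim=1).clamp(min=1)               # avoid division by zero
    return -(mask_pos * log_prob).sum(dim=1).div(denom).mean()

# Semi-supervised usage (assumed): unlabeled utterances contribute only the
# utterance-level term; labeled ones additionally contribute the category-level term.
B, D = 8, 256
view_a, view_b = torch.randn(B, D), torch.randn(B, D)
emotion_labels = torch.randint(0, 4, (B,))
loss = utterance_level_nce(view_a, view_b) + category_level_supcon(view_a, emotion_labels)
```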

Original language: English
Title of host publication: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Publisher: IEEE Computer Society
ISBN (electronic): 9798350390155
DOI
Publication status: Published - 2024
Event: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024 - Niagara Falls, Canada
Duration: 15 Jul 2024 – 19 Jul 2024

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (print): 1945-7871
ISSN (electronic): 1945-788X

Conference

Conference: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Country/Territory: Canada
City: Niagara Falls
Period: 15/07/24 – 19/07/24
