TY - GEN
T1 - Boosting Multi-Speaker Expressive Speech Synthesis with Semi-Supervised Contrastive Learning
AU - Zhu, Xinfa
AU - Li, Yuke
AU - Lei, Yi
AU - Jiang, Ning
AU - Zhao, Guoqing
AU - Xie, Lei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning at different levels, i.e., the utterance and category levels, is leveraged to extract disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.
KW - contrastive learning
KW - emotion transfer
KW - expressive speech synthesis
KW - semi-supervised
KW - style transfer
UR - http://www.scopus.com/inward/record.url?scp=85206564173&partnerID=8YFLogxK
DO - 10.1109/ICME57554.2024.10688322
M3 - Conference contribution
AN - SCOPUS:85206564173
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
PB - IEEE Computer Society
T2 - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Y2 - 15 July 2024 through 19 July 2024
ER -