TY - GEN
T1 - Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis
AU - Li, Yuke
AU - Zhu, Xinfa
AU - Lei, Yi
AU - Li, Hai
AU - Liu, Junhui
AU - Xie, Danming
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces two challenges: unnatural foreign accents and difficulty in modeling the shared emotional expressions of different languages. Building on the DelightfulTTS [1] neural architecture, this paper addresses these challenges by introducing specifically designed modules that model language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module [2] to improve the naturalness of the synthetic cross-lingual speech. The shared emotional expression across languages is extracted from HuBERT, a pre-trained self-supervised model with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across different languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bilingual emotional speech for a monolingual target speaker without emotional training data. Speech samples: https://ykli22.github.io/ZSET/
AB - Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces two challenges: unnatural foreign accents and difficulty in modeling the shared emotional expressions of different languages. Building on the DelightfulTTS [1] neural architecture, this paper addresses these challenges by introducing specifically designed modules that model language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module [2] to improve the naturalness of the synthetic cross-lingual speech. The shared emotional expression across languages is extracted from HuBERT, a pre-trained self-supervised model with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across different languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bilingual emotional speech for a monolingual target speaker without emotional training data. Speech samples: https://ykli22.github.io/ZSET/
KW - Emotional speech synthesis
KW - Multi-lingual speech synthesis
KW - Text-to-speech
KW - Zero-shot cross-lingual emotion transfer
UR - http://www.scopus.com/inward/record.url?scp=85184659134&partnerID=8YFLogxK
U2 - 10.1109/ASRU57964.2023.10389638
DO - 10.1109/ASRU57964.2023.10389638
M3 - Conference contribution
AN - SCOPUS:85184659134
T3 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
BT - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Y2 - 16 December 2023 through 20 December 2023
ER -