TY - JOUR
T1 - METTS
T2 - Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer
AU - Zhu, Xinfa
AU - Lei, Yi
AU - Li, Tao
AU - Zhang, Yongmao
AU - Zhou, Hongbin
AU - Lu, Heng
AU - Xie, Lei
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2024
Y1 - 2024
N2 - Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored synthesizing emotional aspects of speech due to the challenges of cross-speaker cross-lingual emotion transfer: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotional expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and proposes the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal for better disentanglement of speaker timbre in the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of the proposed designs in METTS.
AB - Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored synthesizing emotional aspects of speech due to the challenges of cross-speaker cross-lingual emotion transfer: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotional expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and proposes the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal for better disentanglement of speaker timbre in the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of the proposed designs in METTS.
KW - Cross-lingual
KW - disentanglement
KW - emotion transfer
KW - speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85184828021&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2024.3363444
DO - 10.1109/TASLP.2024.3363444
M3 - Article
AN - SCOPUS:85184828021
SN - 2329-9290
VL - 32
SP - 1506
EP - 1518
JO - IEEE/ACM Transactions on Audio Speech and Language Processing
JF - IEEE/ACM Transactions on Audio Speech and Language Processing
ER -