TY - GEN
T1 - XEmoRAG
T2 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025
AU - Zuo, Tianlun
AU - Hu, Jingbin
AU - Li, Yuke
AU - Zhu, Xinfa
AU - Li, Hai
AU - Yan, Ying
AU - Liu, Junhui
AU - Xie, Danming
AU - Xie, Lei
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Zero-shot emotion transfer in cross-lingual speech synthesis refers to generating speech in a target language, where the emotion is expressed based on reference speech from a different source language. However, this task remains challenging due to the scarcity of parallel multilingual emotional corpora, the presence of foreign accent artifacts, and the difficulty of separating emotion from language-specific prosodic features. In this paper, we propose XEmoRAG, a novel framework to enable zero-shot emotion transfer from Chinese to Thai using a large language model (LLM)-based model, without relying on parallel emotional data. XEmoRAG extracts language-agnostic emotional embeddings from Chinese speech and retrieves emotionally matched Thai utterances from a curated emotional database, enabling controllable emotion transfer without explicit emotion labels. Additionally, a flow-matching alignment module minimizes pitch and duration mismatches, ensuring natural prosody. It also blends Chinese timbre into the Thai synthesis, enhancing rhythmic accuracy and emotional expression, while preserving speaker characteristics and emotional consistency. Experimental results show that XEmoRAG synthesizes expressive and natural Thai speech using only Chinese reference audio, without requiring explicit emotion labels. These results highlight XEmoRAG's capability to achieve flexible and low-resource emotional transfer across languages. Our demo is available at https://tlzuo-lesley.github.io/Demo-page/.
KW - cross-lingual speech synthesis
KW - emotion transfer from Chinese to Thai
KW - zero-shot emotion transfer
UR - https://www.scopus.com/pages/publications/105036586363
U2 - 10.1109/ASRU65441.2025.11434682
DO - 10.1109/ASRU65441.2025.11434682
M3 - Conference contribution
AN - SCOPUS:105036586363
T3 - ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop
BT - ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 6 December 2025 through 10 December 2025
ER -