
XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation

  • Tianlun Zuo
  • Jingbin Hu
  • Yuke Li
  • Xinfa Zhu
  • Hai Li
  • Ying Yan
  • Junhui Liu
  • Danming Xie
  • Lei Xie
  • Northwestern Polytechnical University Xian
  • IQIYI Inc

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Zero-shot emotion transfer in cross-lingual speech synthesis refers to generating speech in a target language whose emotion is taken from reference speech in a different source language. This task remains challenging due to the scarcity of parallel multilingual emotional corpora, the presence of foreign accent artifacts, and the difficulty of separating emotion from language-specific prosodic features. In this paper, we propose XEmoRAG, a novel framework that enables zero-shot emotion transfer from Chinese to Thai using a model based on a large language model (LLM), without relying on parallel emotional data. XEmoRAG extracts language-agnostic emotional embeddings from Chinese speech and retrieves emotionally matched Thai utterances from a curated emotional database, enabling controllable emotion transfer without explicit emotion labels. Additionally, a flow-matching alignment module minimizes pitch and duration mismatches, ensuring natural prosody. The module also blends Chinese timbre into the Thai synthesis, enhancing rhythmic accuracy and emotional expression while preserving speaker characteristics and emotional consistency. Experimental results show that XEmoRAG synthesizes expressive and natural Thai speech using only Chinese reference audio, without requiring explicit emotion labels. These results highlight XEmoRAG's capability to achieve flexible, low-resource emotion transfer across languages. Our demo is available at https://tlzuo-lesley.github.io/Demo-page/.
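
The retrieval step described in the abstract (matching a Chinese emotional embedding against a database of Thai utterances) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the similarity measure, embedding dimension, or database layout, so cosine similarity, the 3-dimensional toy vectors, the utterance IDs, and the function names `cosine_similarity` and `retrieve_top_k` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, database, k=1):
    """Return the k (utterance_id, score) pairs most similar to the query embedding."""
    scored = [(cosine_similarity(query, emb), utt_id) for utt_id, emb in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(utt_id, score) for score, utt_id in scored[:k]]


# Hypothetical toy database: Thai utterance IDs mapped to made-up emotional embeddings.
thai_db = {
    "th_happy_001": [0.9, 0.1, 0.0],
    "th_sad_001": [0.0, 0.2, 0.9],
    "th_angry_001": [0.1, 0.9, 0.1],
}

# Hypothetical embedding extracted from a Chinese reference utterance.
zh_query = [0.8, 0.2, 0.1]

matches = retrieve_top_k(zh_query, thai_db, k=2)
for utt_id, score in matches:
    print(f"{utt_id}: {score:.3f}")
```

In this toy setup the query lies closest to `th_happy_001`, so that utterance ranks first. No emotion label is consulted at any point; the ranking depends only on embedding similarity, which mirrors the label-free retrieval the abstract describes.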

Original language: English
Title of host publication: ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331544263
DOIs
State: Published - 2025
Event: 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025 - Honolulu, United States
Duration: 6 Dec 2025 to 10 Dec 2025

Publication series

Name: ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop

Conference

Conference: 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025
Country/Territory: United States
City: Honolulu
Period: 6/12/25 to 10/12/25

Keywords

  • cross-lingual speech synthesis
  • emotion transfer from Chinese to Thai
  • zero-shot emotion transfer
