DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech - A Study Between English and Mandarin

  • Tao Li
  • Chenxu Hu
  • Jian Cong
  • Xinfa Zhu
  • Jingbei Li
  • Qiao Tian
  • Yuping Wang
  • Lei Xie

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

While cross-lingual TTS trained on monolingual corpora has improved significantly in recent years, the generated cross-lingual speech still suffers from a foreign accent, which limits its naturalness. In addition, current cross-lingual methods do not model emotion, which is indispensable paralinguistic information in speech delivery. In this article, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving emotional expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder conditioned on the emotion embedding. To address the weaker emotional expressiveness caused by speaker disentanglement in the emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn a speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling of speaker and emotion in the reverse diffusion process, further improving emotional expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the effectiveness of OP-EDM in learning a speaker-irrelevant but emotion-discriminative embedding.
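
The abstract names an orthogonal projection as the basis of OP-EDM but does not give its formulation. As a rough illustration of the general idea only, the sketch below removes from an emotion vector the component that lies along a speaker vector, leaving a residual that is orthogonal to the speaker direction. The function names, the toy vectors, and the use of a single speaker direction are all assumptions made for illustration; this is not the module described in the article.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(emotion, speaker):
    """Return the part of `emotion` orthogonal to `speaker`.

    Hypothetical helper: subtracts the projection of `emotion`
    onto `speaker`, i.e. e - (<e, s> / <s, s>) * s.
    """
    denom = dot(speaker, speaker)
    if denom == 0.0:
        # A zero speaker vector defines no direction; nothing to remove.
        return list(emotion)
    scale = dot(emotion, speaker) / denom
    return [e - scale * s for e, s in zip(emotion, speaker)]


# Toy vectors (assumed values, not real embeddings).
emotion = [3.0, 4.0, 0.0]
speaker = [1.0, 0.0, 0.0]

residual = project_out(emotion, speaker)
print(residual)                # component along `speaker` removed
print(dot(residual, speaker))  # inner product with `speaker` is zero
```

In this toy case the residual keeps only the part of the emotion vector that carries no information along the speaker direction, which is the intuition behind a "speaker-irrelevant" embedding. How the article keeps that embedding emotion-discriminative is not stated in the abstract.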

Original language: English
Pages (from-to): 3418-3430
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 31
State: Published - 2023

Keywords

  • Cross-lingual
  • diffusion model
  • disentanglement
  • emotion transfer
  • speech synthesis
