Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis

Yuke Li, Xinfa Zhu, Yi Lei, Hai Li, Junhui Liu, Danming Xie, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Scopus citations

Abstract

Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces two challenges: unnatural foreign accents and the difficulty of modeling the emotional expressions shared across languages. Building on the DelightfulTTS [1] neural architecture, this paper addresses these challenges by introducing specifically designed modules to model the language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module [2] to improve the naturalness of the synthetic cross-lingual speech. The shared emotional expression between different languages is extracted from HuBERT, a pre-trained self-supervised model with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across different languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bilingual emotional speech for the monolingual target speaker without emotional training data. Speech samples: https://ykli22.github.io/ZSET/

Original language: English
Title of host publication: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350306897
State: Published - 2023
Event: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 - Taipei, Taiwan, Province of China
Duration: 16 Dec 2023 to 20 Dec 2023

Publication series

Name: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

Conference

Conference: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Country/Territory: Taiwan, Province of China
City: Taipei
Period: 16/12/23 to 20/12/23

Keywords

  • Emotional speech synthesis
  • Multi-lingual speech synthesis
  • Text-to-speech
  • Zero-shot cross-lingual emotion transfer
