Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis

Tao Li, Xinsheng Wang, Qicong Xie, Zhichao Wang, Mingqi Jiang, Lei Xie

Research output: Contribution to journal › Conference article › peer-review

Abstract

Cross-speaker emotion transfer speech synthesis aims to synthesize emotional speech for a target speaker by transferring the emotion from reference speech recorded by another (source) speaker. In this task, extracting a speaker-independent emotion embedding from the reference speech plays an important role. However, the emotional information conveyed by such an emotion embedding tends to be weakened in the process of squeezing out the source speaker's timbre information. In response to this problem, a prosody compensation module (PCM) is proposed in this paper to compensate for this loss of emotional information. Specifically, the PCM tries to obtain speaker-independent emotional information from the intermediate features of a pre-trained ASR model. To this end, a prosody compensation encoder with global context (GC) blocks is introduced to extract global emotional information from the ASR model's intermediate features. Experiments demonstrate that the proposed PCM effectively compensates the emotion embedding for the lost emotional information while maintaining the timbre of the target speaker. Comparisons with state-of-the-art models show that the proposed method is clearly superior on the cross-speaker emotion transfer task.
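The abstract does not spell out the PCM internals, so the following PyTorch sketch only illustrates the general idea it describes: a stack of global context (GC) blocks, in the style of GCNet, that pools intermediate features from a frozen pre-trained ASR model into a single global prosody compensation vector. All class names, dimensions, the number of blocks, and the mean-pooling head are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of a prosody compensation encoder built from
# global context (GC) blocks; details are assumptions, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCBlock1D(nn.Module):
    """Global context block over sequence features of shape (B, C, T)."""

    def __init__(self, channels: int, bottleneck_ratio: float = 0.25):
        super().__init__()
        hidden = max(1, int(channels * bottleneck_ratio))
        # 1x1 conv yields one attention logit per frame.
        self.attn = nn.Conv1d(channels, 1, kernel_size=1)
        # Bottleneck transform of the pooled global context vector.
        self.transform = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1]),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.attn(x), dim=-1)               # (B, 1, T)
        context = torch.sum(x * weights, dim=-1, keepdim=True)  # (B, C, 1)
        return x + self.transform(context)                      # broadcast add


class ProsodyCompensationEncoder(nn.Module):
    """Stacks GC blocks, then pools a fixed-size compensation vector."""

    def __init__(self, channels: int = 256, num_blocks: int = 2, out_dim: int = 128):
        super().__init__()
        self.blocks = nn.Sequential(*[GCBlock1D(channels) for _ in range(num_blocks)])
        self.proj = nn.Linear(channels, out_dim)

    def forward(self, asr_feats: torch.Tensor) -> torch.Tensor:
        # asr_feats: (B, C, T) intermediate features from a frozen ASR model.
        h = self.blocks(asr_feats)
        return self.proj(h.mean(dim=-1))  # (B, out_dim) global embedding


# Dummy usage: the pooled vector would be combined with the emotion
# embedding inside the TTS model to compensate for lost emotional detail.
feats = torch.randn(4, 256, 120)
emb = ProsodyCompensationEncoder()(feats)
print(emb.shape)  # torch.Size([4, 128])
```

Since the ASR model is trained to discard speaker identity in favor of content, its intermediate features are a plausible source of speaker-independent prosodic cues, which is why pooling them globally (rather than frame-by-frame) fits the emotion-transfer setting.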

Original language: English
Pages (from-to): 5498-5502
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
State: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, Incheon, Korea, Republic of
Duration: 18 Sep 2022 - 22 Sep 2022

Keywords

  • computational paralinguistics
  • human-computer interaction
  • speech recognition
