Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios

Qicong Xie; Tao Li; Xinsheng Wang; Zhichao Wang; Lei Xie; Guoqiao Yu; Guanglu Wan

doi:10.1109/ISCSLP57327.2022.10038056

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

In the existing cross-speaker style transfer task, a source speaker with multi-style recordings is necessary to provide the style for a target speaker. However, it is hard for one speaker to express all expected styles. In this paper, a more general task, which is to produce expressive speech by combining any styles and timbres from a multi-speaker corpus in which each speaker has a unique style, is proposed. To realize this task, a novel method is proposed. This method is a Tacotron2-based framework but with a fine-grained text-based prosody predicting module and a speaker identity controller. Experiments demonstrate that the proposed method can successfully express a style of one speaker with the timbre of another speaker, bypassing the dependency on a single speaker's multi-style corpus. Moreover, the explicit prosody features used in the prosody predicting module can increase the diversity of synthetic speech by adjusting the value of prosody features.

Original language	English
Title of host publication	2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022
Editors	Kong Aik Lee, Hung-yi Lee, Yanfeng Lu, Minghui Dong
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	66-70
Number of pages	5
ISBN (Electronic)	9798350397963
DOIs	https://doi.org/10.1109/ISCSLP57327.2022.10038056
State	Published - 2022
Event	13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 - Singapore, Singapore Duration: 11 Dec 2022 → 14 Dec 2022

Publication series

Name	2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

Conference

Conference	13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022
Country/Territory	Singapore
City	Singapore
Period	11/12/22 → 14/12/22

Keywords

multi-speaker
multi-style
speech synthesis

Access to Document

10.1109/ISCSLP57327.2022.10038056

Cite this

Xie, Q., Li, T., Wang, X., Wang, Z., Xie, L., Yu, G., & Wan, G. (2022). Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios. In K. A. Lee, H. Lee, Y. Lu, & M. Dong (Eds.), 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 (pp. 66-70). (2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ISCSLP57327.2022.10038056

Xie, Qicong ; Li, Tao ; Wang, Xinsheng et al. / Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios. 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022. editor / Kong Aik Lee ; Hung-yi Lee ; Yanfeng Lu ; Minghui Dong. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 66-70 (2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022).

@inproceedings{a03bdafe8fcd43299ea028cdf8d90cdc,

title = "Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios",

abstract = "In the existing cross-speaker style transfer task, a source speaker with multi-style recordings is necessary to provide the style for a target speaker. However, it is hard for one speaker to express all expected styles. In this paper, a more general task, which is to produce expressive speech by combining any styles and timbres from a multi-speaker corpus in which each speaker has a unique style, is proposed. To realize this task, a novel method is proposed. This method is a Tacotron2-based framework but with a fine-grained text-based prosody predicting module and a speaker identity controller. Experiments demonstrate that the proposed method can successfully express a style of one speaker with the timbre of another speaker, bypassing the dependency on a single speaker's multi-style corpus. Moreover, the explicit prosody features used in the prosody predicting module can increase the diversity of synthetic speech by adjusting the value of prosody features.",

keywords = "multi-speaker, multi-style, speech synthesis",

author = "Qicong Xie and Tao Li and Xinsheng Wang and Zhichao Wang and Lei Xie and Guoqiao Yu and Guanglu Wan",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 ; Conference date: 11-12-2022 Through 14-12-2022",

year = "2022",

doi = "10.1109/ISCSLP57327.2022.10038056",

language = "English",

series = "2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "66--70",

editor = "Lee, {Kong Aik} and Hung-yi Lee and Yanfeng Lu and Minghui Dong",

booktitle = "2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022",

}

Xie, Q, Li, T, Wang, X, Wang, Z, Xie, L, Yu, G & Wan, G 2022, Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios. in KA Lee, H Lee, Y Lu & M Dong (eds), 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022. 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022, Institute of Electrical and Electronics Engineers Inc., pp. 66-70, 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022, Singapore, Singapore, 11/12/22. https://doi.org/10.1109/ISCSLP57327.2022.10038056

Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios. / Xie, Qicong; Li, Tao; Wang, Xinsheng et al.
2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022. ed. / Kong Aik Lee; Hung-yi Lee; Yanfeng Lu; Minghui Dong. Institute of Electrical and Electronics Engineers Inc., 2022. p. 66-70 (2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022).

TY - GEN

T1 - Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios

AU - Xie, Qicong

AU - Li, Tao

AU - Wang, Xinsheng

AU - Wang, Zhichao

AU - Xie, Lei

AU - Yu, Guoqiao

AU - Wan, Guanglu

PY - 2022

Y1 - 2022

N2 - In the existing cross-speaker style transfer task, a source speaker with multi-style recordings is necessary to provide the style for a target speaker. However, it is hard for one speaker to express all expected styles. In this paper, a more general task, which is to produce expressive speech by combining any styles and timbres from a multi-speaker corpus in which each speaker has a unique style, is proposed. To realize this task, a novel method is proposed. This method is a Tacotron2-based framework but with a fine-grained text-based prosody predicting module and a speaker identity controller. Experiments demonstrate that the proposed method can successfully express a style of one speaker with the timbre of another speaker, bypassing the dependency on a single speaker's multi-style corpus. Moreover, the explicit prosody features used in the prosody predicting module can increase the diversity of synthetic speech by adjusting the value of prosody features.

AB - In the existing cross-speaker style transfer task, a source speaker with multi-style recordings is necessary to provide the style for a target speaker. However, it is hard for one speaker to express all expected styles. In this paper, a more general task, which is to produce expressive speech by combining any styles and timbres from a multi-speaker corpus in which each speaker has a unique style, is proposed. To realize this task, a novel method is proposed. This method is a Tacotron2-based framework but with a fine-grained text-based prosody predicting module and a speaker identity controller. Experiments demonstrate that the proposed method can successfully express a style of one speaker with the timbre of another speaker, bypassing the dependency on a single speaker's multi-style corpus. Moreover, the explicit prosody features used in the prosody predicting module can increase the diversity of synthetic speech by adjusting the value of prosody features.

KW - multi-speaker

KW - multi-style

KW - speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=85148585466&partnerID=8YFLogxK

U2 - 10.1109/ISCSLP57327.2022.10038056

DO - 10.1109/ISCSLP57327.2022.10038056

M3 - Conference contribution

AN - SCOPUS:85148585466

T3 - 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

SP - 66

EP - 70

BT - 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

A2 - Lee, Kong Aik

A2 - Lee, Hung-yi

A2 - Lu, Yanfeng

A2 - Dong, Minghui

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

Y2 - 11 December 2022 through 14 December 2022

ER -

Xie Q, Li T, Wang X, Wang Z, Xie L, Yu G et al. Multi-speaker Multi-style Text-to-speech Synthesis with Single-speaker Single-style Training Data Scenarios. In Lee KA, Lee H, Lu Y, Dong M, editors, 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022. Institute of Electrical and Electronics Engineers Inc. 2022. p. 66-70. (2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022). doi: 10.1109/ISCSLP57327.2022.10038056
