@inproceedings{c8e0653982a24f07a4c612e319ac0325,
  title     = {{AccentSpeech}: Learning Accent from Crowd-sourced Data for Target Speaker {TTS} with Accents},
  abstract  = {Learning accent from crowd-sourced data is a feasible way to achieve a target speaker TTS system that can synthesize accent speech. To this end, there are two challenging problems to be solved. First, direct use of the poor acoustic quality crowd-sourced data and the target speaker data in accent transfer will apparently lead to synthetic speech with degraded quality. To mitigate this problem, we take a bottleneck feature (BN) based TTS approach, in which TTS is decomposed into a Text-to-BN (T2BN) module to learn accent and a BN-to-Mel (BN2Mel) module to learn speaker timbre, where neural network based BN feature serves as the intermediate representation that is robust to noise interference. Second, direct training T2BN using the crowd-sourced data in the two-stage system will produce accent speech of target speaker with poor prosody. This is because the crowd-sourced recordings are contributed from the ordinary unprofessional speakers. To tackle this problem, we update the two-stage approach to a novel three-stage approach, where T2BN and BN2Mel are trained using the high-quality target speaker data and a new BN-to-BN module is plugged in between the two modules to perform accent transfer. To train the BN2BN module, the parallel unaccented and accented BN features are obtained by a proposed data augmentation procedure. Finally the proposed three-stage approach manages to produce accent speech for the target speaker with good prosody, as the prosody pattern is inherited from the professional target speaker and accent transfer is achieved by the BN2BN module at the same time. The proposed approach, named as AccentSpeech, is validated in a Mandarin TTS accent transfer task.},
  keywords  = {accent transfer, text to speech},
  author    = {Zhang, Yongmao and Wang, Zhichao and Yang, Peiji and Sun, Hongshen and Wang, Zhisheng and Xie, Lei},
  editor    = {Lee, {Kong Aik} and Lee, Hung-yi and Lu, Yanfeng and Dong, Minghui},
  booktitle = {2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022},
  series    = {2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022},
  pages     = {76--80},
  publisher = {Institute of Electrical and Electronics Engineers Inc.},
  year      = {2022},
  doi       = {10.1109/ISCSLP57327.2022.10037914},
  language  = {English},
  note      = {Publisher Copyright: {\textcopyright} 2022 IEEE.; 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 ; Conference date: 11-12-2022 Through 14-12-2022},
}