Improving performance of seen and unseen speech style transfer in end-to-end neural TTS

Xiaochun An; Frank K. Soong; Lei Xie

doi:10.21437/Interspeech.2021-1407

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

2 Citations (Scopus)

Abstract

End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, with disjoint, multi-style datasets, i.e., datasets of different styles are recorded, with each style recorded by a single speaker in multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighted sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to “fool” a well-trained discriminator; 3) a style distortion loss to measure the expected style loss after the transfer; 4) a cycle consistency loss to preserve the speaker identity of the source after the transfer. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. The performance of the new approach is better and more robust than that of four prior-art baseline systems.
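
The four-term objective described in the abstract can be written schematically as below. This is only a sketch: the symbol names and the weights \lambda are assumptions introduced here for illustration, and the abstract does not state the weight values or the exact form of any individual loss.

```latex
% Hypothetical notation: the abstract names the four losses but not their symbols.
% L_rec: reconstruction loss (source and target reconstructions)
% L_adv: adversarial loss against the discriminator
% L_sty: style distortion loss after the transfer
% L_cyc: cycle consistency loss (preserves source speaker identity)
\mathcal{L}_{\text{total}}
  = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}}
  + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}
  + \lambda_{\text{sty}}\,\mathcal{L}_{\text{sty}}
  + \lambda_{\text{cyc}}\,\mathcal{L}_{\text{cyc}}
% The weights \lambda are assumed to be non-negative scalars chosen as hyperparameters.
```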

Original language	English
Title of host publication	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher	International Speech Communication Association
Pages	3466-3470
Number of pages	5
ISBN (Electronic)	9781713836902
DOI	https://doi.org/10.21437/Interspeech.2021-1407
Publication status	Published - 2021
Event	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic. Duration: 30 Aug 2021 → 3 Sept 2021

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	5
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory	Czech Republic
City	Brno
Period	30/08/21 → 3/09/21

Cite this

An, X., Soong, F. K., & Xie, L. (2021). Improving performance of seen and unseen speech style transfer in end-to-end neural TTS. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 (pp. 3466-3470). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 5). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2021-1407

An, Xiaochun; Soong, Frank K.; Xie, Lei. / Improving performance of seen and unseen speech style transfer in end-to-end neural TTS. 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 3466-3470 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH).

@inproceedings{49549785d47e428da1b208c3d6d2f95c,

title = "Improving performance of seen and unseen speech style transfer in end-to-end neural TTS",

abstract = "End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, with disjoint, multi-style datasets, i.e., datasets of different styles are recorded, with each style recorded by a single speaker in multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighted sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to “fool” a well-trained discriminator; 3) a style distortion loss to measure the expected style loss after the transfer; 4) a cycle consistency loss to preserve the speaker identity of the source after the transfer. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. The performance of the new approach is better and more robust than that of four prior-art baseline systems.",

keywords = "Cycle consistency, Disjoint datasets, Neural TTS, Style distortion, Style transfer",

author = "Xiaochun An and Soong, {Frank K.} and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2021 ISCA; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021; Conference date: 30-08-2021 through 03-09-2021",

year = "2021",

doi = "10.21437/Interspeech.2021-1407",

language = "English",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "3466--3470",

booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021",

}

An, X, Soong, FK & Xie, L 2021, Improving performance of seen and unseen speech style transfer in end-to-end neural TTS. in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 5, International Speech Communication Association, pp. 3466-3470, 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic, 30/08/21. https://doi.org/10.21437/Interspeech.2021-1407

Improving performance of seen and unseen speech style transfer in end-to-end neural TTS. / An, Xiaochun; Soong, Frank K.; Xie, Lei.
22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 3466-3470 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 5).

TY - GEN

T1 - Improving performance of seen and unseen speech style transfer in end-to-end neural TTS

AU - An, Xiaochun

AU - Soong, Frank K.

AU - Xie, Lei

PY - 2021

Y1 - 2021

N2 - End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, with disjoint, multi-style datasets, i.e., datasets of different styles are recorded, with each style recorded by a single speaker in multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighted sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to “fool” a well-trained discriminator; 3) a style distortion loss to measure the expected style loss after the transfer; 4) a cycle consistency loss to preserve the speaker identity of the source after the transfer. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. The performance of the new approach is better and more robust than that of four prior-art baseline systems.

KW - Cycle consistency

KW - Disjoint datasets

KW - Neural TTS

KW - Style distortion

KW - Style transfer

UR - http://www.scopus.com/inward/record.url?scp=85119204746&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2021-1407

DO - 10.21437/Interspeech.2021-1407

M3 - Conference contribution

AN - SCOPUS:85119204746

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 3466

EP - 3470

BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

PB - International Speech Communication Association

T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

Y2 - 30 August 2021 through 3 September 2021

ER -

An X, Soong FK, Xie L. Improving performance of seen and unseen speech style transfer in end-to-end neural TTS. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association. 2021. p. 3466-3470. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). doi: 10.21437/Interspeech.2021-1407
