Cycle consistent network for end-to-end style transfer TTS training

Liumeng Xue; Shifeng Pan; Lei He; Lei Xie; Frank K. Soong

doi:10.1016/j.neunet.2021.03.005

School of Computer Science

Research output: Contribution to journal › Article › peer-review

19 Scopus citations

Abstract

In this paper, we propose a cycle consistent network based end-to-end TTS for speaking style transfer, including intra-speaker, inter-speaker, and unseen speaker style transfer for both parallel and unparallel transfers. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. The model is usually trained in a paired manner, which means the reference speech is totally paired with the output including speaker identity, text, and style. To achieve a better quality for style transfer, which for most cases is in an unpaired manner, we augment the model with an unpaired path with a separated variational style encoder. The unpaired path takes as input an unpaired reference speech and yields an unpaired output. The unpaired output, which lacks direct ground-truth target, is then successfully constrained by a delicately designed cycle consistent network. Specifically, the unpaired output of the forward transfer is fed into the model again as an unpaired reference input, and after the backward transfer yields an output expected to be the same as the original unpaired reference speech. Ablation study shows the effectiveness of the unpaired path, separated style encoders and cycle consistent network in the proposed model. The final evaluation demonstrates the proposed approach significantly outperforms the Global Style Token (GST) and VAE based systems for all the six style transfer categories, in metrics of naturalness, speech quality, similarity of speaker identity, and similarity of speaking style.
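The cycle described in the abstract (forward transfer, then the unpaired output fed back as a reference for a backward transfer, whose result is compared with the original reference) can be illustrated with a toy sketch. This is not the authors' model: `style_encoder`, `decoder`, and `cycle_consistency_loss` are hypothetical stand-ins, the "style" is assumed to be a simple per-bin mean of the spectrogram frames, and the L1 distance is an assumed choice of loss.

```python
import numpy as np

def style_encoder(mel):
    # Toy stand-in (assumption): summarise style as the per-bin mean over time.
    return mel.mean(axis=0)

def decoder(content, style):
    # Toy stand-in (assumption): add the style vector to every content frame.
    return content + style

def cycle_consistency_loss(content_src, content_ref, ref_mel):
    # Forward transfer: source content rendered in the style of the unpaired reference.
    forward_out = decoder(content_src, style_encoder(ref_mel))
    # Backward transfer: the unpaired output is reused as the style reference
    # while synthesising the reference's own content.
    backward_out = decoder(content_ref, style_encoder(forward_out))
    # The backward output is expected to match the original reference (L1 is assumed here).
    return float(np.mean(np.abs(backward_out - ref_mel)))

def zero_mean(x):
    # Remove the per-bin mean so that the toy "content" carries no style.
    return x - x.mean(axis=0)

rng = np.random.default_rng(0)
content_src = zero_mean(rng.normal(size=(50, 8)))  # 50 frames, 8 bins
content_ref = zero_mean(rng.normal(size=(40, 8)))  # 40 frames, 8 bins
ref_style = rng.normal(size=8)
ref_mel = content_ref + ref_style                  # unpaired reference "speech"

loss = cycle_consistency_loss(content_src, content_ref, ref_mel)
print(loss)
```

In this toy setting content and style separate exactly, so the loss is zero up to floating-point error. In a trained model the same quantity would act as a training signal for the unpaired path, which otherwise has no ground-truth target.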

Original language	English
Pages (from-to)	223-236
Number of pages	14
Journal	Neural Networks
Volume	140
DOIs	https://doi.org/10.1016/j.neunet.2021.03.005
State	Published - Aug 2021

Keywords

Cycle consistent
End-to-end
Speech synthesis
Style transfer
Variational autoencoder

Cite this

@article{ab21d79f09b44132af88a22f0fdacab4,

title = "Cycle consistent network for end-to-end style transfer TTS training",

abstract = "In this paper, we propose a cycle consistent network based end-to-end TTS for speaking style transfer, including intra-speaker, inter-speaker, and unseen speaker style transfer for both parallel and unparallel transfers. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. The model is usually trained in a paired manner, which means the reference speech is totally paired with the output including speaker identity, text, and style. To achieve a better quality for style transfer, which for most cases is in an unpaired manner, we augment the model with an unpaired path with a separated variational style encoder. The unpaired path takes as input an unpaired reference speech and yields an unpaired output. The unpaired output, which lacks direct ground-truth target, is then successfully constrained by a delicately designed cycle consistent network. Specifically, the unpaired output of the forward transfer is fed into the model again as an unpaired reference input, and after the backward transfer yields an output expected to be the same as the original unpaired reference speech. Ablation study shows the effectiveness of the unpaired path, separated style encoders and cycle consistent network in the proposed model. The final evaluation demonstrates the proposed approach significantly outperforms the Global Style Token (GST) and VAE based systems for all the six style transfer categories, in metrics of naturalness, speech quality, similarity of speaker identity, and similarity of speaking style.",

keywords = "Cycle consistent, End-to-end, Speech synthesis, Style transfer, Variational autoencoder",

author = "Liumeng Xue and Shifeng Pan and Lei He and Lei Xie and Soong, {Frank K.}",

note = "Publisher Copyright: {\textcopyright} 2021 Elsevier Ltd",

year = "2021",

month = aug,

doi = "10.1016/j.neunet.2021.03.005",

language = "English",

volume = "140",

pages = "223--236",

journal = "Neural Networks",

issn = "0893-6080",

publisher = "Elsevier Ltd",

}

TY - JOUR

T1 - Cycle consistent network for end-to-end style transfer TTS training

AU - Xue, Liumeng

AU - Pan, Shifeng

AU - He, Lei

AU - Xie, Lei

AU - Soong, Frank K.

PY - 2021/8

Y1 - 2021/8

AB - In this paper, we propose a cycle consistent network based end-to-end TTS for speaking style transfer, including intra-speaker, inter-speaker, and unseen speaker style transfer for both parallel and unparallel transfers. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. The model is usually trained in a paired manner, which means the reference speech is totally paired with the output including speaker identity, text, and style. To achieve a better quality for style transfer, which for most cases is in an unpaired manner, we augment the model with an unpaired path with a separated variational style encoder. The unpaired path takes as input an unpaired reference speech and yields an unpaired output. The unpaired output, which lacks direct ground-truth target, is then successfully constrained by a delicately designed cycle consistent network. Specifically, the unpaired output of the forward transfer is fed into the model again as an unpaired reference input, and after the backward transfer yields an output expected to be the same as the original unpaired reference speech. Ablation study shows the effectiveness of the unpaired path, separated style encoders and cycle consistent network in the proposed model. The final evaluation demonstrates the proposed approach significantly outperforms the Global Style Token (GST) and VAE based systems for all the six style transfer categories, in metrics of naturalness, speech quality, similarity of speaker identity, and similarity of speaking style.

KW - Cycle consistent

KW - End-to-end

KW - Speech synthesis

KW - Style transfer

KW - Variational autoencoder

UR - http://www.scopus.com/inward/record.url?scp=85103246042&partnerID=8YFLogxK

U2 - 10.1016/j.neunet.2021.03.005

DO - 10.1016/j.neunet.2021.03.005

M3 - Article

C2 - 33780874

AN - SCOPUS:85103246042

SN - 0893-6080

VL - 140

SP - 223

EP - 236

JO - Neural Networks

JF - Neural Networks

ER -
