TY - GEN
T1 - Enriching source style transfer in recognition-synthesis based non-parallel voice conversion
AU - Wang, Zhichao
AU - Zhou, Xinyong
AU - Yang, Fengyu
AU - Li, Tao
AU - Du, Hongqiang
AU - Xie, Lei
AU - Gan, Wendong
AU - Chen, Haitao
AU - Li, Hai
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - Current voice conversion (VC) methods can successfully convert the timbre of audio. However, as effectively modeling the source audio's prosody is a challenging task, there are still limitations in transferring the source style to the converted speech. This study proposes a source style transfer method based on a recognition-synthesis framework. Previously, in speech generation tasks, prosody could be modeled explicitly with prosodic features or implicitly with a latent prosody extractor. In this paper, taking advantage of both, we model the prosody in a hybrid manner, which effectively combines explicit and implicit methods in a proposed prosody module. Specifically, prosodic features are used to explicitly model prosody, while a VAE and a reference encoder are used to implicitly model prosody, taking the Mel spectrum and bottleneck features as input, respectively. Furthermore, adversarial training is introduced to remove speaker-related information from the VAE outputs, avoiding leaking source speaker information while transferring style. Finally, we use a modified self-attention based encoder to extract sentential context from bottleneck features, which also implicitly aggregates the prosodic aspects of the source speech from the layered representations. Experiments show that our approach is superior to the baseline and a competitive system in terms of style transfer; meanwhile, speech quality and speaker similarity are well maintained.
AB - Current voice conversion (VC) methods can successfully convert the timbre of audio. However, as effectively modeling the source audio's prosody is a challenging task, there are still limitations in transferring the source style to the converted speech. This study proposes a source style transfer method based on a recognition-synthesis framework. Previously, in speech generation tasks, prosody could be modeled explicitly with prosodic features or implicitly with a latent prosody extractor. In this paper, taking advantage of both, we model the prosody in a hybrid manner, which effectively combines explicit and implicit methods in a proposed prosody module. Specifically, prosodic features are used to explicitly model prosody, while a VAE and a reference encoder are used to implicitly model prosody, taking the Mel spectrum and bottleneck features as input, respectively. Furthermore, adversarial training is introduced to remove speaker-related information from the VAE outputs, avoiding leaking source speaker information while transferring style. Finally, we use a modified self-attention based encoder to extract sentential context from bottleneck features, which also implicitly aggregates the prosodic aspects of the source speech from the layered representations. Experiments show that our approach is superior to the baseline and a competitive system in terms of style transfer; meanwhile, speech quality and speaker similarity are well maintained.
KW - Hybrid modeling
KW - Style transfer
KW - Voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85119277682&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1351
DO - 10.21437/Interspeech.2021-1351
M3 - Conference contribution
AN - SCOPUS:85119277682
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4820
EP - 4824
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -