End-to-End Voice Conversion with Information Perturbation

Qicong Xie; Shan Yang; Yi Lei; Lei Xie; Dan Su

doi:10.1109/ISCSLP57327.2022.10037890

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

The ideal goal of voice conversion is to make the source speaker's speech sound naturally like the target speaker while preserving the linguistic content and prosody of the source speech. However, current approaches fall short of comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfactory because of the mismatch between the acoustic model and the vocoder. In this paper, we leverage recent advances in information perturbation and propose a fully end-to-end approach to high-quality voice conversion. We first adopt information perturbation to remove speaker-related information from the source speech, disentangling speaker timbre from linguistic content; the linguistic information is then modeled by a content encoder. To better transfer the prosody of the source speech to the target, we introduce a speaker-related pitch encoder that maintains the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is achieved through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.

Original language	English
Title of host publication	2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022
Editors	Kong Aik Lee, Hung-yi Lee, Yanfeng Lu, Minghui Dong
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	91-95
Number of pages	5
ISBN (Electronic)	9798350397963
DOIs	https://doi.org/10.1109/ISCSLP57327.2022.10037890
State	Published - 2022
Event	13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 - Singapore, Singapore
Duration	11 Dec 2022 → 14 Dec 2022

Publication series

Name	2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

Conference

Conference	13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022
Country/Territory	Singapore
City	Singapore
Period	11/12/22 → 14/12/22

Keywords

any-to-any
end-to-end
voice conversion

Cite this

Xie, Q., Yang, S., Lei, Y., Xie, L., & Su, D. (2022). End-to-End Voice Conversion with Information Perturbation. In K. A. Lee, H. Lee, Y. Lu, & M. Dong (Eds.), 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 (pp. 91-95). (2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ISCSLP57327.2022.10037890

Xie, Qicong ; Yang, Shan ; Lei, Yi et al. / End-to-End Voice Conversion with Information Perturbation. 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022. editor / Kong Aik Lee ; Hung-yi Lee ; Yanfeng Lu ; Minghui Dong. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 91-95 (2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022).

@inproceedings{e35103cac3214c668d82713eb92b9098,

title = "End-to-End Voice Conversion with Information Perturbation",

abstract = "The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are insufficient to achieve comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfied due to the mismatch between the acoustic model and the vocoder. In this paper, we leverage the recent advances in information perturbation and propose a fully end-to-end approach to conduct high-quality voice conversion. We first adopt information perturbation to remove speaker-related information in the source speech to disentangle speaker timbre and linguistic content and thus the linguistic information is subsequently modeled by a content encoder. To better transfer the prosody of the source speech to the target, we particularly introduce a speaker-related pitch encoder which can maintain the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is set up through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms the state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.",

keywords = "any-to-any, end-to-end, voice conversion",

author = "Qicong Xie and Shan Yang and Yi Lei and Lei Xie and Dan Su",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 ; Conference date: 11-12-2022 Through 14-12-2022",

year = "2022",

doi = "10.1109/ISCSLP57327.2022.10037890",

language = "English",

series = "2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "91--95",

editor = "Lee, {Kong Aik} and Hung-yi Lee and Yanfeng Lu and Minghui Dong",

booktitle = "2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022",

}

Xie, Q, Yang, S, Lei, Y, Xie, L & Su, D 2022, End-to-End Voice Conversion with Information Perturbation. in KA Lee, H Lee, Y Lu & M Dong (eds), 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022. 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022, Institute of Electrical and Electronics Engineers Inc., pp. 91-95, 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022, Singapore, Singapore, 11/12/22. https://doi.org/10.1109/ISCSLP57327.2022.10037890

End-to-End Voice Conversion with Information Perturbation. / Xie, Qicong; Yang, Shan; Lei, Yi et al.
2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022. ed. / Kong Aik Lee; Hung-yi Lee; Yanfeng Lu; Minghui Dong. Institute of Electrical and Electronics Engineers Inc., 2022. p. 91-95 (2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022).

TY - GEN

T1 - End-to-End Voice Conversion with Information Perturbation

AU - Xie, Qicong

AU - Yang, Shan

AU - Lei, Yi

AU - Xie, Lei

AU - Su, Dan

PY - 2022

Y1 - 2022

N2 - The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are insufficient to achieve comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfied due to the mismatch between the acoustic model and the vocoder. In this paper, we leverage the recent advances in information perturbation and propose a fully end-to-end approach to conduct high-quality voice conversion. We first adopt information perturbation to remove speaker-related information in the source speech to disentangle speaker timbre and linguistic content and thus the linguistic information is subsequently modeled by a content encoder. To better transfer the prosody of the source speech to the target, we particularly introduce a speaker-related pitch encoder which can maintain the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is set up through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms the state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.

AB - The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are insufficient to achieve comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfied due to the mismatch between the acoustic model and the vocoder. In this paper, we leverage the recent advances in information perturbation and propose a fully end-to-end approach to conduct high-quality voice conversion. We first adopt information perturbation to remove speaker-related information in the source speech to disentangle speaker timbre and linguistic content and thus the linguistic information is subsequently modeled by a content encoder. To better transfer the prosody of the source speech to the target, we particularly introduce a speaker-related pitch encoder which can maintain the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is set up through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms the state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.

KW - any-to-any

KW - end-to-end

KW - voice conversion

UR - http://www.scopus.com/inward/record.url?scp=85148581958&partnerID=8YFLogxK

U2 - 10.1109/ISCSLP57327.2022.10037890

DO - 10.1109/ISCSLP57327.2022.10037890

M3 - Conference contribution

AN - SCOPUS:85148581958

T3 - 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

SP - 91

EP - 95

BT - 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

A2 - Lee, Kong Aik

A2 - Lee, Hung-yi

A2 - Lu, Yanfeng

A2 - Dong, Minghui

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022

Y2 - 11 December 2022 through 14 December 2022

ER -

Xie Q, Yang S, Lei Y, Xie L, Su D. End-to-End Voice Conversion with Information Perturbation. In Lee KA, Lee H, Lu Y, Dong M, editors, 2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022. Institute of Electrical and Electronics Engineers Inc. 2022. p. 91-95. (2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022). doi: 10.1109/ISCSLP57327.2022.10037890
