TY - CONF
T1 - Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
AU - Ning, Ziqian
AU - Xie, Qicong
AU - Zhu, Pengcheng
AU - Wang, Zhichao
AU - Xue, Liumeng
AU - Yao, Jixun
AU - Xie, Lei
AU - Bi, Mengxiao
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Voice conversion for highly expressive speech is challenging. Current approaches struggle to balance speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that combines the advantages of the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav encoder as a content extractor that learns linguistic and para-linguistic features, respectively; the BNFs are extracted by a robust pre-trained ASR model, while the perturbed waveform is made speaker-irrelevant through signal perturbation. We then fuse the linguistic and para-linguistic features through an attention mechanism in which speaker-dependent prosody features serve as the attention query; these features are produced by a prosody encoder that takes the target speaker embedding and the normalized pitch and energy of the source speech as input. Finally, the decoder consumes the fused features and the speaker-dependent prosody features to generate the converted speech. Experiments show that Expressive-VC is superior to several popular systems, achieving both high expressiveness captured from the source speech and high speaker similarity to the target speaker, while intelligibility is well maintained.
KW - expressive
KW - feature fusion
KW - information perturbation
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85177556599&partnerID=8YFLogxK
DO - 10.1109/ICASSP49357.2023.10096057
M3 - Conference contribution
AN - SCOPUS:85177556599
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 4 June 2023 through 10 June 2023
ER -