TY - CONF
T1 - Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
AU - Ning, Ziqian
AU - Xie, Qicong
AU - Zhu, Pengcheng
AU - Wang, Zhichao
AU - Xue, Liumeng
AU - Yao, Jixun
AU - Xie, Lei
AU - Bi, Mengxiao
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Voice conversion for highly expressive speech is challenging. Current approaches struggle to balance speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that combines the advantages of the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav encoder as a content extractor that learns linguistic and para-linguistic features, respectively; the BNFs are extracted by a robust pre-trained ASR model, while the perturbed waveform is made speaker-irrelevant through signal perturbation. We then fuse the linguistic and para-linguistic features through an attention mechanism in which speaker-dependent prosody features serve as the attention query; these features are produced by a prosody encoder that takes the target speaker embedding and the normalized pitch and energy of the source speech as input. Finally, the decoder consumes the fused features and the speaker-dependent prosody features to generate the converted speech. Experiments show that Expressive-VC is superior to several popular systems, achieving both high expressiveness captured from the source speech and high speaker similarity to the target speaker, while intelligibility is well maintained.
KW - expressive
KW - feature fusion
KW - information perturbation
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85177556599&partnerID=8YFLogxK
DO - 10.1109/ICASSP49357.2023.10096057
M3 - Conference contribution
AN - SCOPUS:85177556599
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 4 June 2023 through 10 June 2023
ER -