Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

Hongqiang Du, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

One-shot voice conversion has received significant attention because it requires only one utterance each from the source speaker and the target speaker. Moreover, neither the source speaker nor the target speaker needs to be seen during training. However, existing one-shot voice conversion approaches are not stable for unseen speakers because the speaker embedding extracted from a single utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract the speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates a residual network and a squeeze-and-excitation network to extract discriminative frame-level speaker information by modeling frame-wise and channel-wise interdependence in the features. Then an attention mechanism is introduced to further emphasize speaker-related information by assigning different weights to the frame-level speaker information. Finally, a statistics pooling layer aggregates the weighted frame-level speaker information to form an utterance-level speaker embedding. The experimental results demonstrate that the proposed speaker encoder improves the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.
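The last three stages named in the abstract (channel-wise squeeze-and-excitation gating, attention weights over frames, and statistics pooling) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the residual blocks are omitted, and the function names, the weight matrices `w_reduce` and `w_expand`, and the single-vector dot-product attention scorer `attn_vector` are all hypothetical choices made for illustration. The abstract does not specify layer sizes or the form of the attention scoring function.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def squeeze_excite(frames, w_reduce, w_expand):
    """Rescale each channel by a gate computed from its time average.

    frames:   T x C list of frame-level features.
    w_reduce: H x C bottleneck weights (hypothetical, no bias).
    w_expand: C x H expansion weights (hypothetical, no bias).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    # Squeeze: average every channel over time.
    squeezed = [sum(f[c] for f in frames) / num_frames
                for c in range(num_channels)]
    # Excite: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_expand, hidden)]
    return [[x * g for x, g in zip(frame, gates)] for frame in frames]


def attentive_stats_pooling(frames, attn_vector):
    """Pool T x C frames into one 2C-dimensional utterance embedding.

    Each frame is scored by a dot product with attn_vector (an assumed,
    simplified scorer); the softmax of the scores weights the mean and
    standard deviation computed over time.
    """
    scores = [sum(x * a for x, a in zip(frame, attn_vector))
              for frame in frames]
    weights = softmax(scores)
    num_channels = len(frames[0])
    mean = [sum(w * f[c] for w, f in zip(weights, frames))
            for c in range(num_channels)]
    second_moment = [sum(w * f[c] ** 2 for w, f in zip(weights, frames))
                     for c in range(num_channels)]
    # Clamp at zero to guard against small negative rounding errors.
    std = [math.sqrt(max(m2 - mu ** 2, 0.0))
           for m2, mu in zip(second_moment, mean)]
    return mean + std
```

For example, `attentive_stats_pooling([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])` gives every frame the same weight and returns the unweighted mean and standard deviation per channel, `[2.0, 3.0, 1.0, 1.0]`. A non-zero `attn_vector` shifts both statistics toward the frames that score highest.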

Original language: English
Title of host publication: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher: International Speech Communication Association
Pages: 4725-4729
Number of pages: 5
ISBN (electronic): 9781713836902
DOI
Publication status: Published - 2021
Event: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic
Duration: 30 Aug 2021 to 3 Sep 2021

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 6
ISSN (print): 2308-457X
ISSN (electronic): 1990-9772

Conference

Conference: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory: Czech Republic
City: Brno
Period: 30/08/21 to 3/09/21
