Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

Hongqiang Du; Lei Xie

doi:10.21437/Interspeech.2021-2132

School of Computer Science

Northwestern Polytechnical University, Xi'an

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

One-shot voice conversion has received significant attention because it requires only one utterance each from the source speaker and the target speaker. Moreover, neither speaker needs to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers, because the speaker embedding extracted from a single utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract the speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates a residual network and a squeeze-and-excitation network to extract discriminative speaker information at the frame level by modeling frame-wise and channel-wise interdependence in the features. An attention mechanism is then introduced to further emphasize speaker-related information by assigning different weights to the frame-level speaker information. Finally, a statistics pooling layer aggregates the weighted frame-level speaker information to form an utterance-level speaker embedding. The experimental results demonstrate that the proposed speaker encoder improves the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.
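
The last two steps of the abstract (attention weighting followed by statistics pooling) can be illustrated with a minimal sketch. The abstract gives no equations, so everything here is an assumption: the weights are taken to be a softmax over per-frame scores, the pooled statistics are a weighted mean and weighted standard deviation, and the names `softmax`, `attentive_stats_pooling`, `frames`, and `scores` are hypothetical. The network that would produce the frame-level features and the attention scores is not shown.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, scores, eps=1e-8):
    """Pool T frame-level vectors into one utterance-level vector.

    frames: list of T feature vectors, each a list of D floats
            (assumed to come from a frame-level encoder, not shown).
    scores: list of T unnormalized attention scores (assumed input).
    Returns a list of 2 * D floats: weighted mean, then weighted std.
    """
    if not frames or len(frames) != len(scores):
        raise ValueError("frames and scores must be non-empty and equal in length")

    weights = softmax(scores)
    dim = len(frames[0])

    # Weighted mean over frames, per feature dimension.
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]

    # Weighted variance around that mean, per feature dimension.
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    std = [math.sqrt(v + eps) for v in var]  # eps avoids sqrt(0) issues

    return mean + std  # concatenation: utterance-level embedding
```

With equal scores the function reduces to plain statistics pooling; a strongly peaked score vector makes the embedding follow the highest-scored frame, which is the sense in which attention "emphasizes" some frames over others.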

Original language	English
Title of host publication	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher	International Speech Communication Association
Pages	4725-4729
Number of pages	5
ISBN (Electronic)	9781713836902
DOI	https://doi.org/10.21437/Interspeech.2021-2132
Publication status	Published - 2021
Event	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic. Duration: 30 Aug 2021 → 3 Sep 2021

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	6
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory	Czech Republic
City	Brno
Period	30/08/21 → 3/09/21

Access to Document

10.21437/Interspeech.2021-2132

Other files and links

Link to publication in Scopus

Cite this

Du, H., & Xie, L. (2021). Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 (pp. 4725-4729). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 6). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2021-2132

Du, Hongqiang ; Xie, Lei. / Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 4725-4729 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH).

@inproceedings{2d177e668bae4e0c9801a40ad9184bf4,

title = "Improving robustness of one-shot voice conversion with deep discriminative speaker encoder",

abstract = "One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers as the speaker embedding extracted from one utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates residual network and squeeze-and-excitation network to extract discriminative speaker information in frame level by modeling framewise and channel-wise interdependence in features. Then attention mechanism is introduced to further emphasize speaker related information via assigning different weights to frame level speaker information. Finally a statistic pooling layer is used to aggregate weighted frame level speaker information to form utterance level speaker embedding. The experimental results demonstrate that our proposed speaker encoder can improve the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.",

keywords = "One-shot, Speaker embedding, Voice conversion",

author = "Hongqiang Du and Lei Xie",

note = "Publisher Copyright: Copyright {\textcopyright} 2021 ISCA.; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 ; Conference date: 30-08-2021 Through 03-09-2021",

year = "2021",

doi = "10.21437/Interspeech.2021-2132",

language = "English",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "4725--4729",

booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021",

}

Du, H & Xie, L 2021, Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 6, International Speech Communication Association, pp. 4725-4729, 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic, 30/08/21. https://doi.org/10.21437/Interspeech.2021-2132

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. / Du, Hongqiang; Xie, Lei.
22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 4725-4729 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 6).

TY - GEN

T1 - Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

AU - Du, Hongqiang

AU - Xie, Lei

PY - 2021

Y1 - 2021

N2 - One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers as the speaker embedding extracted from one utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates residual network and squeeze-and-excitation network to extract discriminative speaker information in frame level by modeling framewise and channel-wise interdependence in features. Then attention mechanism is introduced to further emphasize speaker related information via assigning different weights to frame level speaker information. Finally a statistic pooling layer is used to aggregate weighted frame level speaker information to form utterance level speaker embedding. The experimental results demonstrate that our proposed speaker encoder can improve the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.

AB - One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers as the speaker embedding extracted from one utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates residual network and squeeze-and-excitation network to extract discriminative speaker information in frame level by modeling framewise and channel-wise interdependence in features. Then attention mechanism is introduced to further emphasize speaker related information via assigning different weights to frame level speaker information. Finally a statistic pooling layer is used to aggregate weighted frame level speaker information to form utterance level speaker embedding. The experimental results demonstrate that our proposed speaker encoder can improve the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.

KW - One-shot

KW - Speaker embedding

KW - Voice conversion

UR - http://www.scopus.com/inward/record.url?scp=85119301023&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2021-2132

DO - 10.21437/Interspeech.2021-2132

M3 - Conference contribution

AN - SCOPUS:85119301023

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 4725

EP - 4729

BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

PB - International Speech Communication Association

T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

Y2 - 30 August 2021 through 3 September 2021

ER -

Du H, Xie L. Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association. 2021. pp. 4725-4729. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). doi: 10.21437/Interspeech.2021-2132
