Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

Hongqiang Du; Lei Xie

doi:10.21437/Interspeech.2021-2132

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

Hongqiang Du, Lei Xie

School of Computer Science

Northwestern Polytechnical University Xian

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers as the speaker embedding extracted from one utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates residual network and squeeze-and-excitation network to extract discriminative speaker information in frame level by modeling framewise and channel-wise interdependence in features. Then attention mechanism is introduced to further emphasize speaker related information via assigning different weights to frame level speaker information. Finally a statistic pooling layer is used to aggregate weighted frame level speaker information to form utterance level speaker embedding. The experimental results demonstrate that our proposed speaker encoder can improve the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.

Original language	English
Title of host publication	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher	International Speech Communication Association
Pages	4725-4729
Number of pages	5
ISBN (Electronic)	9781713836902
DOIs	https://doi.org/10.21437/Interspeech.2021-2132
State	Published - 2021
Event	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic Duration: 30 Aug 2021 → 3 Sep 2021

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	6
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory	Czech Republic
City	Brno
Period	30/08/21 → 3/09/21

Keywords

One-shot
Speaker embedding
Voice conversion

Access to Document

10.21437/Interspeech.2021-2132

Cite this

Du, H., & Xie, L. (2021). Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 (pp. 4725-4729). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 6). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2021-2132

Du, Hongqiang ; Xie, Lei. / Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 4725-4729 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH).

@inproceedings{2d177e668bae4e0c9801a40ad9184bf4,

title = "Improving robustness of one-shot voice conversion with deep discriminative speaker encoder",

abstract = "One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers as the speaker embedding extracted from one utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates residual network and squeeze-and-excitation network to extract discriminative speaker information in frame level by modeling framewise and channel-wise interdependence in features. Then attention mechanism is introduced to further emphasize speaker related information via assigning different weights to frame level speaker information. Finally a statistic pooling layer is used to aggregate weighted frame level speaker information to form utterance level speaker embedding. The experimental results demonstrate that our proposed speaker encoder can improve the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.",

keywords = "One-shot, Speaker embedding, Voice conversion",

author = "Hongqiang Du and Lei Xie",

note = "Publisher Copyright: Copyright {\textcopyright} 2021 ISCA.; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 ; Conference date: 30-08-2021 Through 03-09-2021",

year = "2021",

doi = "10.21437/Interspeech.2021-2132",

language = "英语",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "4725--4729",

booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021",

}

Du, H & Xie, L 2021, Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 6, International Speech Communication Association, pp. 4725-4729, 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic, 30/08/21. https://doi.org/10.21437/Interspeech.2021-2132

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. / Du, Hongqiang; Xie, Lei.
22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. p. 4725-4729 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 6).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

TY - GEN

T1 - Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

AU - Du, Hongqiang

AU - Xie, Lei

PY - 2021

Y1 - 2021

N2 - One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers as the speaker embedding extracted from one utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates residual network and squeeze-and-excitation network to extract discriminative speaker information in frame level by modeling framewise and channel-wise interdependence in features. Then attention mechanism is introduced to further emphasize speaker related information via assigning different weights to frame level speaker information. Finally a statistic pooling layer is used to aggregate weighted frame level speaker information to form utterance level speaker embedding. The experimental results demonstrate that our proposed speaker encoder can improve the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.

AB - One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training. However, available one-shot voice conversion approaches are not stable for unseen speakers as the speaker embedding extracted from one utterance of an unseen speaker is not reliable. In this paper, we propose a deep discriminative speaker encoder to extract speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates residual network and squeeze-and-excitation network to extract discriminative speaker information in frame level by modeling framewise and channel-wise interdependence in features. Then attention mechanism is introduced to further emphasize speaker related information via assigning different weights to frame level speaker information. Finally a statistic pooling layer is used to aggregate weighted frame level speaker information to form utterance level speaker embedding. The experimental results demonstrate that our proposed speaker encoder can improve the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.

KW - One-shot

KW - Speaker embedding

KW - Voice conversion

UR - http://www.scopus.com/inward/record.url?scp=85119301023&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2021-2132

DO - 10.21437/Interspeech.2021-2132

M3 - 会议稿件

AN - SCOPUS:85119301023

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 4725

EP - 4729

BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

PB - International Speech Communication Association

T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

Y2 - 30 August 2021 through 3 September 2021

ER -

Du H, Xie L. Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association. 2021. p. 4725-4729. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). doi: 10.21437/Interspeech.2021-2132

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

Abstract

Publication series

Conference

Keywords

Access to Document

Other files and links

Fingerprint

Cite this