TY - GEN
T1 - The multi-speaker multi-style voice cloning challenge 2021
AU - Xie, Qicong
AU - Tian, Xiaohai
AU - Liu, Guanghou
AU - Song, Kun
AU - Xie, Lei
AU - Wu, Zhiyong
AU - Li, Hai
AU - Shi, Song
AU - Li, Haizhou
AU - Hong, Fen
AU - Bu, Hui
AU - Xu, Xin
N1 - Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - The Multi-speaker Multi-style Voice Cloning Challenge (M2VoC) aims to provide a common sizable dataset and a fair testbed for benchmarking the popular voice cloning task. Specifically, we formulate the challenge as adapting an average TTS model to a stylistic target voice with limited data from the target speaker, evaluated by speaker identity and style similarity. The challenge consists of two tracks, a few-shot track and a one-shot track, in which participants are required to clone multiple target voices with 100 and 5 samples, respectively. Each track contains two sub-tracks. In sub-track a, to fairly compare different strategies, participants are strictly limited to the training data provided by the organizer. In sub-track b, participants may use any publicly available data. In this paper, we present a detailed explanation of the tasks and data used in the challenge, followed by a summary of the submitted systems and the evaluation results.
AB - The Multi-speaker Multi-style Voice Cloning Challenge (M2VoC) aims to provide a common sizable dataset and a fair testbed for benchmarking the popular voice cloning task. Specifically, we formulate the challenge as adapting an average TTS model to a stylistic target voice with limited data from the target speaker, evaluated by speaker identity and style similarity. The challenge consists of two tracks, a few-shot track and a one-shot track, in which participants are required to clone multiple target voices with 100 and 5 samples, respectively. Each track contains two sub-tracks. In sub-track a, to fairly compare different strategies, participants are strictly limited to the training data provided by the organizer. In sub-track b, participants may use any publicly available data. In this paper, we present a detailed explanation of the tasks and data used in the challenge, followed by a summary of the submitted systems and the evaluation results.
KW - Speaker adaptation
KW - Speech synthesis
KW - Transfer learning
KW - Voice cloning
UR - http://www.scopus.com/inward/record.url?scp=85108665338&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9414001
DO - 10.1109/ICASSP39728.2021.9414001
M3 - Conference contribution
AN - SCOPUS:85108665338
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8613
EP - 8617
BT - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -