WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

Research output: Contribution to journal › Conference article › peer-review

6 Scopus citations

Abstract

With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and quality-based data filtering, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on a benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.
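To illustrate the quality-tiered subsets described in the abstract, below is a minimal Python sketch of splitting segments into tiers by a per-segment quality score. The metadata layout, field order, tier names, and score thresholds are illustrative assumptions, not the corpus's actual format or the paper's exact cut-offs.

```python
import csv

def split_by_quality(metadata_path: str,
                     premium_thr: float = 4.0,
                     standard_thr: float = 3.8):
    """Group segments into quality tiers by a per-segment score.

    Assumes a tab-separated metadata file with one segment per line:
    <utt_id>\t<audio_path>\t<text>\t<quality_score>
    (hypothetical layout; thresholds are placeholders, not the paper's values).
    """
    tiers = {"premium": [], "standard": [], "basic": []}
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for utt_id, audio_path, text, score in csv.reader(f, delimiter="\t"):
            s = float(score)
            if s >= premium_thr:
                tiers["premium"].append((utt_id, audio_path, text))
            elif s >= standard_thr:
                tiers["standard"].append((utt_id, audio_path, text))
            else:
                tiers["basic"].append((utt_id, audio_path, text))
    return tiers

if __name__ == "__main__":
    # Usage example with a hypothetical metadata file name.
    subsets = split_by_quality("wenetspeech4tts_metadata.tsv")
    for name, segments in subsets.items():
        print(f"{name}: {len(segments)} segments")
```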

Original language: English
Pages (from-to): 1840-1844
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 – 5 Sep 2024

Keywords

  • benchmark
  • corpus
  • text-to-speech
