WENETSPEECH: A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION

Binbin Zhang; Hang Lv; Pengcheng Guo; Qijie Shao; Chao Yang; Lei Xie; Xin Xu; Hui Bu; Xiaoyu Chen; Chenchen Zeng; Di Wu; Zhendong Peng

doi:10.1109/ICASSP43922.2022.9746682

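School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

151 citations (Scopus)

Abstract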

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, 22400+ hours in total. We collect the data from YouTube and Podcast, covering a variety of speaking styles, scenarios, domains, topics, and noise conditions. An optical character recognition (OCR) method is introduced to generate audio/text segmentation candidates for the YouTube data from the corresponding video subtitles, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev, for cross-validation during training; Test Net, collected from the Internet as a matched test; and Test Meeting, recorded from real meetings as a more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Original language	English
Title of host publication	2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	6182-6186
Number of pages	5
ISBN (Electronic)	9781665405409
DOI	https://doi.org/10.1109/ICASSP43922.2022.9746682
Publication status	Published - 2022
Event	2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 - Hybrid, Singapore. Duration: 22 May 2022 → 27 May 2022

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2022-May
ISSN (Print)	1520-6149

Conference

Conference	2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
Country/Territory	Singapore
City	Hybrid
Period	22/05/22 → 27/05/22

Access to document

10.1109/ICASSP43922.2022.9746682

Cite this

Zhang, B., Lv, H., Guo, P., Shao, Q., Yang, C., Xie, L., Xu, X., Bu, H., Chen, X., Zeng, C., Wu, D., & Peng, Z. (2022). WENETSPEECH: A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION. In 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings (pp. 6182-6186). (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2022-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP43922.2022.9746682

Zhang, Binbin ; Lv, Hang ; Guo, Pengcheng et al. / WENETSPEECH : A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION. 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 6182-6186 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{6a717ff95df64f93bbfcc359c39bb083,

title = "WENETSPEECH: A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION",

abstract = "In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics and noisy conditions. An optical character recognition (OCR) method is introduced to generate the audio/text segmentation candidates for the YouTube data on the corresponding video subtitles, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation - Dev for cross-validation purpose in training, Test Net, collected from Internet for matched test, and Test Meeting, recorded from real meetings for more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the current largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.",

keywords = "automatic speech recognition, corpus, multi-domain",

author = "Binbin Zhang and Hang Lv and Pengcheng Guo and Qijie Shao and Chao Yang and Lei Xie and Xin Xu and Hui Bu and Xiaoyu Chen and Chenchen Zeng and Di Wu and Zhendong Peng",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE; 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 ; Conference date: 22-05-2022 Through 27-05-2022",

year = "2022",

doi = "10.1109/ICASSP43922.2022.9746682",

language = "English",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "6182--6186",

booktitle = "2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings",

}

Zhang, B, Lv, H, Guo, P, Shao, Q, Yang, C, Xie, L, Xu, X, Bu, H, Chen, X, Zeng, C, Wu, D & Peng, Z 2022, WENETSPEECH: A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION. in 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2022-May, Institute of Electrical and Electronics Engineers Inc., pp. 6182-6186, 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Hybrid, Singapore, 22/05/22. https://doi.org/10.1109/ICASSP43922.2022.9746682

WENETSPEECH: A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION. / Zhang, Binbin; Lv, Hang; Guo, Pengcheng et al.
2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 6182-6186 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2022-May).

TY - GEN

T1 - WENETSPEECH: A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION

T2 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022

AU - Zhang, Binbin

AU - Lv, Hang

AU - Guo, Pengcheng

AU - Shao, Qijie

AU - Yang, Chao

AU - Xie, Lei

AU - Xu, Xin

AU - Bu, Hui

AU - Chen, Xiaoyu

AU - Zeng, Chenchen

AU - Wu, Di

AU - Peng, Zhendong

PY - 2022

Y1 - 2022

N2 - In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics and noisy conditions. An optical character recognition (OCR) method is introduced to generate the audio/text segmentation candidates for the YouTube data on the corresponding video subtitles, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation - Dev for cross-validation purpose in training, Test Net, collected from Internet for matched test, and Test Meeting, recorded from real meetings for more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the current largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

KW - automatic speech recognition

KW - corpus

KW - multi-domain

UR - http://www.scopus.com/inward/record.url?scp=85128106924&partnerID=8YFLogxK

U2 - 10.1109/ICASSP43922.2022.9746682

DO - 10.1109/ICASSP43922.2022.9746682

M3 - Conference contribution

AN - SCOPUS:85128106924

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 6182

EP - 6186

BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 22 May 2022 through 27 May 2022

ER -

Zhang B, Lv H, Guo P, Shao Q, Yang C, Xie L et al. WENETSPEECH: A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION. In 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2022. p. 6182-6186. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP43922.2022.9746682
