TY - GEN
T1 - WenetSpeech: A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition
T2 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
AU - Zhang, Binbin
AU - Lv, Hang
AU - Guo, Pengcheng
AU - Shao, Qijie
AU - Yang, Chao
AU - Xie, Lei
AU - Xu, Xin
AU - Bu, Hui
AU - Chen, Xiaoyu
AU - Zeng, Chenchen
AU - Wu, Di
AU - Peng, Zhendong
N1 - Publisher Copyright:
© 2022 IEEE
PY - 2022
Y1 - 2022
N2 - In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, 22400+ hours in total. We collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) method is introduced to generate audio/text segmentation candidates for the YouTube data based on the corresponding video subtitles, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the podcast data. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev, for cross-validation during training; Test_Net, collected from the Internet for matched testing; and Test_Meeting, recorded from real meetings for a more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.
KW - automatic speech recognition
KW - corpus
KW - multi-domain
UR - http://www.scopus.com/inward/record.url?scp=85128106924&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9746682
DO - 10.1109/ICASSP43922.2022.9746682
M3 - Conference contribution
AN - SCOPUS:85128106924
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6182
EP - 6186
BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 May 2022 through 27 May 2022
ER -