TY - GEN
T1 - AISHELL-4
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
AU - Fu, Yihui
AU - Cheng, Luyao
AU - Lv, Shubo
AU - Jv, Yukai
AU - Kong, Yuxiang
AU - Chen, Zhuo
AU - Hu, Yanxin
AU - Xie, Lei
AU - Wu, Jian
AU - Bu, Hui
AU - Xu, Xin
AU - Du, Jun
AU - Chen, Jingdong
N1 - Publisher Copyright: Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
AB - In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge advanced research on multi-speaker processing and practical application scenarios in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation, such as short pauses, speech overlap, quick speaker turns, and noise. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given that most open-source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.
KW - AISHELL-4
KW - Conference scenario
KW - Mandarin
KW - Speaker diarization
KW - Speech front-end processing
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85119251247&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1397
DO - 10.21437/Interspeech.2021-1397
M3 - Conference contribution
AN - SCOPUS:85119251247
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4406
EP - 4410
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -