AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario

Yihui Fu; Luyao Cheng; Shubo Lv; Yukai Jv; Yuxiang Kong; Zhuo Chen; Yanxin Hu; Lei Xie; Jian Wu; Hui Bu; Xin Xu; Jun Du; Jingdong Chen

doi:10.21437/Interspeech.2021-1397

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

25 Scopus citations

Abstract

In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge advanced research on multi-speaker processing and practical application scenarios in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation, such as short pauses, speech overlap, quick speaker turns, and noise. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition, and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given that most open source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.

Original language	English
Title of host publication	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher	International Speech Communication Association
Pages	4406-4410
Number of pages	5
ISBN (Electronic)	9781713836902
DOIs	https://doi.org/10.21437/Interspeech.2021-1397
State	Published - 2021
Event	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic, 30 Aug 2021 → 3 Sep 2021

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	6
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory	Czech Republic
City	Brno
Period	30/08/21 → 3/09/21

Keywords

AISHELL-4
Conference scenario
Mandarin
Speaker diarization
Speech front-end processing
Speech recognition

Access to Document

10.21437/Interspeech.2021-1397

Cite this

Fu, Y., Cheng, L., Lv, S., Jv, Y., Kong, Y., Chen, Z., Hu, Y., Xie, L., Wu, J., Bu, H., Xu, X., Du, J., & Chen, J. (2021). AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 (pp. 4406-4410). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 6). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2021-1397

Fu, Yihui ; Cheng, Luyao ; Lv, Shubo et al. / AISHELL-4 : An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario. 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 4406-4410 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH).

@inproceedings{ac0bb8f73a1c4f7ab39d9ad82e8e1157,

title = "AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario",

abstract = "In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by 8-channel circular microphone array for speech processing in conference scenario. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given most open source dataset for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversation speech, providing additional value for data diversity in speech community. We also release a PyTorch-based training and evaluation framework as baseline system to promote reproducible research in this field.",

keywords = "AISHELL-4, Conference scenario, Mandarin, Speaker diarization, Speech front-end processing, Speech recognition",

author = "Yihui Fu and Luyao Cheng and Shubo Lv and Yukai Jv and Yuxiang Kong and Zhuo Chen and Yanxin Hu and Lei Xie and Jian Wu and Hui Bu and Xin Xu and Jun Du and Jingdong Chen",

note = "Publisher Copyright: Copyright {\textcopyright} 2021 ISCA.; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 ; Conference date: 30-08-2021 Through 03-09-2021",

year = "2021",

doi = "10.21437/Interspeech.2021-1397",

language = "English",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "4406--4410",

booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021",

}

Fu, Y, Cheng, L, Lv, S, Jv, Y, Kong, Y, Chen, Z, Hu, Y, Xie, L, Wu, J, Bu, H, Xu, X, Du, J & Chen, J 2021, AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario. in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 6, International Speech Communication Association, pp. 4406-4410, 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic, 30/08/21. https://doi.org/10.21437/Interspeech.2021-1397

AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario. / Fu, Yihui; Cheng, Luyao; Lv, Shubo et al.
22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. p. 4406-4410 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 6).

TY - GEN

T1 - AISHELL-4

T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

AU - Fu, Yihui

AU - Cheng, Luyao

AU - Lv, Shubo

AU - Jv, Yukai

AU - Kong, Yuxiang

AU - Chen, Zhuo

AU - Hu, Yanxin

AU - Xie, Lei

AU - Wu, Jian

AU - Bu, Hui

AU - Xu, Xin

AU - Du, Jun

AU - Chen, Jingdong

PY - 2021

Y1 - 2021

N2 - In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by 8-channel circular microphone array for speech processing in conference scenario. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given most open source dataset for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversation speech, providing additional value for data diversity in speech community. We also release a PyTorch-based training and evaluation framework as baseline system to promote reproducible research in this field.

KW - AISHELL-4

KW - Conference scenario

KW - Mandarin

KW - Speaker diarization

KW - Speech front-end processing

KW - Speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85119251247&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2021-1397

DO - 10.21437/Interspeech.2021-1397

M3 - Conference contribution

AN - SCOPUS:85119251247

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 4406

EP - 4410

BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

PB - International Speech Communication Association

Y2 - 30 August 2021 through 3 September 2021

ER -

Fu Y, Cheng L, Lv S, Jv Y, Kong Y, Chen Z et al. AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association. 2021. p. 4406-4410. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). doi: 10.21437/Interspeech.2021-1397
