TY - JOUR
T1 - Audio-Visual Speech Recognition in MISP2021 Challenge
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
AU - Chen, Hang
AU - Du, Jun
AU - Dai, Yusheng
AU - Lee, Chin-Hui
AU - Siniscalchi, Sabato Marco
AU - Watanabe, Shinji
AU - Scharenborg, Odette
AU - Chen, Jingdong
AU - Yin, Bao-Cai
AU - Pan, Jia
N1 - Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - In this paper, we present the updated Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge, a large-scale audio-visual Chinese conversational corpus consisting of 141 hours of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. To the best of our knowledge, our corpus is the first distant multi-microphone conversational Chinese audio-visual corpus and the first large-vocabulary continuous Chinese lip-reading dataset in the adverse home-TV scenario. Moreover, we perform a deep analysis of the corpus and conduct a comprehensive ablation study of all audio and video data in audio-only, video-only, and audio-visual systems. Error analysis shows that the video modality supplements acoustic information degraded by noise, reducing deletion errors, and provides discriminative information in overlapping speech, reducing substitution errors. Finally, we design a set of experiments on frontend processing, data augmentation, and end-to-end models to indicate directions for future work. The corpus and the code are released to promote research not only in the speech area but also in the computer vision area and in cross-disciplinary research.
AB - In this paper, we present the updated Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge, a large-scale audio-visual Chinese conversational corpus consisting of 141 hours of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. To the best of our knowledge, our corpus is the first distant multi-microphone conversational Chinese audio-visual corpus and the first large-vocabulary continuous Chinese lip-reading dataset in the adverse home-TV scenario. Moreover, we perform a deep analysis of the corpus and conduct a comprehensive ablation study of all audio and video data in audio-only, video-only, and audio-visual systems. Error analysis shows that the video modality supplements acoustic information degraded by noise, reducing deletion errors, and provides discriminative information in overlapping speech, reducing substitution errors. Finally, we design a set of experiments on frontend processing, data augmentation, and end-to-end models to indicate directions for future work. The corpus and the code are released to promote research not only in the speech area but also in the computer vision area and in cross-disciplinary research.
KW - Audio-visual
KW - data augmentation
KW - speech enhancement
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85140070841&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-10483
DO - 10.21437/Interspeech.2022-10483
M3 - Conference article
AN - SCOPUS:85140070841
SN - 2308-457X
VL - 2022-September
SP - 1766
EP - 1770
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 18 September 2022 through 22 September 2022
ER -