THE FIRST MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) CHALLENGE: DATA, TASKS, BASELINES AND RESULTS

Hang Chen; Hengshun Zhou; Jun Du; Chin Hui Lee; Jingdong Chen; Shinji Watanabe; Sabato Marco Siniscalchi; Odette Scharenborg; Di Yuan Liu; Bao Cai Yin; Jia Pan; Jian Qing Gao; Cong Liu

doi:10.1109/ICASSP43922.2022.9746683


School of Marine Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

46 Scopus citations

Abstract

In this paper, we discuss the rationale of the Multimodal Information Based Speech Processing (MISP) Challenge and provide a detailed description of the recorded data, the two evaluation tasks, and the corresponding baselines, followed by a summary of the submitted systems and evaluation results. The MISP Challenge aims to tackle speech processing tasks in different scenarios by introducing information from an additional modality (e.g., video or text), with the goal of improving environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real home TV room, together with two reproducible open-source baseline systems, were released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.

Original language	English
Title of host publication	2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	9266-9270
Number of pages	5
ISBN (Electronic)	9781665405409
DOIs	https://doi.org/10.1109/ICASSP43922.2022.9746683
State	Published - 2022
Event	2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 - Hybrid, Singapore; Duration: 22 May 2022 → 27 May 2022

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2022-May
ISSN (Print)	1520-6149

Conference

Conference	2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
Country/Territory	Singapore
City	Hybrid
Period	22/05/22 → 27/05/22

Keywords

MISP challenge
audiovisual
automatic speech recognition
microphone array
wake word spotting

Access to Document

10.1109/ICASSP43922.2022.9746683

Cite this

Chen, H., Zhou, H., Du, J., Lee, C. H., Chen, J., Watanabe, S., Siniscalchi, S. M., Scharenborg, O., Liu, D. Y., Yin, B. C., Pan, J., Gao, J. Q., & Liu, C. (2022). THE FIRST MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) CHALLENGE: DATA, TASKS, BASELINES AND RESULTS. In 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings (pp. 9266-9270). (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2022-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP43922.2022.9746683

Chen, Hang ; Zhou, Hengshun ; Du, Jun et al. / THE FIRST MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) CHALLENGE : DATA, TASKS, BASELINES AND RESULTS. 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 9266-9270 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{3c0bef7ffbbc4406894b57d8b0c94912,

title = "THE FIRST MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) CHALLENGE: DATA, TASKS, BASELINES AND RESULTS",

abstract = "In this paper, we discuss the rationale of the Multimodal Information Based Speech Processing (MISP) Challenge and provide a detailed description of the recorded data, the two evaluation tasks, and the corresponding baselines, followed by a summary of the submitted systems and evaluation results. The MISP Challenge aims to tackle speech processing tasks in different scenarios by introducing information from an additional modality (e.g., video or text), with the goal of improving environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real home TV room, together with two reproducible open-source baseline systems, were released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.",

keywords = "MISP challenge, audiovisual, automatic speech recognition, microphone array, wake word spotting",

author = "Hang Chen and Hengshun Zhou and Jun Du and Lee, {Chin Hui} and Jingdong Chen and Shinji Watanabe and Siniscalchi, {Sabato Marco} and Odette Scharenborg and Liu, {Di Yuan} and Yin, {Bao Cai} and Jia Pan and Gao, {Jian Qing} and Cong Liu",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE; 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 ; Conference date: 22-05-2022 Through 27-05-2022",

year = "2022",

doi = "10.1109/ICASSP43922.2022.9746683",

language = "English",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "9266--9270",

booktitle = "2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings",

}

Chen, H, Zhou, H, Du, J, Lee, CH, Chen, J, Watanabe, S, Siniscalchi, SM, Scharenborg, O, Liu, DY, Yin, BC, Pan, J, Gao, JQ & Liu, C 2022, THE FIRST MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) CHALLENGE: DATA, TASKS, BASELINES AND RESULTS. in 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2022-May, Institute of Electrical and Electronics Engineers Inc., pp. 9266-9270, 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Hybrid, Singapore, 22/05/22. https://doi.org/10.1109/ICASSP43922.2022.9746683

THE FIRST MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) CHALLENGE: DATA, TASKS, BASELINES AND RESULTS. / Chen, Hang; Zhou, Hengshun; Du, Jun et al.
2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2022. p. 9266-9270 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2022-May).


TY - GEN

T1 - THE FIRST MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) CHALLENGE

T2 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022

AU - Chen, Hang

AU - Zhou, Hengshun

AU - Du, Jun

AU - Lee, Chin Hui

AU - Chen, Jingdong

AU - Watanabe, Shinji

AU - Siniscalchi, Sabato Marco

AU - Scharenborg, Odette

AU - Liu, Di Yuan

AU - Yin, Bao Cai

AU - Pan, Jia

AU - Gao, Jian Qing

AU - Liu, Cong

PY - 2022

Y1 - 2022

N2 - In this paper, we discuss the rationale of the Multimodal Information Based Speech Processing (MISP) Challenge and provide a detailed description of the recorded data, the two evaluation tasks, and the corresponding baselines, followed by a summary of the submitted systems and evaluation results. The MISP Challenge aims to tackle speech processing tasks in different scenarios by introducing information from an additional modality (e.g., video or text), with the goal of improving environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real home TV room, together with two reproducible open-source baseline systems, were released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.

AB - In this paper, we discuss the rationale of the Multimodal Information Based Speech Processing (MISP) Challenge and provide a detailed description of the recorded data, the two evaluation tasks, and the corresponding baselines, followed by a summary of the submitted systems and evaluation results. The MISP Challenge aims to tackle speech processing tasks in different scenarios by introducing information from an additional modality (e.g., video or text), with the goal of improving environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real home TV room, together with two reproducible open-source baseline systems, were released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.

KW - MISP challenge

KW - audiovisual

KW - automatic speech recognition

KW - microphone array

KW - wake word spotting

UR - http://www.scopus.com/inward/record.url?scp=85129485765&partnerID=8YFLogxK

U2 - 10.1109/ICASSP43922.2022.9746683

DO - 10.1109/ICASSP43922.2022.9746683

M3 - Conference contribution

AN - SCOPUS:85129485765

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 9266

EP - 9270

BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 22 May 2022 through 27 May 2022

ER -

Chen H, Zhou H, Du J, Lee CH, Chen J, Watanabe S et al. THE FIRST MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) CHALLENGE: DATA, TASKS, BASELINES AND RESULTS. In 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2022. p. 9266-9270. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP43922.2022.9746683
