TY - GEN
T1 - The Multimodal Information Based Speech Processing (MISP) 2022 Challenge
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
AU - Wang, Zhe
AU - Wu, Shilong
AU - Chen, Hang
AU - He, Mao Kui
AU - Du, Jun
AU - Lee, Chin Hui
AU - Chen, Jingdong
AU - Watanabe, Shinji
AU - Siniscalchi, Sabato
AU - Scharenborg, Odette
AU - Liu, Diyuan
AU - Yin, Baocai
AU - Pan, Jia
AU - Gao, Jianqing
AU - Liu, Cong
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), aiming to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that focuses on addressing "who spoke what when" using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system, and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and indistinguishable speakers.
AB - The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), aiming to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that focuses on addressing "who spoke what when" using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system, and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and indistinguishable speakers.
KW - MISP challenge
KW - multimodality
KW - speaker diarization
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85177603071&partnerID=8YFLogxK
U2 - 10.1109/ICASSP49357.2023.10094836
DO - 10.1109/ICASSP49357.2023.10094836
M3 - Conference contribution
AN - SCOPUS:85177603071
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 4 June 2023 through 10 June 2023
ER -