TY - GEN
T1 - THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
AU - Wu, Shilong
AU - Wang, Chenxi
AU - Chen, Hang
AU - Dai, Yusheng
AU - Zhang, Chenyue
AU - Wang, Ruoyu
AU - Lan, Hongbo
AU - Du, Jun
AU - Lee, Chin Hui
AU - Chen, Jingdong
AU - Siniscalchi, Sabato Marco
AU - Scharenborg, Odette
AU - Wang, Zhong Qiu
AU - Pan, Jia
AU - Gao, Jianqing
N1 - Publisher Copyright:
©2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.
AB - Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.
KW - MISP challenge
KW - multimodality
KW - real-world scenarios
KW - target speaker extraction
UR - http://www.scopus.com/inward/record.url?scp=85188695414&partnerID=8YFLogxK
U2 - 10.1109/ICASSP48485.2024.10447462
DO - 10.1109/ICASSP48485.2024.10447462
M3 - 会议稿件
AN - SCOPUS:85188695414
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8351
EP - 8355
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 April 2024 through 19 April 2024
ER -