TY - GEN
T1 - Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge
AU - Chen, Hang
AU - Wu, Shilong
AU - Wang, Chenxi
AU - Du, Jun
AU - Lee, Chin Hui
AU - Siniscalchi, Sabato Marco
AU - Watanabe, Shinji
AU - Chen, Jingdong
AU - Scharenborg, Odette
AU - Wang, Zhong Qiu
AU - Yin, Bao Cai
AU - Pan, Jia
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.
KW - MISP challenge
KW - audio-visual
KW - robust speech recognition
KW - target speaker extraction
UR - http://www.scopus.com/inward/record.url?scp=85202433590&partnerID=8YFLogxK
U2 - 10.1109/ICASSPW62465.2024.10627330
DO - 10.1109/ICASSPW62465.2024.10627330
M3 - Conference contribution
AN - SCOPUS:85202433590
T3 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
SP - 123
EP - 124
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Y2 - 14 April 2024 through 19 April 2024
ER -