Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge

Hang Chen; Shilong Wu; Chenxi Wang; Jun Du; Chin Hui Lee; Sabato Marco Siniscalchi; Shinji Watanabe; Jingdong Chen; Odette Scharenborg; Zhong Qiu Wang; Bao Cai Yin; Jia Pan

doi:10.1109/ICASSPW62465.2024.10627330

School of Marine Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.

Original language	English
Title of host publication	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	123-124
Number of pages	2
ISBN (Electronic)	9798350374513
DOI	https://doi.org/10.1109/ICASSPW62465.2024.10627330
Publication status	Published - 2024
Event	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Seoul, South Korea. Duration: 14 Apr 2024 → 19 Apr 2024

Publication series

Name	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings

Conference

Conference	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Country/Territory	South Korea
City	Seoul
Period	14/04/24 → 19/04/24

Cite this

Chen, H., Wu, S., Wang, C., Du, J., Lee, C. H., Siniscalchi, S. M., Watanabe, S., Chen, J., Scharenborg, O., Wang, Z. Q., Yin, B. C., & Pan, J. (2024). Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings (pp. 123-124). (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSPW62465.2024.10627330

Chen, Hang ; Wu, Shilong ; Wang, Chenxi et al. / Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. pp. 123-124 (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings).

@inproceedings{36567dec624549fa8c735c0869a89195,

title = "Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge",

abstract = "Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.",

keywords = "MISP challenge, audio-visual, robust speech recognition, target speaker extraction",

author = "Hang Chen and Shilong Wu and Chenxi Wang and Jun Du and Lee, {Chin Hui} and Siniscalchi, {Sabato Marco} and Shinji Watanabe and Jingdong Chen and Odette Scharenborg and Wang, {Zhong Qiu} and Yin, {Bao Cai} and Jia Pan",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 ; Conference date: 14-04-2024 Through 19-04-2024",

year = "2024",

doi = "10.1109/ICASSPW62465.2024.10627330",

language = "English",

series = "2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "123--124",

booktitle = "2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings",

}

Chen, H, Wu, S, Wang, C, Du, J, Lee, CH, Siniscalchi, SM, Watanabe, S, Chen, J, Scharenborg, O, Wang, ZQ, Yin, BC & Pan, J 2024, Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 123-124, 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024, Seoul, South Korea, 14/04/24. https://doi.org/10.1109/ICASSPW62465.2024.10627330

Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. / Chen, Hang; Wu, Shilong; Wang, Chenxi et al.
2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. pp. 123-124 (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings).

TY - GEN

T1 - Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge

AU - Chen, Hang

AU - Wu, Shilong

AU - Wang, Chenxi

AU - Du, Jun

AU - Lee, Chin Hui

AU - Siniscalchi, Sabato Marco

AU - Watanabe, Shinji

AU - Chen, Jingdong

AU - Scharenborg, Odette

AU - Wang, Zhong Qiu

AU - Yin, Bao Cai

AU - Pan, Jia

PY - 2024

Y1 - 2024

AB - Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.

KW - MISP challenge

KW - audio-visual

KW - robust speech recognition

KW - target speaker extraction

UR - http://www.scopus.com/inward/record.url?scp=85202433590&partnerID=8YFLogxK

U2 - 10.1109/ICASSPW62465.2024.10627330

DO - 10.1109/ICASSPW62465.2024.10627330

M3 - Conference contribution

AN - SCOPUS:85202433590

T3 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings

SP - 123

EP - 124

BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024

Y2 - 14 April 2024 through 19 April 2024

ER -

Chen H, Wu S, Wang C, Du J, Lee CH, Siniscalchi SM et al. Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2024. p. 123-124. (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings). doi: 10.1109/ICASSPW62465.2024.10627330
