Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge

Hang Chen; Shilong Wu; Chenxi Wang; Jun Du; Chin Hui Lee; Sabato Marco Siniscalchi; Shinji Watanabe; Jingdong Chen; Odette Scharenborg; Zhong Qiu Wang; Bao Cai Yin; Jia Pan

doi:10.1109/ICASSPW62465.2024.10627330

School of Marine Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.

Original language	English
Title of host publication	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	123-124
Number of pages	2
ISBN (Electronic)	9798350374513
DOIs	https://doi.org/10.1109/ICASSPW62465.2024.10627330
State	Published - 2024
Event	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Seoul, Korea, Republic of; Duration: 14 Apr 2024 → 19 Apr 2024

Publication series

Name	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings

Conference

Conference	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Country/Territory	Korea, Republic of
City	Seoul
Period	14/04/24 → 19/04/24

Keywords

MISP challenge
audio-visual
robust speech recognition
target speaker extraction

Cite this

Chen, H., Wu, S., Wang, C., Du, J., Lee, C. H., Siniscalchi, S. M., Watanabe, S., Chen, J., Scharenborg, O., Wang, Z. Q., Yin, B. C., & Pan, J. (2024). Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings (pp. 123-124). (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICASSPW62465.2024.10627330

Chen, Hang ; Wu, Shilong ; Wang, Chenxi et al. / Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. pp. 123-124 (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings).

@inproceedings{36567dec624549fa8c735c0869a89195,
title = "Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge",
abstract = "Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.",
keywords = "MISP challenge, audio-visual, robust speech recognition, target speaker extraction",
author = "Hang Chen and Shilong Wu and Chenxi Wang and Jun Du and Lee, {Chin Hui} and Siniscalchi, {Sabato Marco} and Shinji Watanabe and Jingdong Chen and Odette Scharenborg and Wang, {Zhong Qiu} and Yin, {Bao Cai} and Jia Pan",
note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 ; Conference date: 14-04-2024 Through 19-04-2024",
year = "2024",
doi = "10.1109/ICASSPW62465.2024.10627330",
language = "English",
series = "2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "123--124",
booktitle = "2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings",
}

Chen, H, Wu, S, Wang, C, Du, J, Lee, CH, Siniscalchi, SM, Watanabe, S, Chen, J, Scharenborg, O, Wang, ZQ, Yin, BC & Pan, J 2024, Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 123-124, 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024, Seoul, Korea, Republic of, 14/04/24. https://doi.org/10.1109/ICASSPW62465.2024.10627330

Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. / Chen, Hang; Wu, Shilong; Wang, Chenxi et al.
2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. p. 123-124 (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings).


TY - GEN
T1 - Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge
AU - Chen, Hang
AU - Wu, Shilong
AU - Wang, Chenxi
AU - Du, Jun
AU - Lee, Chin Hui
AU - Siniscalchi, Sabato Marco
AU - Watanabe, Shinji
AU - Chen, Jingdong
AU - Scharenborg, Odette
AU - Wang, Zhong Qiu
AU - Yin, Bao Cai
AU - Pan, Jia
PY - 2024
Y1 - 2024
AB - Historically, MISP challenges have focused on audio-visual speech recognition (AVSR), where they have been particularly successful in complex acoustic scenarios. However, even the most sophisticated AVSR systems have been found to have performance limitations. Inspired by traditional robust speech recognition systems, where speech enhancement as a front-end can significantly improve accuracy, the MISP2023 challenge focused on audio-visual target speaker extraction (AVTSE). The primary goal of AVTSE is to enhance speech quality by exploiting the lip movements of the target speaker, thereby improving the final recognition performance. This paper provides a comprehensive overview of the challenge framework, describes the results, and summarizes the effective strategies employed by the contributions. In addition, we analyze the prevailing technical hurdles and provide recommendations for future directions to spur further progress in the AVTSE field.
KW - MISP challenge
KW - audio-visual
KW - robust speech recognition
KW - target speaker extraction
UR - http://www.scopus.com/inward/record.url?scp=85202433590&partnerID=8YFLogxK
U2 - 10.1109/ICASSPW62465.2024.10627330
DO - 10.1109/ICASSPW62465.2024.10627330
M3 - Conference contribution
AN - SCOPUS:85202433590
T3 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
SP - 123
EP - 124
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Y2 - 14 April 2024 through 19 April 2024
ER -

Chen H, Wu S, Wang C, Du J, Lee CH, Siniscalchi SM et al. Summary on the Multimodal Information-Based Speech Processing (MISP) 2023 Challenge. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2024. p. 123-124. (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings). doi: 10.1109/ICASSPW62465.2024.10627330
