THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION

Shilong Wu; Chenxi Wang; Hang Chen; Yusheng Dai; Chenyue Zhang; Ruoyu Wang; Hongbo Lan; Jun Du; Chin Hui Lee; Jingdong Chen; Sabato Marco Siniscalchi; Odette Scharenborg; Zhong Qiu Wang; Jia Pan; Jianqing Gao

doi:10.1109/ICASSP48485.2024.10447462

School of Marine Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.

Original language	English
Title of host publication	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	8351-8355
Number of pages	5
ISBN (Electronic)	9798350344851
DOIs	https://doi.org/10.1109/ICASSP48485.2024.10447462
State	Published - 2024
Event	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of; Duration: 14 Apr 2024 → 19 Apr 2024

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print)	1520-6149

Conference

Conference	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Country/Territory	Korea, Republic of
City	Seoul
Period	14/04/24 → 19/04/24

Keywords

MISP challenge
multimodality
real-world scenarios
target speaker extraction

Cite this

Wu, S., Wang, C., Chen, H., Dai, Y., Zhang, C., Wang, R., Lan, H., Du, J., Lee, C. H., Chen, J., Siniscalchi, S. M., Scharenborg, O., Wang, Z. Q., Pan, J., & Gao, J. (2024). THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings (pp. 8351-8355). (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP48485.2024.10447462

@inproceedings{e112d5d7319649839c64687ab918d383,

title = "THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION",

keywords = "MISP challenge, multimodality, real-world scenarios, target speaker extraction",

author = "Shilong Wu and Chenxi Wang and Hang Chen and Yusheng Dai and Chenyue Zhang and Ruoyu Wang and Hongbo Lan and Jun Du and Lee, {Chin Hui} and Jingdong Chen and Siniscalchi, {Sabato Marco} and Odette Scharenborg and Wang, {Zhong Qiu} and Jia Pan and Jianqing Gao",

note = "Publisher Copyright: {\textcopyright}2024 IEEE.; 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 ; Conference date: 14-04-2024 Through 19-04-2024",

year = "2024",

doi = "10.1109/ICASSP48485.2024.10447462",

language = "English",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "8351--8355",

booktitle = "2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings",

}


TY - GEN

T1 - THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION

T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024

AU - Wu, Shilong

AU - Wang, Chenxi

AU - Chen, Hang

AU - Dai, Yusheng

AU - Zhang, Chenyue

AU - Wang, Ruoyu

AU - Lan, Hongbo

AU - Du, Jun

AU - Lee, Chin Hui

AU - Chen, Jingdong

AU - Siniscalchi, Sabato Marco

AU - Scharenborg, Odette

AU - Wang, Zhong Qiu

AU - Pan, Jia

AU - Gao, Jianqing

PY - 2024

Y1 - 2024

KW - MISP challenge

KW - multimodality

KW - real-world scenarios

KW - target speaker extraction

UR - http://www.scopus.com/inward/record.url?scp=85188695414&partnerID=8YFLogxK

U2 - 10.1109/ICASSP48485.2024.10447462

DO - 10.1109/ICASSP48485.2024.10447462

M3 - Conference contribution

AN - SCOPUS:85188695414

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 8351

EP - 8355

BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 14 April 2024 through 19 April 2024

ER -
