THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION

Shilong Wu; Chenxi Wang; Hang Chen; Yusheng Dai; Chenyue Zhang; Ruoyu Wang; Hongbo Lan; Jun Du; Chin Hui Lee; Jingdong Chen; Sabato Marco Siniscalchi; Odette Scharenborg; Zhong Qiu Wang; Jia Pan; Jianqing Gao

doi:10.1109/ICASSP48485.2024.10447462

School of Marine Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

3 citations (Scopus)

Abstract

Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, even the most advanced back-end recognition systems often hit performance limits in complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in the ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges, which primarily focus on simulated data, the MISP 2023 challenge explores how front-end speech processing, combined with visual cues, impacts back-end tasks in real-world scenarios. This effort aims to set the first benchmark for the AVTSE task, offering fresh insights into improving the accuracy of back-end speech recognition systems through AVTSE in challenging, real acoustic environments. This paper gives an overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also analyzes the difficulties participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.

Original language	English
Title of host publication	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	8351-8355
Number of pages	5
ISBN (electronic)	9798350344851
DOI	https://doi.org/10.1109/ICASSP48485.2024.10447462
Publication status	Published - 2024
Event	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, South Korea. Duration: 14 Apr 2024 → 19 Apr 2024

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (print)	1520-6149

Conference

Conference	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Country/Territory	South Korea
City	Seoul
Period	14/04/24 → 19/04/24

Cite this

Wu, S., Wang, C., Chen, H., Dai, Y., Zhang, C., Wang, R., Lan, H., Du, J., Lee, C. H., Chen, J., Siniscalchi, S. M., Scharenborg, O., Wang, Z. Q., Pan, J., & Gao, J. (2024). THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings (pp. 8351-8355). (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP48485.2024.10447462

Wu, Shilong; Wang, Chenxi; Chen, Hang et al. / THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. pp. 8351-8355 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{e112d5d7319649839c64687ab918d383,
title = "THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION",
abstract = "Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.",
keywords = "MISP challenge, multimodality, real-world scenarios, target speaker extraction",
author = "Shilong Wu and Chenxi Wang and Hang Chen and Yusheng Dai and Chenyue Zhang and Ruoyu Wang and Hongbo Lan and Jun Du and Lee, {Chin Hui} and Jingdong Chen and Siniscalchi, {Sabato Marco} and Odette Scharenborg and Wang, {Zhong Qiu} and Jia Pan and Jianqing Gao",
note = "Publisher Copyright: {\textcopyright}2024 IEEE.; 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 ; Conference date: 14-04-2024 Through 19-04-2024",
year = "2024",
doi = "10.1109/ICASSP48485.2024.10447462",
language = "English",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "8351--8355",
booktitle = "2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings",
}

Wu, S, Wang, C, Chen, H, Dai, Y, Zhang, C, Wang, R, Lan, H, Du, J, Lee, CH, Chen, J, Siniscalchi, SM, Scharenborg, O, Wang, ZQ, Pan, J & Gao, J 2024, THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION. in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 8351-8355, 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, South Korea, 14/04/24. https://doi.org/10.1109/ICASSP48485.2024.10447462

TY - GEN
T1 - THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
AU - Wu, Shilong
AU - Wang, Chenxi
AU - Chen, Hang
AU - Dai, Yusheng
AU - Zhang, Chenyue
AU - Wang, Ruoyu
AU - Lan, Hongbo
AU - Du, Jun
AU - Lee, Chin Hui
AU - Chen, Jingdong
AU - Siniscalchi, Sabato Marco
AU - Scharenborg, Odette
AU - Wang, Zhong Qiu
AU - Pan, Jia
AU - Gao, Jianqing
PY - 2024
Y1 - 2024
AB - Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.

KW - MISP challenge
KW - multimodality
KW - real-world scenarios
KW - target speaker extraction
UR - http://www.scopus.com/inward/record.url?scp=85188695414&partnerID=8YFLogxK
U2 - 10.1109/ICASSP48485.2024.10447462
DO - 10.1109/ICASSP48485.2024.10447462
M3 - Conference contribution
AN - SCOPUS:85188695414
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8351
EP - 8355
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 April 2024 through 19 April 2024
ER -

Wu S, Wang C, Chen H, Dai Y, Zhang C, Wang R et al. THE MULTIMODAL INFORMATION BASED SPEECH PROCESSING (MISP) 2023 CHALLENGE: AUDIO-VISUAL TARGET SPEAKER EXTRACTION. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2024. p. 8351-8355. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP48485.2024.10447462
