TY - JOUR
T1 - VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
AU - Zhang, Ao
AU - Wang, He
AU - Guo, Pengcheng
AU - Fu, Yihui
AU - Xie, Lei
AU - Gao, Yingying
AU - Zhang, Shilei
AU - Feng, Junlan
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - The performance of keyword spotting (KWS) systems based on the audio modality, commonly measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. Therefore, audio-visual keyword spotting, which leverages complementary relationships across multiple modalities, has recently gained much attention. However, current studies mainly focus on combining the separately learned representations of different modalities, rather than exploring inter-modal relationships during the modeling of each modality. In this paper, we propose a novel visual modality enhanced end-to-end KWS framework (VE-KWS), which fuses the audio and visual modalities from two aspects. The first is utilizing the speaker location information obtained from the lip region in videos to assist the training of a multi-channel audio beamformer. With the beamformer serving as an audio enhancement module, the acoustic distortions caused by far-field or noisy environments can be significantly suppressed. The second is conducting cross-attention between the different modalities to capture inter-modal relationships and aid the representation learning of each modality. Experiments on the MISP challenge corpus show that our proposed model achieves a 2.79% false rejection rate and a 2.95% false alarm rate on the Eval set, resulting in new SOTA performance compared with the top-ranking systems in the ICASSP 2022 MISP challenge.
KW - Audio-Visual Keyword Spotting
KW - Multi-Modality Fusion
KW - Robust Keyword Spotting
UR - http://www.scopus.com/inward/record.url?scp=85171589176&partnerID=8YFLogxK
U2 - 10.1109/ICASSP49357.2023.10096858
DO - 10.1109/ICASSP49357.2023.10096858
M3 - Conference article
AN - SCOPUS:85171589176
SN - 1520-6149
JO - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
JF - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Y2 - 4 June 2023 through 10 June 2023
ER -