AUTOMATIC CHANNEL SELECTION AND SPATIAL FEATURE INTEGRATION FOR MULTI-CHANNEL SPEECH RECOGNITION ACROSS VARIOUS ARRAY TOPOLOGIES

Bingshen Mu; Pengcheng Guo; Dake Guo; Pan Zhou; Wei Chen; Lei Xie

doi:10.1109/ICASSP48485.2024.10447563

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies, each with multiple recording devices. The focus of the CHiME-7 Distant ASR task is to devise a unified system that generalizes across various array topologies with multiple recording devices and offers reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that performs strongly across various array topologies. First, we propose two attention-based automatic channel selection modules that select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Second, we introduce inter-channel spatial features to make multi-frame cross-channel attention more effective by improving its awareness of spatial information. Finally, we propose a multi-layer convolution fusion module, inspired by the U-Net architecture, that integrates the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation show that these improvements yield a relative reduction of 40.1% in Macro Diarization Attributed Word Error Rate (DA-WER) compared with the baseline ASR system on the Eval sets.
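
The headline figure in the abstract is a relative reduction in a macro-averaged error rate. The sketch below illustrates how such a figure could be computed, assuming that "macro" means an unweighted mean over evaluation scenarios (an assumption; the abstract does not define it). The scenario names and all error-rate values are hypothetical placeholders, not results from the paper.

```python
def macro_average(per_scenario_wer):
    """Unweighted mean of per-scenario error rates (assumed meaning of 'macro')."""
    values = list(per_scenario_wer.values())
    return sum(values) / len(values)


def relative_reduction(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical per-scenario DA-WER values in percent (placeholders only).
baseline_wer = {"scenario_a": 40.0, "scenario_b": 30.0, "scenario_c": 20.0}
proposed_wer = {"scenario_a": 24.0, "scenario_b": 18.0, "scenario_c": 12.0}

baseline_macro = macro_average(baseline_wer)
proposed_macro = macro_average(proposed_wer)
reduction = relative_reduction(baseline_macro, proposed_macro)

print(f"baseline macro DA-WER: {baseline_macro:.1f}%")
print(f"proposed macro DA-WER: {proposed_macro:.1f}%")
print(f"relative reduction:    {reduction:.1f}%")
```

A relative reduction is measured against the baseline value, so it differs from the absolute difference in percentage points between the two macro averages.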

Original language	English
Title of host publication	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	11396-11400
Number of pages	5
ISBN (Electronic)	9798350344851
DOIs	https://doi.org/10.1109/ICASSP48485.2024.10447563
State	Published - 2024
Event	49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Republic of Korea. Duration: 14 Apr 2024 → 19 Apr 2024

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print)	1520-6149

Conference

Conference	49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Country/Territory	Korea, Republic of
City	Seoul
Period	14/04/24 → 19/04/24

Keywords

channel selection
Distant automatic speech recognition
multi-channel fusion
spatial features

Cite this

Mu, B., Guo, P., Guo, D., Zhou, P., Chen, W., & Xie, L. (2024). AUTOMATIC CHANNEL SELECTION AND SPATIAL FEATURE INTEGRATION FOR MULTI-CHANNEL SPEECH RECOGNITION ACROSS VARIOUS ARRAY TOPOLOGIES. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (pp. 11396-11400). (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP48485.2024.10447563

Mu, Bingshen ; Guo, Pengcheng ; Guo, Dake et al. / AUTOMATIC CHANNEL SELECTION AND SPATIAL FEATURE INTEGRATION FOR MULTI-CHANNEL SPEECH RECOGNITION ACROSS VARIOUS ARRAY TOPOLOGIES. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. pp. 11396-11400 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{f3cba9790bfe415eb65011542de28fb7,

title = "AUTOMATIC CHANNEL SELECTION AND SPATIAL FEATURE INTEGRATION FOR MULTI-CHANNEL SPEECH RECOGNITION ACROSS VARIOUS ARRAY TOPOLOGIES",

abstract = "Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies, each with multiple recording devices. The focus of the CHiME-7 Distant ASR task is to devise a unified system that generalizes across various array topologies with multiple recording devices and offers reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that performs strongly across various array topologies. First, we propose two attention-based automatic channel selection modules that select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Second, we introduce inter-channel spatial features to make multi-frame cross-channel attention more effective by improving its awareness of spatial information. Finally, we propose a multi-layer convolution fusion module, inspired by the U-Net architecture, that integrates the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation show that these improvements yield a relative reduction of 40.1% in Macro Diarization Attributed Word Error Rate (DA-WER) compared with the baseline ASR system on the Eval sets.",

keywords = "channel selection, Distant automatic speech recognition, multi-channel fusion, spatial features",

author = "Bingshen Mu and Pengcheng Guo and Dake Guo and Pan Zhou and Wei Chen and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 ; Conference date: 14-04-2024 Through 19-04-2024",

year = "2024",

doi = "10.1109/ICASSP48485.2024.10447563",

language = "English",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "11396--11400",

booktitle = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

}

Mu, B, Guo, P, Guo, D, Zhou, P, Chen, W & Xie, L 2024, AUTOMATIC CHANNEL SELECTION AND SPATIAL FEATURE INTEGRATION FOR MULTI-CHANNEL SPEECH RECOGNITION ACROSS VARIOUS ARRAY TOPOLOGIES. in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 11396-11400, 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Republic of, 14/04/24. https://doi.org/10.1109/ICASSP48485.2024.10447563

AUTOMATIC CHANNEL SELECTION AND SPATIAL FEATURE INTEGRATION FOR MULTI-CHANNEL SPEECH RECOGNITION ACROSS VARIOUS ARRAY TOPOLOGIES. / Mu, Bingshen; Guo, Pengcheng; Guo, Dake et al.
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. p. 11396-11400 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

TY - GEN

T1 - AUTOMATIC CHANNEL SELECTION AND SPATIAL FEATURE INTEGRATION FOR MULTI-CHANNEL SPEECH RECOGNITION ACROSS VARIOUS ARRAY TOPOLOGIES

AU - Mu, Bingshen

AU - Guo, Pengcheng

AU - Guo, Dake

AU - Zhou, Pan

AU - Chen, Wei

AU - Xie, Lei

PY - 2024

Y1 - 2024

AB - Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies, each with multiple recording devices. The focus of the CHiME-7 Distant ASR task is to devise a unified system that generalizes across various array topologies with multiple recording devices and offers reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that performs strongly across various array topologies. First, we propose two attention-based automatic channel selection modules that select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Second, we introduce inter-channel spatial features to make multi-frame cross-channel attention more effective by improving its awareness of spatial information. Finally, we propose a multi-layer convolution fusion module, inspired by the U-Net architecture, that integrates the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation show that these improvements yield a relative reduction of 40.1% in Macro Diarization Attributed Word Error Rate (DA-WER) compared with the baseline ASR system on the Eval sets.

KW - channel selection

KW - Distant automatic speech recognition

KW - multi-channel fusion

KW - spatial features

UR - http://www.scopus.com/inward/record.url?scp=85194473290&partnerID=8YFLogxK

U2 - 10.1109/ICASSP48485.2024.10447563

DO - 10.1109/ICASSP48485.2024.10447563

M3 - Conference contribution

AN - SCOPUS:85194473290

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 11396

EP - 11400

BT - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024

Y2 - 14 April 2024 through 19 April 2024

ER -

Mu B, Guo P, Guo D, Zhou P, Chen W, Xie L. AUTOMATIC CHANNEL SELECTION AND SPATIAL FEATURE INTEGRATION FOR MULTI-CHANNEL SPEECH RECOGNITION ACROSS VARIOUS ARRAY TOPOLOGIES. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2024. p. 11396-11400. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP48485.2024.10447563
