MFCCA: Multi-Frame Cross-Channel Attention for Multi-Speaker ASR in Multi-Party Meeting Scenario

Fan Yu; Shiliang Zhang; Pengcheng Guo; Yuhao Liang; Zhihao Du; Yuxiao Lin; Lei Xie

doi:10.1109/SLT54892.2023.10022715

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Scopus citations

Abstract

Recently, cross-channel attention, which better leverages multi-channel signals from a microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses either on learning global correlations between the sequences of different channels or on effectively exploiting fine-grained channel-wise information at each time step. Considering the delay with which sound reaches the different microphones of an array, we propose multi-frame cross-channel attention, which models cross-channel information between adjacent frames to exploit the complementarity of frame-wise and channel-wise knowledge. We also propose a multi-layer convolutional mechanism to fuse the multi-channel outputs, and a channel masking strategy to combat the mismatch in the number of channels between training and inference. Experiments on AliMeeting, a real-world corpus, show that our proposed model outperforms the single-channel model, with CER reductions of 31.7% and 37.0% on the Eval and Test sets, respectively. Moreover, with a comparable number of model parameters and amount of training data, our proposed model achieves new state-of-the-art performance on the AliMeeting corpus compared with the top-ranking systems in the ICASSP 2022 M2MeT challenge, a recently held multi-channel multi-speaker ASR challenge.
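The abstract describes multi-frame cross-channel attention and channel masking only at a high level. The following is a minimal NumPy sketch of the general idea, not the authors' implementation: each channel's frame at time t serves as the query, and the keys/values are the frames of all channels within a window of plus or minus `context` frames around t, so the attention can absorb small arrival-time differences between microphones. Assumptions: a single head, no learned query/key/value projections, zero padding at the utterance edges (padded positions are masked out), and the names `multi_frame_cross_channel_attention`, `random_channel_mask`, and `context` are hypothetical. The random channel mask is a simplified stand-in for the paper's channel masking strategy, and the paper's multi-layer convolutional fusion of the per-channel outputs is not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def random_channel_mask(num_channels, rng, min_keep=1):
    # Hypothetical training-time channel masking: keep a random subset of
    # between min_keep and num_channels channels (True = channel kept).
    k = rng.integers(min_keep, num_channels + 1)
    keep = rng.choice(num_channels, size=k, replace=False)
    mask = np.zeros(num_channels, dtype=bool)
    mask[keep] = True
    return mask


def multi_frame_cross_channel_attention(x, context=2, channel_mask=None):
    """Single-head sketch without learned projections.

    x: array of shape (C, T, D) with C channels, T frames, D feature dims.
    context: hypothetical number of neighboring frames on each side of t.
    channel_mask: optional bool array of shape (C,); True marks a present channel.
    Returns an array of shape (C, T, D); rows of masked channels are zero.
    """
    C, T, D = x.shape
    if channel_mask is None:
        channel_mask = np.ones(C, dtype=bool)
    W = 2 * context + 1
    # Zero-pad along time and remember which padded positions are real frames.
    padded = np.pad(x, ((0, 0), (context, context), (0, 0)))
    valid_time = np.zeros(T + 2 * context, dtype=bool)
    valid_time[context:context + T] = True
    out = np.zeros_like(x)
    for t in range(T):
        # Keys/values: every channel over frames t-context .. t+context,
        # flattened channel-major to shape (C * W, D).
        window = padded[:, t:t + W, :].reshape(C * W, D)
        # A key is usable only if its channel is present and its frame is real.
        key_mask = (channel_mask[:, None] & valid_time[None, t:t + W]).reshape(C * W)
        # Queries: each channel's own frame t, shape (C, D).
        q = x[:, t, :]
        logits = q @ window.T / np.sqrt(D)  # shape (C, C * W)
        logits = np.where(key_mask[None, :], logits, -np.inf)
        out[:, t, :] = softmax(logits, axis=-1) @ window
    out[~channel_mask] = 0.0  # masked channels produce no output
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 50, 16))  # 8 channels, 50 frames, 16-dim features
mask = random_channel_mask(8, rng)
y = multi_frame_cross_channel_attention(x, context=2, channel_mask=mask)
print(y.shape)  # (8, 50, 16)
```

Masking dropped channels at the key level (rather than merely zeroing their features) means a dropped channel has no influence at all on the remaining channels' outputs. In this sketch, that is the property random masking during training is meant to expose the model to, so that it can cope with fewer microphones at inference time.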

Original language	English
Title of host publication	2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	144-151
Number of pages	8
ISBN (Electronic)	9798350396904
DOIs	https://doi.org/10.1109/SLT54892.2023.10022715
State	Published - 2023
Event	2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Doha, Qatar Duration: 9 Jan 2023 → 12 Jan 2023

Publication series

Name	2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings

Conference

Conference	2022 IEEE Spoken Language Technology Workshop, SLT 2022
Country/Territory	Qatar
City	Doha
Period	9/01/23 → 12/01/23

Keywords

AliMeeting
M2MeT
Multi-speaker ASR
cross-channel attention
multi-channel

Cite this

Yu, F., Zhang, S., Guo, P., Liang, Y., Du, Z., Lin, Y., & Xie, L. (2023). MFCCA: Multi-Frame Cross-Channel Attention for Multi-Speaker ASR in Multi-Party Meeting Scenario. In 2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings (pp. 144-151). (2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SLT54892.2023.10022715

@inproceedings{5fcf5729c12c49e482ab84aff0556ed5,

title = "MFCCA: Multi-Frame Cross-Channel Attention for Multi-Speaker ASR in Multi-Party Meeting Scenario",

keywords = "AliMeeting, M2MeT, Multi-speaker ASR, cross-channel attention, multi-channel",

author = "Fan Yu and Shiliang Zhang and Pengcheng Guo and Yuhao Liang and Zhihao Du and Yuxiao Lin and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2022 IEEE Spoken Language Technology Workshop, SLT 2022 ; Conference date: 09-01-2023 Through 12-01-2023",

year = "2023",

doi = "10.1109/SLT54892.2023.10022715",

language = "English",

series = "2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "144--151",

booktitle = "2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings",

}

TY - GEN

T1 - MFCCA: Multi-Frame Cross-Channel Attention for Multi-Speaker ASR in Multi-Party Meeting Scenario

AU - Yu, Fan

AU - Zhang, Shiliang

AU - Guo, Pengcheng

AU - Liang, Yuhao

AU - Du, Zhihao

AU - Lin, Yuxiao

AU - Xie, Lei

PY - 2023

Y1 - 2023

KW - AliMeeting

KW - M2MeT

KW - Multi-speaker ASR

KW - cross-channel attention

KW - multi-channel

UR - http://www.scopus.com/inward/record.url?scp=85147796650&partnerID=8YFLogxK

U2 - 10.1109/SLT54892.2023.10022715

DO - 10.1109/SLT54892.2023.10022715

M3 - Conference contribution

AN - SCOPUS:85147796650

T3 - 2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings

SP - 144

EP - 151

BT - 2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2022 IEEE Spoken Language Technology Workshop, SLT 2022

Y2 - 9 January 2023 through 12 January 2023

ER -
