CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition

He Wang, Xucheng Wan, Naijun Zheng, Kai Liu, Huan Zhou, Guojian Li, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Code-switching automatic speech recognition (ASR) aims to accurately transcribe speech that contains two or more languages. To better capture language-specific speech representations and address language confusion in code-switching ASR, the mixture-of-experts (MoE) architecture and an additional language diarization (LD) decoder are commonly employed. However, most studies still rely on simple operations such as weighted summation or concatenation to fuse language-specific speech representations, leaving significant room to better integrate language bias information. In this paper, we introduce CAMEL, a cross-attention-based MoE and language-bias approach for code-switching ASR. Specifically, after each MoE layer, we fuse language-specific speech representations with cross-attention, leveraging its strong contextual modeling ability. Additionally, we design a source-attention-based mechanism to incorporate language information from the LD decoder output into the text embeddings. Experimental results demonstrate that our approach achieves state-of-the-art performance on the SEAME, ASRU200, and ASRU700+LibriSpeech460 Mandarin-English code-switching ASR datasets.
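The cross-attention fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, two hypothetical expert outputs `h_zh` and `h_en` (Mandarin and English streams of the MoE layer), and a simple symmetric fusion where each stream attends to the other and the results are summed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value):
    """Single-head scaled dot-product cross-attention (no learned projections).

    query:     (T_q, d)  frames of one language stream
    key_value: (T_kv, d) frames of the other language stream
    """
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)   # (T_q, T_kv)
    weights = softmax(scores, axis=-1)          # rows sum to 1
    return weights @ key_value                  # (T_q, d)

rng = np.random.default_rng(0)
T, d = 5, 8                          # frames, feature dim (toy sizes)
h_zh = rng.normal(size=(T, d))       # hypothetical Mandarin-expert output
h_en = rng.normal(size=(T, d))       # hypothetical English-expert output

# Each language stream attends to the other; the two views are summed.
fused = cross_attention(h_zh, h_en) + cross_attention(h_en, h_zh)
print(fused.shape)  # (5, 8)
```

In a full model the query, key, and value would pass through learned linear projections and the fused representation would feed the next encoder layer; the sketch only shows why cross-attention can mix contextual information across the two expert streams, unlike a fixed weighted sum.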

Original language: English
Title of host publication: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
Editors: Bhaskar D. Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350368741
DOIs
State: Published - 2025
Event: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India
Duration: 6 Apr 2025 – 11 Apr 2025

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Country/Territory: India
City: Hyderabad
Period: 6/04/25 – 11/04/25

Keywords

  • code-switching
  • cross-attention
  • language bias
  • mixture-of-experts
  • speech recognition

