TY - JOUR
T1 - MMGER: Multi-Modal and Multi-Granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition
T2 - IEEE Signal Processing Letters
AU - Mu, Bingshen
AU - Wan, Xucheng
AU - Zheng, Naijun
AU - Zhou, Huan
AU - Xie, Lei
N1 - Publisher Copyright:
© 1994-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLMs), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER encounters challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and the multi-task learning framework for simultaneous ASR and accent recognition (AR) has effectively addressed multi-accent scenarios, making it a prominent solution. In this work, we propose a unified ASR-AR GER model, named MMGER, leveraging multi-modal and multi-granularity correction. Multi-task ASR-AR learning is employed to provide dynamic 1-best hypotheses and accent embeddings. Multi-modal correction accomplishes fine-grained frame-level correction by force-aligning the acoustic features of speech with the corresponding character-level 1-best hypothesis sequence. Multi-granularity correction supplements global linguistic information by incorporating regular 1-best hypotheses atop fine-grained multi-modal correction to achieve coarse-grained utterance-level correction. MMGER effectively mitigates the limitations of GER and tailors LLM-based ASR error correction to multi-accent scenarios. Experiments conducted on the multi-accent Mandarin KeSpeech dataset demonstrate the efficacy of MMGER, achieving a 26.72% relative improvement in AR accuracy and a 27.55% relative reduction in ASR character error rate, compared to a well-established standard baseline.
AB - Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLMs), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER encounters challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and the multi-task learning framework for simultaneous ASR and accent recognition (AR) has effectively addressed multi-accent scenarios, making it a prominent solution. In this work, we propose a unified ASR-AR GER model, named MMGER, leveraging multi-modal and multi-granularity correction. Multi-task ASR-AR learning is employed to provide dynamic 1-best hypotheses and accent embeddings. Multi-modal correction accomplishes fine-grained frame-level correction by force-aligning the acoustic features of speech with the corresponding character-level 1-best hypothesis sequence. Multi-granularity correction supplements global linguistic information by incorporating regular 1-best hypotheses atop fine-grained multi-modal correction to achieve coarse-grained utterance-level correction. MMGER effectively mitigates the limitations of GER and tailors LLM-based ASR error correction to multi-accent scenarios. Experiments conducted on the multi-accent Mandarin KeSpeech dataset demonstrate the efficacy of MMGER, achieving a 26.72% relative improvement in AR accuracy and a 27.55% relative reduction in ASR character error rate, compared to a well-established standard baseline.
KW - MMGER
KW - multi-granularity correction
KW - multi-modal correction
KW - multi-task ASR-AR learning
UR - http://www.scopus.com/inward/record.url?scp=85199550209&partnerID=8YFLogxK
U2 - 10.1109/LSP.2024.3432275
DO - 10.1109/LSP.2024.3432275
M3 - Article
AN - SCOPUS:85199550209
SN - 1070-9908
VL - 31
SP - 1940
EP - 1944
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -