A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie

Research output: Journal contribution › Conference article › Peer-reviewed

Abstract

Audio-LLMs introduce the audio modality into a large language model (LLM), enabling a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed hallucination and repetition issues in audio-LLMs, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM that introduces an ASR expert as a transcription tokenizer and a hybrid autoregressive (AR) / non-autoregressive (NAR) decoding approach to solve these problems. Experiments on the 10k-hour WenetSpeech Mandarin corpus show that our approach reduces CER by a relative 12.2% and 9.6% on the Test_Net and Test_Meeting evaluation sets compared with the baseline. Notably, we reduce the decoding repetition rate on the evaluation sets to zero, showing that the decoding repetition problem is fundamentally resolved.
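The abstract's core idea can be sketched in a few lines: an external ASR expert first produces a draft transcript, which is tokenized and prepended as a prompt to the audio-LLM's input, so the LLM conditions on a concrete hypothesis instead of decoding free-form (which is where hallucinations and repetitions arise). This is a minimal illustration, not the authors' implementation; all function names (`run_asr_expert`, `build_prompt`) and the `<transcript>`/`<audio>` marker tokens are hypothetical.

```python
# Hypothetical sketch of the transcription-prompt idea from the abstract:
# the ASR expert's 1-best hypothesis is wrapped in prompt tokens and
# prepended to the audio-LLM decoder input. Names and tokens are invented
# for illustration only.

def run_asr_expert(audio_features):
    """Stand-in for a trained ASR expert; returns a draft transcript."""
    # A real system would run e.g. a CTC/attention model here.
    return "hello world"

def build_prompt(draft_transcript, tokenize):
    """Wrap the draft transcript as a transcription prompt for the LLM."""
    prefix = tokenize("<transcript>")
    body = tokenize(draft_transcript)
    suffix = tokenize("</transcript> <audio>")
    return prefix + body + suffix

def demo_tokenize(text):
    # Toy whitespace tokenizer standing in for the LLM's real tokenizer.
    return text.split()

audio_features = [[0.0] * 80]  # placeholder log-mel frames
draft = run_asr_expert(audio_features)
prompt_tokens = build_prompt(draft, demo_tokenize)
# The audio-LLM would then decode conditioned on prompt_tokens together
# with the audio embeddings, correcting the expert's draft rather than
# generating unconstrained text.
```

Because the LLM sees an explicit draft, insertion-style repetition loops have a fixed-length anchor to terminate against, which is the intuition behind the zero repetition rate reported above.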

Original language: English
Pages (from-to): 1905-1909
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOI
Publication status: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 - 5 Sep 2024
