TY - JOUR
T1 - A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition
AU - Li, Yangze
AU - Wang, Xiong
AU - Cao, Songjun
AU - Zhang, Yike
AU - Ma, Long
AU - Xie, Lei
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Audio-LLM introduces audio modality into a large language model (LLM) to enable a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed the presence of illusions and repetition issues in audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM by introducing an ASR expert as a transcription tokenizer and a hybrid Autoregressive (AR) Nonautoregressive (NAR) decoding approach to solve the above problems. Experiments on 10k-hour WenetSpeech Mandarin corpus show that our approach decreases 12.2% and 9.6% CER relatively on Test Net and Test Meeting evaluation sets compared with baseline. Notably, we reduce the decoding repetition rate on the evaluation set to zero, showing that the decoding repetition problem has been solved fundamentally.
AB - Audio-LLM introduces audio modality into a large language model (LLM) to enable a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed the presence of illusions and repetition issues in audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM by introducing an ASR expert as a transcription tokenizer and a hybrid Autoregressive (AR) Nonautoregressive (NAR) decoding approach to solve the above problems. Experiments on 10k-hour WenetSpeech Mandarin corpus show that our approach decreases 12.2% and 9.6% CER relatively on Test Net and Test Meeting evaluation sets compared with baseline. Notably, we reduce the decoding repetition rate on the evaluation set to zero, showing that the decoding repetition problem has been solved fundamentally.
KW - audio-LLM
KW - decoding repetition
KW - hallucination of LLM
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85214838819&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-968
DO - 10.21437/Interspeech.2024-968
M3 - 会议文章
AN - SCOPUS:85214838819
SN - 2308-457X
SP - 1905
EP - 1909
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conferece 2024
Y2 - 1 September 2024 through 5 September 2024
ER -