A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie

Research output: Journal contribution › Conference article › Peer-reviewed

Abstract

Audio-LLMs introduce the audio modality into a large language model (LLM), enabling a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed hallucination and repetition issues in audio-LLMs, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM that introduces an ASR expert as a transcription tokenizer and a hybrid autoregressive (AR) / non-autoregressive (NAR) decoding approach to solve these problems. Experiments on the 10k-hour WenetSpeech Mandarin corpus show that our approach reduces CER by a relative 12.2% and 9.6% on the Test_Net and Test_Meeting evaluation sets compared with the baseline. Notably, we reduce the decoding repetition rate on the evaluation sets to zero, showing that the decoding repetition problem is fundamentally resolved.
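The abstract's core idea can be sketched in a few lines: an external ASR expert first produces a draft transcript, which is tokenized and prepended as a prompt to the audio-LLM's input, so the LLM conditions on a concrete hypothesis instead of decoding free-form (which is where hallucinations and repetitions arise). This is a minimal illustration, not the authors' implementation; all function names (`run_asr_expert`, `build_prompt`) and the `<transcript>`/`<audio>` marker tokens are hypothetical.

```python
# Hypothetical sketch of the transcription-prompt idea from the abstract:
# the ASR expert's 1-best hypothesis is wrapped in prompt tokens and
# prepended to the audio-LLM decoder input. Names and tokens are invented
# for illustration only.

def run_asr_expert(audio_features):
    """Stand-in for a trained ASR expert; returns a draft transcript."""
    # A real system would run e.g. a CTC/attention model here.
    return "hello world"

def build_prompt(draft_transcript, tokenize):
    """Wrap the draft transcript as a transcription prompt for the LLM."""
    prefix = tokenize("<transcript>")
    body = tokenize(draft_transcript)
    suffix = tokenize("</transcript> <audio>")
    return prefix + body + suffix

def demo_tokenize(text):
    # Toy whitespace tokenizer standing in for the LLM's real tokenizer.
    return text.split()

audio_features = [[0.0] * 80]  # placeholder log-mel frames
draft = run_asr_expert(audio_features)
prompt_tokens = build_prompt(draft, demo_tokenize)
# The audio-LLM would then decode conditioned on prompt_tokens together
# with the audio embeddings, correcting the expert's draft rather than
# generating unconstrained text.
```

Because the LLM sees an explicit draft, insertion-style repetition loops have a fixed-length anchor to terminate against, which is the intuition behind the zero repetition rate reported above.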

Original language: English
Pages (from-to): 1905-1909
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOI
Publication status: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 - 5 Sep 2024
