TY - JOUR
T1 - Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network
AU - Huang, Kaixun
AU - Zhang, Ao
AU - Yang, Zhanheng
AU - Guo, Pengcheng
AU - Mu, Bingshen
AU - Xu, Tianyi
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Contextual information plays a crucial role in speech recognition technologies, and incorporating it into end-to-end speech recognition models has recently drawn immense interest. However, previous deep biasing methods lacked explicit supervision for the biasing task. In this study, we introduce a contextual phrase prediction network for an attention-based deep biasing method. This network predicts the context phrases in an utterance using contextual embeddings and computes a bias loss to assist the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and the WER of the context phrases decreases relatively by 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation observed when using a larger biasing list.
AB - Contextual information plays a crucial role in speech recognition technologies, and incorporating it into end-to-end speech recognition models has recently drawn immense interest. However, previous deep biasing methods lacked explicit supervision for the biasing task. In this study, we introduce a contextual phrase prediction network for an attention-based deep biasing method. This network predicts the context phrases in an utterance using contextual embeddings and computes a bias loss to assist the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and the WER of the context phrases decreases relatively by 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation observed when using a larger biasing list.
KW - Contextual List Filter
KW - Deep Biasing
KW - End-to-end Speech Recognition
UR - http://www.scopus.com/inward/record.url?scp=85171554882&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-767
DO - 10.21437/Interspeech.2023-767
M3 - Conference article
AN - SCOPUS:85171554882
SN - 2308-457X
VL - 2023-August
SP - 4933
EP - 4937
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 24th International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -