Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network

  • Kaixun Huang
  • Ao Zhang
  • Zhanheng Yang
  • Pengcheng Guo
  • Bingshen Mu
  • Tianyi Xu
  • Lei Xie

Research output: Contribution to journal › Conference article › peer-review

26 Scopus citations

Abstract

Contextual information plays a crucial role in speech recognition, and incorporating it into end-to-end speech recognition models has drawn immense interest recently. However, previous deep bias methods lack explicit supervision for bias tasks. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network predicts the context phrases in an utterance using contextual embeddings and computes a bias loss to assist in training the contextualized model. Our method achieves significant word error rate (WER) reductions across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and that the WER on the context phrases decreases by 40.5% relative. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation when using a larger biasing list.
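The "relative WER improvement" quoted in the abstract is a ratio of error rates, not a difference in percentage points. The sketch below is one common way to compute WER (word-level edit distance divided by reference length) and a relative reduction. The function names and the example sentences are hypothetical illustrations and are not taken from the paper, and the paper's actual scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substituted word out of four reference words.
example_wer = word_error_rate("a b c d", "a x c d")
# Hypothetical WER values (in percent), chosen only to show the arithmetic.
example_reduction = relative_wer_reduction(10.0, 8.0)
```

Under this definition, a relative reduction of 0.121 means the new WER is 87.9% of the baseline WER, whatever the absolute values are.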

Original language: English
Pages (from-to): 4933-4937
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
State: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023

Keywords

  • Contextual List Filter
  • Deep Biasing
  • End-to-end Speech Recognition
