Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation

  • Zhennan Lin
  • Kaixun Huang
  • Wei Ren
  • Linju Yang
  • Lei Xie

Research output: Contribution to journal, Conference article, peer-review

Abstract

Deep biasing improves automatic speech recognition (ASR) performance by incorporating contextual phrases. However, most existing methods enhance the subwords of a contextual phrase as independent units, which can compromise the integrity of the phrase and reduce accuracy. In this paper, we propose an encoder-based, phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. We introduce architectural optimizations and integrate a bias loss to extend phrase-level predictions based on frame-level outputs. We also introduce a confidence-activated decoding method that ensures contextual phrases are output in full while suppressing incorrect bias. Experiments on the LibriSpeech and WenetSpeech datasets demonstrate that our approach achieves relative WER reductions of 28.31% and 23.49%, respectively, compared to the baseline, with the WER on contextual phrases decreasing relatively by 72.04% and 75.69%.
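The figures in the abstract are relative reductions, not absolute WER differences. A minimal sketch of how such a figure is conventionally computed follows; the function name and the input values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Computed as 100 * (baseline - system) / baseline, the usual
    definition of a relative reduction against a baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical WERs for illustration only; NOT results from the paper.
example = relative_wer_reduction(baseline_wer=10.0, system_wer=7.5)
print(f"relative WER reduction: {example:.2f}%")
```

Under this definition, a drop from 10.0% to 7.5% WER is a 2.5-point absolute reduction but a 25% relative one, which is why relative figures on contextual phrases can look much larger than the overall change.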

Original language: English
Pages (from-to): 3174-3178
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 – 21 Aug 2025

Keywords

  • contextualization
  • dynamic vocabulary prediction and activation
  • speech recognition
