
Two-Stage Contextual Word Filtering for Context Bias in Unified Streaming and Non-streaming Transducer

  • Zhanheng Yang
  • Sining Sun
  • Xiong Wang
  • Yike Zhang
  • Long Ma
  • Lei Xie
  • Northwestern Polytechnical University Xian
  • Tencent

Research output: Contribution to journal › Conference article › peer-review

7 Scopus citations

Abstract

It is difficult for an E2E ASR system to recognize words, such as entities, that appear infrequently in the training data. A widely used method to mitigate this issue is feeding contextual information into the acoustic model. Previous works have shown that a compact and accurate contextual list can boost performance significantly. In this paper, we propose an efficient approach to obtain a high-quality contextual list for a unified streaming/non-streaming E2E model. Specifically, we use the phone-level streaming output to first filter the predefined contextual word list, and then fuse the filtered list into the non-causal encoder and decoder to generate the final recognition results. Our approach improves the accuracy of the contextual ASR system and speeds up the inference process. Experiments on two datasets demonstrate over 20% CER reduction compared with the baseline system. Meanwhile, the RTF of our system remains stable within 0.15 when the size of the contextual word list grows beyond 6,000.
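
The first stage described in the abstract, filtering a large contextual word list with the phone-level streaming output, can be illustrated with a minimal sketch. The abstract does not specify the scoring used for filtering, so the sketch below swaps in a simple stand-in: a word is kept when its phone sequence approximately matches some contiguous span of the streaming phone hypothesis, measured by edit distance. The function names, the `lexicon` mapping, the phone labels, and the `max_error_rate` threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def min_substring_edit_distance(pattern: List[str], text: List[str]) -> int:
    """Smallest edit distance between `pattern` and any contiguous span of `text`."""
    # Row 0 is all zeros: a match may start at any position in `text`.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j - 1] + (p != t),  # match or substitution
                prev[j] + 1,             # pattern phone missing from the text
                curr[j - 1] + 1,         # extra phone in the text
            )
        prev = curr
    # A match may also end at any position in `text`.
    return min(prev)


def filter_context_list(
    streaming_phones: List[str],
    lexicon: Dict[str, List[str]],
    max_error_rate: float = 0.34,  # assumed threshold, not from the paper
) -> List[str]:
    """Keep words whose phones roughly appear in the streaming phone output."""
    kept = []
    for word, phones in lexicon.items():
        if not phones:
            continue
        dist = min_substring_edit_distance(phones, streaming_phones)
        if dist / len(phones) <= max_error_rate:
            kept.append(word)
    return kept


# Hypothetical streaming phone hypothesis and contextual word list.
streaming_phones = ["HH", "AH", "L", "OW", "D", "AH", "B", "L", "IH", "N"]
lexicon = {
    "dublin": ["D", "AH", "B", "L", "IH", "N"],
    "berlin": ["B", "ER", "L", "IH", "N"],
    "tokyo": ["T", "OW", "K", "Y", "OW"],
}
filtered = filter_context_list(streaming_phones, lexicon)
print(filtered)  # ['dublin', 'berlin']
```

In this toy example the near-miss "berlin" survives alongside "dublin" because the threshold tolerates one phone error in five, while "tokyo" is dropped. Only the shortened list would then be passed to the second, non-causal stage, which is what keeps the biasing cost from growing with the size of the original list.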

Original language: English
Pages (from-to): 3257-3261
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
State: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023

Keywords

  • RNN-T
  • Speech recognition
  • attention
  • context bias
  • context-aware training
  • transducer
