TY - JOUR
T1 - Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition
AU - Xu, Tianyi
AU - Yang, Zhanheng
AU - Huang, Kaixun
AU - Guo, Pengcheng
AU - Zhang, Ao
AU - Li, Biao
AU - Chen, Changru
AU - Li, Chao
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
PY - 2023
Y1 - 2023
N2 - By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on Librispeech and internal voice assistant datasets show that our approach can achieve up to 6.7% and 20.7% relative reduction in WER and CER compared to the baseline respectively, mitigating up to 96.7% and 84.9% of the relative WER and CER increase for common cases. Furthermore, our approach has a minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with negligible RTF increase.
AB - By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on Librispeech and internal voice assistant datasets show that our approach can achieve up to 6.7% and 20.7% relative reduction in WER and CER compared to the baseline respectively, mitigating up to 96.7% and 84.9% of the relative WER and CER increase for common cases. Furthermore, our approach has a minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with negligible RTF increase.
KW - Context-Aware Training
KW - Contextual List Filtering
KW - End-to-end Speech Recognition
KW - RNN-T
UR - http://www.scopus.com/inward/record.url?scp=85171564567&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-884
DO - 10.21437/Interspeech.2023-884
M3 - Conference article
AN - SCOPUS:85171564567
SN - 2308-457X
VL - 2023-August
SP - 1668
EP - 1672
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -