Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition

Tianyi Xu, Zhanheng Yang, Kaixun Huang, Pengcheng Guo, Ao Zhang, Biao Li, Changru Chen, Chao Li, Lei Xie

Research output: Contribution to journal › Conference article › peer-review


Abstract

By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing towards such personalized words with high prediction scores can significantly degrade the recognition of common words. To address this issue, we propose an adaptive contextual biasing method based on the Context-Aware Transformer Transducer (CATT) that uses the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. This prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on LibriSpeech and internal voice assistant datasets show that our approach achieves up to 6.7% and 20.7% relative reduction in WER and CER compared to the baseline, respectively, mitigating up to 96.7% and 84.9% of the relative WER and CER increase for common cases. Furthermore, our approach has minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with negligible RTF increase.
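The mechanism the abstract describes, predicting contextual-phrase occurrence from the biased encoder and predictor embeddings and using that prediction to toggle the bias list, can be illustrated with a short sketch. The PyTorch code below is not the authors' implementation: the module names (`BiasGate`, `step`), dimensions, and the 0.5 decision threshold are all illustrative assumptions.

```python
# A minimal sketch (not the paper's code) of the adaptive gating idea:
# a small classifier consumes the biased encoder output and the predictor
# embedding and predicts, per frame, whether a contextual phrase is
# occurring; its output switches the biasing on or off.

import torch
import torch.nn as nn


class BiasGate(nn.Module):
    """Streaming predictor of contextual-phrase occurrence (hypothetical)."""

    def __init__(self, enc_dim: int, pred_dim: int, hidden: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim + pred_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: is a bias phrase occurring?
        )

    def forward(self, biased_enc: torch.Tensor, pred_emb: torch.Tensor) -> torch.Tensor:
        # biased_enc: (B, T, enc_dim) biased encoder output for the chunk
        # pred_emb:   (B, T, pred_dim) predictor embedding aligned per frame
        return torch.sigmoid(self.proj(torch.cat([biased_enc, pred_emb], dim=-1)))


def step(gate: BiasGate, biased_enc, pred_emb, plain_enc, threshold: float = 0.5):
    """One streaming step: keep the biased path only where the gate fires."""
    p = gate(biased_enc, pred_emb)        # (B, T, 1) occurrence probability
    use_bias = (p > threshold).float()    # hard on/off switch per frame
    # Fall back to the unbiased encoder output on frames without bias
    # phrases, so common words are not degraded by spurious biasing.
    return use_bias * biased_enc + (1.0 - use_bias) * plain_enc
```

Because the gate depends only on states already computed for the current chunk, this kind of switching preserves the streaming property the abstract claims; a soft gate that weights the two paths by `p` directly would be an equally plausible variant.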

Original language: English
Pages (from-to): 1668-1672
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
State: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023

Keywords

  • Context-Aware Training
  • Contextual List Filtering
  • End-to-end Speech Recognition
  • RNN-T

