TY - JOUR
T1 - Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition
AU - Xu, Tianyi
AU - Yang, Zhanheng
AU - Huang, Kaixun
AU - Guo, Pengcheng
AU - Zhang, Ao
AU - Li, Biao
AU - Chen, Changru
AU - Li, Chao
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
PY - 2023
Y1 - 2023
N2 - By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on Librispeech and internal voice assistant datasets show that our approach can achieve up to 6.7% and 20.7% relative reduction in WER and CER compared to the baseline respectively, mitigating up to 96.7% and 84.9% of the relative WER and CER increase for common cases. Furthermore, our approach has a minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with negligible RTF increase.
AB - By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on Librispeech and internal voice assistant datasets show that our approach can achieve up to 6.7% and 20.7% relative reduction in WER and CER compared to the baseline respectively, mitigating up to 96.7% and 84.9% of the relative WER and CER increase for common cases. Furthermore, our approach has a minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with negligible RTF increase.
KW - Context-Aware Training
KW - Contextual List Filtering
KW - End-to-end Speech Recognition
KW - RNN-T
UR - http://www.scopus.com/inward/record.url?scp=85171564567&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-884
DO - 10.21437/Interspeech.2023-884
M3 - Conference article
AN - SCOPUS:85171564567
SN - 2308-457X
VL - 2023-August
SP - 1668
EP - 1672
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -