TY - JOUR
T1 - Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network
AU - Huang, Kaixun
AU - Zhang, Ao
AU - Yang, Zhanheng
AU - Guo, Pengcheng
AU - Mu, Bingshen
AU - Xu, Tianyi
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Contextual information plays a crucial role in speech recognition technologies, and incorporating it into end-to-end speech recognition models has recently drawn immense interest. However, previous deep biasing methods lacked explicit supervision for the biasing task. In this study, we introduce a contextual phrase prediction network for an attention-based deep biasing method. This network predicts the context phrases in an utterance using contextual embeddings and computes a bias loss to assist the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and the WER of the context phrases decreases relatively by 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation observed when using a larger biasing list.
AB - Contextual information plays a crucial role in speech recognition technologies, and incorporating it into end-to-end speech recognition models has recently drawn immense interest. However, previous deep biasing methods lacked explicit supervision for the biasing task. In this study, we introduce a contextual phrase prediction network for an attention-based deep biasing method. This network predicts the context phrases in an utterance using contextual embeddings and computes a bias loss to assist the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and the WER of the context phrases decreases relatively by 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation observed when using a larger biasing list.
KW - Contextual List Filter
KW - Deep Biasing
KW - End-to-end Speech Recognition
UR - http://www.scopus.com/inward/record.url?scp=85171554882&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-767
DO - 10.21437/Interspeech.2023-767
M3 - Conference article
AN - SCOPUS:85171554882
SN - 2308-457X
VL - 2023-August
SP - 4933
EP - 4937
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 24th International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -