WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit

Zhuoyuan Yao; Di Wu; Xiong Wang; Binbin Zhang; Fan Yu; Chao Yang; Zhendong Peng; Xiaoyu Chen; Lei Xie; Xin Lei

doi:10.21437/Interspeech.2021-1983


School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

143 Scopus citations

Abstract

In this paper, we propose an open source speech recognition toolkit called WeNet, in which a new two-pass approach named U2 is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between the research and deployment of E2E speech recognition models. WeNet provides an efficient way to ship automatic speech recognition (ASR) applications in real-world scenarios, which is its main difference from, and advantage over, other open source E2E speech recognition toolkits. We develop a hybrid connectionist temporal classification (CTC)/attention architecture with a transformer or conformer as the encoder and an attention decoder to rescore the CTC hypotheses. To achieve streaming and non-streaming in a unified model, we use a dynamic chunk-based attention strategy which allows the self-attention to focus on the right context with random length. Our experiments on the AISHELL-1 dataset show that our model achieves 5.03% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. After model quantization, our model achieves reasonable real-time factor (RTF) and latency at runtime. The toolkit is publicly available at https://github.com/mobvoi/wenet.
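
The chunk-based attention idea in the abstract can be illustrated with a small sketch. This is not taken from the WeNet code base: the function names, the boolean-mask representation, and the uniform sampling rule are assumptions made for illustration only. The sketch builds a mask in which each frame may attend to every frame up to the end of its own chunk, so a small chunk size behaves like streaming and a chunk size equal to the utterance length behaves like full (non-streaming) attention.

```python
import random


def chunk_attention_mask(num_frames, chunk_size):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Frame i sees all frames up to the end of the chunk it belongs to.
    A chunk_size <= 0 is treated here (by assumption) as full attention.
    """
    if chunk_size <= 0:
        return [[True] * num_frames for _ in range(num_frames)]
    mask = []
    for i in range(num_frames):
        # Index one past the last frame of the chunk containing frame i.
        visible_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < visible_end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, rng):
    """Hypothetical training-time rule: draw a chunk size uniformly
    from 1..num_frames. The paper's actual sampling scheme may differ."""
    return rng.randint(1, num_frames)


if __name__ == "__main__":
    rng = random.Random(0)
    size = sample_chunk_size(6, rng)
    for row in chunk_attention_mask(6, size):
        print("".join("x" if allowed else "." for allowed in row))
```

Varying the chunk size from batch to batch during training is what would let a single model be decoded later with either a small chunk (low latency) or the full utterance (best accuracy).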

Original language	English
Title of host publication	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher	International Speech Communication Association
Pages	2093-2097
Number of pages	5
ISBN (Electronic)	9781713836902
DOIs	https://doi.org/10.21437/Interspeech.2021-1983
State	Published - 2021
Event	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic
Duration	30 Aug 2021 → 3 Sep 2021

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	3
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory	Czech Republic
City	Brno
Period	30/08/21 → 3/09/21

Keywords

Production oriented
U2
WeNet

Access to Document

10.21437/Interspeech.2021-1983

Cite this

Yao, Z., Wu, D., Wang, X., Zhang, B., Yu, F., Yang, C., Peng, Z., Chen, X., Xie, L., & Lei, X. (2021). WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 (pp. 2093-2097). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 3). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2021-1983

Yao, Zhuoyuan ; Wu, Di ; Wang, Xiong et al. / WeNet : Production oriented streaming and non-streaming end-to-end speech recognition toolkit. 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 2093-2097 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH).

@inproceedings{b4be69a0a9d6497c9247607175c496ab,

title = "WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit",

abstract = "In this paper, we propose an open source speech recognition toolkit called WeNet, in which a new two-pass approach named U2 is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between the research and deployment of E2E speech recognition models. WeNet provides an efficient way to ship automatic speech recognition (ASR) applications in real-world scenarios, which is its main difference from, and advantage over, other open source E2E speech recognition toolkits. We develop a hybrid connectionist temporal classification (CTC)/attention architecture with a transformer or conformer as the encoder and an attention decoder to rescore the CTC hypotheses. To achieve streaming and non-streaming in a unified model, we use a dynamic chunk-based attention strategy which allows the self-attention to focus on the right context with random length. Our experiments on the AISHELL-1 dataset show that our model achieves 5.03% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. After model quantization, our model achieves reasonable real-time factor (RTF) and latency at runtime. The toolkit is publicly available at https://github.com/mobvoi/wenet.",

keywords = "Production oriented, U2, WeNet",

author = "Zhuoyuan Yao and Di Wu and Xiong Wang and Binbin Zhang and Fan Yu and Chao Yang and Zhendong Peng and Xiaoyu Chen and Lei Xie and Xin Lei",

note = "Publisher Copyright: Copyright {\textcopyright} 2021 ISCA.; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 ; Conference date: 30-08-2021 Through 03-09-2021",

year = "2021",

doi = "10.21437/Interspeech.2021-1983",

language = "English",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "2093--2097",

booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021",

}

Yao, Z, Wu, D, Wang, X, Zhang, B, Yu, F, Yang, C, Peng, Z, Chen, X, Xie, L & Lei, X 2021, WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 3, International Speech Communication Association, pp. 2093-2097, 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic, 30/08/21. https://doi.org/10.21437/Interspeech.2021-1983

WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. / Yao, Zhuoyuan; Wu, Di; Wang, Xiong et al.
22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. p. 2093-2097 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 3).


TY - GEN

T1 - WeNet

T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

AU - Yao, Zhuoyuan

AU - Wu, Di

AU - Wang, Xiong

AU - Zhang, Binbin

AU - Yu, Fan

AU - Yang, Chao

AU - Peng, Zhendong

AU - Chen, Xiaoyu

AU - Xie, Lei

AU - Lei, Xin

PY - 2021

Y1 - 2021

N2 - In this paper, we propose an open source speech recognition toolkit called WeNet, in which a new two-pass approach named U2 is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between the research and deployment of E2E speech recognition models. WeNet provides an efficient way to ship automatic speech recognition (ASR) applications in real-world scenarios, which is its main difference from, and advantage over, other open source E2E speech recognition toolkits. We develop a hybrid connectionist temporal classification (CTC)/attention architecture with a transformer or conformer as the encoder and an attention decoder to rescore the CTC hypotheses. To achieve streaming and non-streaming in a unified model, we use a dynamic chunk-based attention strategy which allows the self-attention to focus on the right context with random length. Our experiments on the AISHELL-1 dataset show that our model achieves 5.03% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. After model quantization, our model achieves reasonable real-time factor (RTF) and latency at runtime. The toolkit is publicly available at https://github.com/mobvoi/wenet.


KW - Production oriented

KW - U2

KW - WeNet

UR - http://www.scopus.com/inward/record.url?scp=85119212755&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2021-1983

DO - 10.21437/Interspeech.2021-1983

M3 - Conference contribution

AN - SCOPUS:85119212755

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 2093

EP - 2097

BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

PB - International Speech Communication Association

Y2 - 30 August 2021 through 3 September 2021

ER -

Yao Z, Wu D, Wang X, Zhang B, Yu F, Yang C et al. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association. 2021. p. 2093-2097. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). doi: 10.21437/Interspeech.2021-1983
