WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit

Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

144 Scopus citations

Abstract

In this paper, we propose an open source speech recognition toolkit called WeNet, in which a new two-pass approach named U2 is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between the research and deployment of E2E speech recognition models. WeNet provides an efficient way to ship automatic speech recognition (ASR) applications in real-world scenarios, which is the main difference and advantage to other open source E2E speech recognition toolkits. We develop a hybird connectionist temporal classification (CTC)/attention architecture with transformer or conformer as encoder and an attention decoder to rescore th CTC hypotheses. To achieve streaming and non-streaming in a unified model, we use a dynamic chunk-based attention strategy which allows the self-attention to focus on the right context with random length. Our experiments on the AISHELL-1 dataset show that our model achieves 5.03% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. After model quantification, our model achieves reasonable RTF and latency at runtime. The toolkit is publicly available at https://github.com/mobvoi/wenet.

Original languageEnglish
Title of host publication22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PublisherInternational Speech Communication Association
Pages2093-2097
Number of pages5
ISBN (Electronic)9781713836902
DOIs
StatePublished - 2021
Event22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic
Duration: 30 Aug 20213 Sep 2021

Publication series

NameProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume3
ISSN (Print)2308-457X
ISSN (Electronic)1990-9772

Conference

Conference22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/TerritoryCzech Republic
CityBrno
Period30/08/213/09/21

Keywords

  • Production oriented
  • U2
  • WeNet

Fingerprint

Dive into the research topics of 'WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit'. Together they form a unique fingerprint.

Cite this