WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit

Binbin Zhang; Di Wu; Zhendong Peng; Xingchen Song; Zhuoyuan Yao; Hang Lv; Lei Xie; Chao Yang; Fuping Pan; Jianwei Niu

doi:10.21437/Interspeech.2022-483

School of Computer Science

Research output: Contribution to journal › Conference article › peer-review

38 Scopus citations

Abstract

Recently, we made available WeNet [1], a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by a right-to-left attention decoder to improve the representative ability of the shared encoder and the performance during the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.
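The rescoring stage mentioned in update (1) can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the linear combination below, the weight names `ctc_weight` and `reverse_weight`, and the `Hypothesis` type are all assumptions made for illustration and are not taken from the WeNet code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One first-pass candidate with hypothetical log-probability scores."""
    text: str
    ctc_score: float  # first-pass (streaming) score
    l2r_score: float  # left-to-right attention decoder score
    r2l_score: float  # right-to-left attention decoder score


def rescore(hyps: List[Hypothesis],
            ctc_weight: float = 0.5,
            reverse_weight: float = 0.3) -> Hypothesis:
    """Pick the best hypothesis under an assumed linear score combination."""
    def combined(h: Hypothesis) -> float:
        # Interpolate the two decoder directions, then add the weighted
        # first-pass score. The interpolation form is an assumption.
        attention = ((1.0 - reverse_weight) * h.l2r_score
                     + reverse_weight * h.r2l_score)
        return attention + ctc_weight * h.ctc_score

    return max(hyps, key=combined)


candidates = [
    Hypothesis("call john", ctc_score=-1.0, l2r_score=-2.0, r2l_score=-6.0),
    Hypothesis("call joan", ctc_score=-1.0, l2r_score=-2.5, r2l_score=-2.0),
]
best_without_r2l = rescore(candidates, reverse_weight=0.0)
best_with_r2l = rescore(candidates, reverse_weight=0.3)
```

With `reverse_weight=0.0` only the left-to-right decoder contributes to the attention score. A non-zero weight lets the right-to-left decoder change the ranking, which is the intuition behind using future context during rescoring.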

Original language	English
Pages (from-to)	1661-1665
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2022-September
DOIs	https://doi.org/10.21437/Interspeech.2022-483
State	Published - 2022
Event	23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Republic of Korea
Duration	18 Sep 2022 → 22 Sep 2022

Keywords

Contextual Biasing
Language Model
Toolkit
U2++
UIO

Access to Document

10.21437/Interspeech.2022-483

Cite this

@article{43ace1ae7312452c819f0a1d32f13c1f,

title = "WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit",

abstract = "Recently, we made available WeNet [1], a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by a right-to-left attention decoder to improve the representative ability of the shared encoder and the performance during the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10\% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.",

keywords = "Contextual Biasing, Language Model, Toolkit, U2++, UIO",

author = "Binbin Zhang and Di Wu and Zhendong Peng and Xingchen Song and Zhuoyuan Yao and Hang Lv and Lei Xie and Chao Yang and Fuping Pan and Jianwei Niu",

note = "Publisher Copyright: Copyright {\textcopyright} 2022 ISCA.; 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 ; Conference date: 18-09-2022 Through 22-09-2022",

year = "2022",

doi = "10.21437/Interspeech.2022-483",

language = "English",

volume = "2022-September",

pages = "1661--1665",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit

T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022

AU - Zhang, Binbin

AU - Wu, Di

AU - Peng, Zhendong

AU - Song, Xingchen

AU - Yao, Zhuoyuan

AU - Lv, Hang

AU - Xie, Lei

AU - Yang, Chao

AU - Pan, Fuping

AU - Niu, Jianwei

PY - 2022

Y1 - 2022

N2 - Recently, we made available WeNet [1], a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by a right-to-left attention decoder to improve the representative ability of the shared encoder and the performance during the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.

AB - Recently, we made available WeNet [1], a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by a right-to-left attention decoder to improve the representative ability of the shared encoder and the performance during the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.

KW - Contextual Biasing

KW - Language Model

KW - Toolkit

KW - U2++

KW - UIO

UR - http://www.scopus.com/inward/record.url?scp=85140071056&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2022-483

DO - 10.21437/Interspeech.2022-483

M3 - Conference article

AN - SCOPUS:85140071056

SN - 2308-457X

VL - 2022-September

SP - 1661

EP - 1665

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Y2 - 18 September 2022 through 22 September 2022

ER -
