TY - JOUR
T1 - Streaming chunk-aware multihead attention for online end-to-end speech recognition
AU - Zhang, Shiliang
AU - Gao, Zhifu
AU - Luo, Haoneng
AU - Lei, Ming
AU - Gao, Jie
AU - Yan, Zhijie
AU - Xie, Lei
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has attracted increasing attention. Many efforts have been made to turn non-streaming attention-based E2E-ASR systems into streaming architectures. In this work, we propose a novel online E2E-ASR system that uses Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency-controlled, memory-equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. In SCAMA, a jointly trained predictor controls the encoder output fed to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 task and an industrial 20,000-hour Mandarin speech recognition task show that our approach significantly outperforms a MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, which is, to the best of our knowledge, the best published result for online ASR.
AB - Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has attracted increasing attention. Many efforts have been made to turn non-streaming attention-based E2E-ASR systems into streaming architectures. In this work, we propose a novel online E2E-ASR system that uses Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency-controlled, memory-equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. In SCAMA, a jointly trained predictor controls the encoder output fed to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 task and an industrial 20,000-hour Mandarin speech recognition task show that our approach significantly outperforms a MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, which is, to the best of our knowledge, the best published result for online ASR.
KW - Automatic Speech Recognition
KW - End-to-End
KW - LC-SAN-M
KW - Online ASR
KW - SCAMA
UR - http://www.scopus.com/inward/record.url?scp=85098125704&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1972
DO - 10.21437/Interspeech.2020-1972
M3 - Conference article
AN - SCOPUS:85098125704
SN - 2308-457X
VL - 2020-October
SP - 2142
EP - 2146
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -