Streaming chunk-aware multihead attention for online end-to-end speech recognition

  • Shiliang Zhang
  • Zhifu Gao
  • Haoneng Luo
  • Ming Lei
  • Jie Gao
  • Zhijie Yan
  • Lei Xie

Research output: Contribution to journal › Conference article › peer-review

15 Scopus citations

Abstract

Streaming end-to-end automatic speech recognition (E2E-ASR) has recently attracted growing attention, and much effort has gone into turning non-streaming attention-based E2E-ASR systems into streaming architectures. In this work, we propose a novel online E2E-ASR system that uses Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency-control memory-equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. In SCAMA, a jointly trained predictor controls the encoder output that is fed to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 task and an industrial-level 20,000-hour Mandarin speech recognition task show that our approach significantly outperforms the MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, which, to the best of our knowledge, is the best published performance for online ASR.
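For readers unfamiliar with the metric quoted above, CER is conventionally the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The following is a minimal sketch of that standard definition, not the authors' scoring script. The function names `edit_distance` and `cer` are illustrative, and any text normalization applied before scoring in the paper is not stated here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars
```

Under this definition, a CER of 7.39% corresponds to roughly 7 character errors per 100 reference characters, aggregated over the test set.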

Original language: English
Pages (from-to): 2142-2146
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2020-October
State: Published - 2020
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: 25 Oct 2020 → 29 Oct 2020

Keywords

  • Automatic Speech Recognition
  • End-to-End
  • LC-SAN-M
  • Online ASR
  • SCAMA
