Attention-based end-to-end speech recognition on voice search

Changhao Shan; Junbo Zhang; Yujun Wang; Lei Xie

doi:10.1109/ICASSP.2018.8462492

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

62 Scopus citations

Abstract

Recently, there has been growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of an attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder models to Mandarin speech recognition is quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. We use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. Combined with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.
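
For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how CER and SER are commonly computed. It assumes the usual definitions (CER as character-level edit distance divided by the total number of reference characters, SER as the fraction of utterances containing at least one error). The function names and the sample utterances are hypothetical and are not taken from the paper or its evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer_and_ser(refs, hyps):
    """Return (CER, SER) over paired reference and hypothesis transcripts."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    wrong_sentences = sum(1 for r, h in zip(refs, hyps) if r != h)
    return total_errors / total_chars, wrong_sentences / len(refs)


# Hypothetical voice-search utterances, for illustration only.
refs = ["打开电视", "播放音乐"]
hyps = ["打开电视", "播放音月"]
cer, ser = cer_and_ser(refs, hyps)
print(f"CER = {cer:.3f}, SER = {ser:.3f}")
```

Because Mandarin is scored per character here, a single wrong character changes CER only slightly but marks the whole utterance as wrong for SER, which is one reason SER is the higher of the two reported figures.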

Original language	English
Title of host publication	2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	4764-4768
Number of pages	5
ISBN (Print)	9781538646588
DOIs	https://doi.org/10.1109/ICASSP.2018.8462492
State	Published - 10 Sep 2018
Event	2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Calgary, Canada Duration: 15 Apr 2018 → 20 Apr 2018

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2018-April
ISSN (Print)	1520-6149

Conference

Conference	2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018
Country/Territory	Canada
City	Calgary
Period	15/04/18 → 20/04/18

Keywords

Attention model
Automatic speech recognition
End-to-end speech recognition
Voice search

Access to Document

10.1109/ICASSP.2018.8462492

Cite this

Shan, C., Zhang, J., Wang, Y., & Xie, L. (2018). Attention-based end-to-end speech recognition on voice search. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings (pp. 4764-4768). Article 8462492 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2018-April). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2018.8462492

Shan, Changhao ; Zhang, Junbo ; Wang, Yujun et al. / Attention-based end-to-end speech recognition on voice search. 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2018. pp. 4764-4768 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{d7a3166cf2944f7eb5384ccb1646e7cd,

title = "Attention-based end-to-end speech recognition on voice search",

abstract = "Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. While together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.",

keywords = "Attention model, Automatic speech recognition, End-to-end speech recognition, Voice search",

author = "Changhao Shan and Junbo Zhang and Yujun Wang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2018 IEEE.; 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 ; Conference date: 15-04-2018 Through 20-04-2018",

year = "2018",

month = sep,

day = "10",

doi = "10.1109/ICASSP.2018.8462492",

language = "English",

isbn = "9781538646588",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "4764--4768",

booktitle = "2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings",

}

Shan, C, Zhang, J, Wang, Y & Xie, L 2018, Attention-based end-to-end speech recognition on voice search. in 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings., 8462492, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, Institute of Electrical and Electronics Engineers Inc., pp. 4764-4768, 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018, Calgary, Canada, 15/04/18. https://doi.org/10.1109/ICASSP.2018.8462492

Attention-based end-to-end speech recognition on voice search. / Shan, Changhao; Zhang, Junbo; Wang, Yujun et al.
2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2018. p. 4764-4768 8462492 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2018-April).

TY - GEN

T1 - Attention-based end-to-end speech recognition on voice search

AU - Shan, Changhao

AU - Zhang, Junbo

AU - Wang, Yujun

AU - Xie, Lei

PY - 2018/9/10

Y1 - 2018/9/10

N2 - Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. While together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.

AB - Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. While together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.

KW - Attention model

KW - Automatic speech recognition

KW - End-to-end speech recognition

KW - Voice search

UR - http://www.scopus.com/inward/record.url?scp=85054248254&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2018.8462492

DO - 10.1109/ICASSP.2018.8462492

M3 - Conference contribution

AN - SCOPUS:85054248254

SN - 9781538646588

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 4764

EP - 4768

BT - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018

Y2 - 15 April 2018 through 20 April 2018

ER -

Shan C, Zhang J, Wang Y, Xie L. Attention-based end-to-end speech recognition on voice search. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2018. p. 4764-4768. 8462492. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP.2018.8462492
