Attention-based end-to-end speech recognition on voice search

Changhao Shan; Junbo Zhang; Yujun Wang; Lei Xie

doi:10.1109/ICASSP.2018.8462492

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

62 Citations (Scopus)

Abstract

Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. While together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.
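The abstract reports results as character error rate (CER) and sentence error rate (SER). The record does not say how the authors scored their system, so the following is only a minimal sketch of the conventional definitions: CER as total character-level edit distance divided by total reference characters, and SER as the fraction of utterances with at least one error. The function names and the sample utterances are hypothetical and are not taken from the paper or the MiTV dataset.

```python
from typing import Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference symbols to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer_and_ser(refs: Sequence[str], hyps: Sequence[str]) -> Tuple[float, float]:
    """Return (CER, SER) over paired reference and hypothesis transcripts.

    Assumed definitions (not confirmed by the source):
      CER = sum of character edit distances / sum of reference lengths
      SER = utterances with any error / number of utterances
    """
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_edits = 0
    total_chars = 0
    wrong_sentences = 0
    for ref, hyp in zip(refs, hyps):
        dist = edit_distance(ref, hyp)
        total_edits += dist
        total_chars += len(ref)
        wrong_sentences += 1 if dist > 0 else 0
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars, wrong_sentences / len(refs)


# Hypothetical voice-search utterances, for illustration only.
references = ["打开电视", "播放音乐"]
hypotheses = ["打开电视", "播放音月"]
cer, ser = cer_and_ser(references, hypotheses)
print(f"CER = {cer:.3f}, SER = {ser:.3f}")
```

Scoring at the character level avoids the need for word segmentation in Mandarin, which is one common reason CER rather than word error rate is reported for such tasks.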

Original language	English
Title of host publication	2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	4764-4768
Number of pages	5
ISBN (Print)	9781538646588
DOI	https://doi.org/10.1109/ICASSP.2018.8462492
Publication status	Published - 10 Sep 2018
Event	2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Calgary, Canada. Duration: 15 Apr 2018 → 20 Apr 2018

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2018-April
ISSN (Print)	1520-6149

Conference

Conference	2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018
Country/Territory	Canada
City	Calgary
Period	15/04/18 → 20/04/18

Access to document

10.1109/ICASSP.2018.8462492

Cite this

Shan, C., Zhang, J., Wang, Y., & Xie, L. (2018). Attention-based end-to-end speech recognition on voice search. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings (pp. 4764-4768). Article 8462492 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2018-April). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2018.8462492

Shan, Changhao ; Zhang, Junbo ; Wang, Yujun et al. / Attention-based end-to-end speech recognition on voice search. 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2018. pp. 4764-4768 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{d7a3166cf2944f7eb5384ccb1646e7cd,

title = "Attention-based end-to-end speech recognition on voice search",

abstract = "Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. While together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.",

keywords = "Attention model, Automatic speech recognition, End-to-end speech recognition, Voice search",

author = "Changhao Shan and Junbo Zhang and Yujun Wang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2018 IEEE.; 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 ; Conference date: 15-04-2018 Through 20-04-2018",

year = "2018",

month = sep,

day = "10",

doi = "10.1109/ICASSP.2018.8462492",

language = "English",

isbn = "9781538646588",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "4764--4768",

booktitle = "2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings",

}

Shan, C, Zhang, J, Wang, Y & Xie, L 2018, Attention-based end-to-end speech recognition on voice search. in 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings., 8462492, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, Institute of Electrical and Electronics Engineers Inc., pp. 4764-4768, 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018, Calgary, Canada, 15/04/18. https://doi.org/10.1109/ICASSP.2018.8462492

Attention-based end-to-end speech recognition on voice search. / Shan, Changhao; Zhang, Junbo; Wang, Yujun et al.
2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2018. pp. 4764-4768 8462492 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2018-April).

TY - GEN

T1 - Attention-based end-to-end speech recognition on voice search

AU - Shan, Changhao

AU - Zhang, Junbo

AU - Wang, Yujun

AU - Xie, Lei

PY - 2018/9/10

Y1 - 2018/9/10

N2 - Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. While together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.

AB - Recently, there has been a growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder to Mandarin speech recognition was quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. In this paper, we use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to finally achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. While together with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.

KW - Attention model

KW - Automatic speech recognition

KW - End-to-end speech recognition

KW - Voice search

UR - http://www.scopus.com/inward/record.url?scp=85054248254&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2018.8462492

DO - 10.1109/ICASSP.2018.8462492

M3 - Conference contribution

AN - SCOPUS:85054248254

SN - 9781538646588

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 4764

EP - 4768

BT - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018

Y2 - 15 April 2018 through 20 April 2018

ER -

Shan C, Zhang J, Wang Y, Xie L. Attention-based end-to-end speech recognition on voice search. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2018. p. 4764-4768. 8462492. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP.2018.8462492
