TY - JOUR
T1 - Attention-based end-to-end models for small-footprint keyword spotting
AU - Shan, Changhao
AU - Zhang, Junbo
AU - Wang, Yujun
AU - Xie, Lei
N1 - Publisher Copyright:
© 2018 International Speech Communication Association. All rights reserved.
PY - 2018
Y1 - 2018
N2 - In this paper, we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS), which aims to simplify the pipeline of building a production-quality KWS system. Our model consists of an encoder and an attention mechanism. Using RNNs, the encoder transforms the input signal into a high-level representation. The attention mechanism then weights the encoder features and generates a fixed-length vector. Finally, a linear transformation and a softmax function convert the vector into a score used for keyword detection. We also evaluate the performance of different encoder architectures, including LSTM, GRU, and CRNN. Experiments on wake-up data show that our approach outperforms the recent Deep KWS approach [9] by a large margin, with the best performance achieved by CRNN. Specifically, with ∼84K parameters, our attention-based model achieves a 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.
AB - In this paper, we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS), which aims to simplify the pipeline of building a production-quality KWS system. Our model consists of an encoder and an attention mechanism. Using RNNs, the encoder transforms the input signal into a high-level representation. The attention mechanism then weights the encoder features and generates a fixed-length vector. Finally, a linear transformation and a softmax function convert the vector into a score used for keyword detection. We also evaluate the performance of different encoder architectures, including LSTM, GRU, and CRNN. Experiments on wake-up data show that our approach outperforms the recent Deep KWS approach [9] by a large margin, with the best performance achieved by CRNN. Specifically, with ∼84K parameters, our attention-based model achieves a 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.
KW - Attention-based model
KW - Convolutional neural networks
KW - End-to-end keyword spotting
KW - Recurrent neural networks
UR - http://www.scopus.com/inward/record.url?scp=85055000616&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-1777
DO - 10.21437/Interspeech.2018-1777
M3 - Conference article
AN - SCOPUS:85055000616
SN - 2308-457X
VL - 2018-September
SP - 2037
EP - 2041
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018
Y2 - 2 September 2018 through 6 September 2018
ER -