Attention-based end-to-end models for small-footprint keyword spotting

Changhao Shan; Junbo Zhang; Yujun Wang; Lei Xie

doi:10.21437/Interspeech.2018-1777

School of Computer Science

Research output: Contribution to journal › Conference article › peer-review

59 Scopus citations

Abstract

In this paper, we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS), which aims to simplify the pipelines of building a production-quality KWS system. Our model consists of an encoder and an attention mechanism. Using RNNs, the encoder transforms the input signal into a high level representation. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection. We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN. Experiments on wake-up data show that our approach outperforms the recent Deep KWS approach [9] by a large margin and the best performance is achieved by CRNN. To be more specific, with ∼84K parameters, our attention-based model achieves 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.
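The pipeline described in the abstract (encoder features, attention weighting into a fixed-length vector, then a linear layer and softmax) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the dot-product scoring, the vector `v`, the toy `frames` standing in for encoder outputs, and the convention that class 1 is the keyword are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, v):
    """Collapse T encoder frames (each a list of D floats) into one D-dim vector.

    Each frame is scored by a dot product with v (an illustrative choice,
    not necessarily the scoring function used in the paper).
    """
    scores = [sum(h_i * v_i for h_i, v_i in zip(h, v)) for h in frames]
    alphas = softmax(scores)  # attention weights, one per frame, summing to 1
    dim = len(frames[0])
    context = [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]
    return context, alphas

def keyword_score(context, weights, bias):
    """Linear layer followed by softmax; returns the probability of class 1,
    which is assumed here to be the keyword class."""
    logits = [sum(w_i * c_i for w_i, c_i in zip(row, context)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)[1]

# Toy values standing in for encoder outputs (T=3 frames, D=2 features).
frames = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.8]]
v = [1.0, 0.5]  # hypothetical attention parameter vector
context, alphas = attention_pool(frames, v)
score = keyword_score(context, weights=[[-1.0, -1.0], [1.0, 1.0]], bias=[0.0, 0.0])
```

However many frames the encoder emits, `context` keeps the feature dimension D, which is what lets a single linear layer produce the detection score.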

Original language	English
Pages (from-to)	2037-2041
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2018-September
DOIs	https://doi.org/10.21437/Interspeech.2018-1777
State	Published - 2018
Event	19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India. Duration: 2 Sep 2018 → 6 Sep 2018

Keywords

Attention-based model
Convolutional neural networks
End-to-end keyword spotting
Recurrent neural networks

Access to Document

10.21437/Interspeech.2018-1777

Cite this

@article{064bb2c34fc94fc18c54fdc2b07c1f87,

title = "Attention-based end-to-end models for small-footprint keyword spotting",

abstract = "In this paper, we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS), which aims to simplify the pipelines of building a production-quality KWS system. Our model consists of an encoder and an attention mechanism. Using RNNs, the encoder transforms the input signal into a high level representation. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection. We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN. Experiments on wake-up data show that our approach outperforms the recent Deep KWS approach [9] by a large margin and the best performance is achieved by CRNN. To be more specific, with ∼84K parameters, our attention-based model achieves 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.",

keywords = "Attention-based model, Convolutional neural networks, End-to-end keyword spotting, Recurrent neural networks",

author = "Changhao Shan and Junbo Zhang and Yujun Wang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2018 International Speech Communication Association. All rights reserved.; 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 ; Conference date: 02-09-2018 Through 06-09-2018",

year = "2018",

doi = "10.21437/Interspeech.2018-1777",

language = "English",

volume = "2018-September",

pages = "2037--2041",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Attention-based end-to-end models for small-footprint keyword spotting

AU - Shan, Changhao

AU - Zhang, Junbo

AU - Wang, Yujun

AU - Xie, Lei

PY - 2018

Y1 - 2018

N2 - In this paper, we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS), which aims to simplify the pipelines of building a production-quality KWS system. Our model consists of an encoder and an attention mechanism. Using RNNs, the encoder transforms the input signal into a high level representation. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection. We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN. Experiments on wake-up data show that our approach outperforms the recent Deep KWS approach [9] by a large margin and the best performance is achieved by CRNN. To be more specific, with ∼84K parameters, our attention-based model achieves 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.

AB - In this paper, we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS), which aims to simplify the pipelines of building a production-quality KWS system. Our model consists of an encoder and an attention mechanism. Using RNNs, the encoder transforms the input signal into a high level representation. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection. We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN. Experiments on wake-up data show that our approach outperforms the recent Deep KWS approach [9] by a large margin and the best performance is achieved by CRNN. To be more specific, with ∼84K parameters, our attention-based model achieves 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.

KW - Attention-based model

KW - Convolutional neural networks

KW - End-to-end keyword spotting

KW - Recurrent neural networks

UR - http://www.scopus.com/inward/record.url?scp=85055000616&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2018-1777

DO - 10.21437/Interspeech.2018-1777

M3 - Conference article

AN - SCOPUS:85055000616

SN - 2308-457X

VL - 2018-September

SP - 2037

EP - 2041

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018

Y2 - 2 September 2018 through 6 September 2018

ER -
