TY - JOUR
T1 - CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework Based on Cascaded Transducer-Transformer
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
AU - Yang, Zhanheng
AU - Sun, Sining
AU - Li, Jin
AU - Zhang, Xiaoming
AU - Wang, Xiong
AU - Ma, Long
AU - Xie, Lei
N1 - Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - Customized keyword spotting (KWS) has great potential to be deployed on edge devices for a hands-free user experience. However, in real applications, false alarms (FAs) are a serious problem when spotting dozens or even hundreds of keywords, drastically affecting user experience. To solve this problem, in this paper we leverage recent advances in transducer- and transformer-based acoustic models and propose a new multi-stage customized KWS framework named Cascaded Transducer-Transformer KWS (CaTT-KWS), which includes a transducer-based keyword detector, a force-alignment module built on a frame-level phone predictor, and a transformer-based decoder. Specifically, the streaming transducer module spots keyword candidates in the audio stream. Force alignment is then performed using the phone posteriors predicted by the phone predictor to complete the first-stage keyword verification and refine the time boundaries of the keyword. Finally, the transformer decoder further verifies the triggered keyword. The proposed CaTT-KWS framework reduces the FA rate effectively without noticeably hurting keyword recognition accuracy. Specifically, it achieves an impressively low 0.13 FAs per hour on a challenging dataset, a relative reduction of over 90% in FAs compared to the transducer-based detection model, while keyword recognition accuracy drops by less than 2%.
AB - Customized keyword spotting (KWS) has great potential to be deployed on edge devices for a hands-free user experience. However, in real applications, false alarms (FAs) are a serious problem when spotting dozens or even hundreds of keywords, drastically affecting user experience. To solve this problem, in this paper we leverage recent advances in transducer- and transformer-based acoustic models and propose a new multi-stage customized KWS framework named Cascaded Transducer-Transformer KWS (CaTT-KWS), which includes a transducer-based keyword detector, a force-alignment module built on a frame-level phone predictor, and a transformer-based decoder. Specifically, the streaming transducer module spots keyword candidates in the audio stream. Force alignment is then performed using the phone posteriors predicted by the phone predictor to complete the first-stage keyword verification and refine the time boundaries of the keyword. Finally, the transformer decoder further verifies the triggered keyword. The proposed CaTT-KWS framework reduces the FA rate effectively without noticeably hurting keyword recognition accuracy. Specifically, it achieves an impressively low 0.13 FAs per hour on a challenging dataset, a relative reduction of over 90% in FAs compared to the transducer-based detection model, while keyword recognition accuracy drops by less than 2%.
KW - Customized Keyword Spotting
KW - Multi-stage detection
KW - Multi-task learning
KW - Transducer
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85140051402&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-10258
DO - 10.21437/Interspeech.2022-10258
M3 - Conference article
AN - SCOPUS:85140051402
SN - 2308-457X
VL - 2022-September
SP - 1681
EP - 1685
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 18 September 2022 through 22 September 2022
ER -