TY - GEN
T1 - Leveraging Synthetic Speech for CIF-Based Customized Keyword Spotting
AU - Liu, Shuiyun
AU - Zhang, Ao
AU - Huang, Kaixun
AU - Xie, Lei
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024
Y1 - 2024
N2 - Customized keyword spotting aims to detect user-defined keywords in continuous speech, providing flexibility and personalization. Previous research mainly relied on similarity calculations between keyword text and acoustic features. However, due to the gap between the two modalities, it is challenging to obtain alignment information and to model their correlation. In this paper, we propose a novel method to address these issues. First, we introduce a text-to-speech (TTS) module to generate audio for the keywords, effectively addressing the cross-modal challenge of text-based customized keyword spotting. Furthermore, we employ the Continuous Integrate-and-Fire (CIF) mechanism for boundary prediction to obtain token-level acoustic representations of keywords, thus solving the keyword-speech alignment problem. Experimental results on the Aishell-1 dataset demonstrate the effectiveness of the proposed method: it significantly outperforms both the baseline method and the Dynamic Sequence Partitioning (DSP) method in keyword spotting accuracy. With the false accept rate fixed at 0.02, our model achieves a 72.7% relative improvement in wake-up rate over the DSP method and a 64% improvement over the baseline model.
AB - Customized keyword spotting aims to detect user-defined keywords in continuous speech, providing flexibility and personalization. Previous research mainly relied on similarity calculations between keyword text and acoustic features. However, due to the gap between the two modalities, it is challenging to obtain alignment information and to model their correlation. In this paper, we propose a novel method to address these issues. First, we introduce a text-to-speech (TTS) module to generate audio for the keywords, effectively addressing the cross-modal challenge of text-based customized keyword spotting. Furthermore, we employ the Continuous Integrate-and-Fire (CIF) mechanism for boundary prediction to obtain token-level acoustic representations of keywords, thus solving the keyword-speech alignment problem. Experimental results on the Aishell-1 dataset demonstrate the effectiveness of the proposed method: it significantly outperforms both the baseline method and the Dynamic Sequence Partitioning (DSP) method in keyword spotting accuracy. With the false accept rate fixed at 0.02, our model achieves a 72.7% relative improvement in wake-up rate over the DSP method and a 64% improvement over the baseline model.
KW - Continuous Integrate-and-Fire
KW - Keyword spotting
KW - Speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85186639441&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-0601-3_31
DO - 10.1007/978-981-97-0601-3_31
M3 - Conference contribution
AN - SCOPUS:85186639441
SN - 9789819706006
T3 - Communications in Computer and Information Science
SP - 354
EP - 365
BT - Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings
A2 - Jia, Jia
A2 - Ling, Zhenhua
A2 - Chen, Xie
A2 - Li, Ya
A2 - Zhang, Zixing
PB - Springer Science and Business Media Deutschland GmbH
T2 - 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023
Y2 - 8 December 2023 through 11 December 2023
ER -