TY - GEN
T1 - Leveraging Synthetic Speech for CIF-Based Customized Keyword Spotting
AU - Liu, Shuiyun
AU - Zhang, Ao
AU - Huang, Kaixun
AU - Xie, Lei
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024
Y1 - 2024
N2 - Customized keyword spotting aims to detect user-defined keywords in continuous speech, providing flexibility and personalization. Previous research mainly relied on similarity calculations between keyword text and acoustic features. However, due to the gap between the two modalities, it is challenging to obtain alignment information and to model their correlation. In this paper, we propose a novel method to address these issues. First, we introduce a text-to-speech (TTS) module to generate audio for the keywords, effectively addressing the cross-modal challenge of text-based customized keyword spotting. Furthermore, we employ the Continuous Integrate-and-Fire (CIF) mechanism for boundary prediction to obtain token-level acoustic representations of keywords, thus solving the keyword-speech alignment problem. Experimental results on the Aishell-1 dataset demonstrate the effectiveness of the proposed method: it significantly outperforms both the baseline method and the Dynamic Sequence Partitioning (DSP) method in keyword spotting accuracy. With the false accept rate fixed at 0.02, our model achieves a 72.7% relative improvement in wake-up rate over the DSP method and a 64% improvement over the baseline model.
AB - Customized keyword spotting aims to detect user-defined keywords in continuous speech, providing flexibility and personalization. Previous research mainly relied on similarity calculations between keyword text and acoustic features. However, due to the gap between the two modalities, it is challenging to obtain alignment information and to model their correlation. In this paper, we propose a novel method to address these issues. First, we introduce a text-to-speech (TTS) module to generate audio for the keywords, effectively addressing the cross-modal challenge of text-based customized keyword spotting. Furthermore, we employ the Continuous Integrate-and-Fire (CIF) mechanism for boundary prediction to obtain token-level acoustic representations of keywords, thus solving the keyword-speech alignment problem. Experimental results on the Aishell-1 dataset demonstrate the effectiveness of the proposed method: it significantly outperforms both the baseline method and the Dynamic Sequence Partitioning (DSP) method in keyword spotting accuracy. With the false accept rate fixed at 0.02, our model achieves a 72.7% relative improvement in wake-up rate over the DSP method and a 64% improvement over the baseline model.
KW - Continuous Integrate-and-Fire
KW - Keyword spotting
KW - Speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85186639441&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-0601-3_31
DO - 10.1007/978-981-97-0601-3_31
M3 - Conference contribution
AN - SCOPUS:85186639441
SN - 9789819706006
T3 - Communications in Computer and Information Science
SP - 354
EP - 365
BT - Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings
A2 - Jia, Jia
A2 - Ling, Zhenhua
A2 - Chen, Xie
A2 - Li, Ya
A2 - Zhang, Zixing
PB - Springer Science and Business Media Deutschland GmbH
T2 - 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023
Y2 - 8 December 2023 through 11 December 2023
ER -