TY - JOUR
T1 - Two-stage streaming keyword detection and localization with multi-scale depthwise temporal convolution
AU - Hou, Jingyong
AU - Xie, Lei
AU - Zhang, Shilei
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/6
Y1 - 2022/6
AB - A keyword spotting (KWS) system running on smart devices should accurately detect the appearances and predict the locations of predefined keywords in audio streams, with a small footprint and high efficiency. To this end, this paper proposes a new two-stage KWS method that combines a novel multi-scale depthwise temporal convolution (MDTC) feature extractor with a two-stage keyword detection and localization module. The MDTC feature extractor efficiently learns multi-scale feature representations with dilated depthwise temporal convolutions, modeling both temporal context and speech-rate variation. We use a region proposal network (RPN) as the first-stage KWS. At each frame, we design multiple time regions that all take the current frame as the end position but have different start positions. These time regions (formally, anchors) indicate rough location candidates of keywords. Taking frame-level features from the MDTC feature extractor as inputs, the RPN learns to generate keyword region proposals based on the designed anchors. To alleviate the keyword/non-keyword class imbalance problem, we introduce a hard example mining algorithm to select effective negative anchors in RPN training. The keyword region proposals from the first-stage RPN contain keyword location information, which is subsequently used to explicitly extract keyword-related sequential features for training the second-stage KWS. The second-stage system learns to classify region proposals into keyword IDs and to regress them toward the ground-truth keyword regions. Experiments on the Google Speech Commands dataset show that the proposed MDTC feature extractor surpasses several competitive feature extractors with a new state-of-the-art command classification error rate of 1.74%. With the MDTC feature extractor, we further conduct wake-up word (WuW) detection and localization experiments on a commercial WuW dataset. Compared to a strong baseline, our proposed two-stage method achieves a 27–32% relative reduction in false rejection rate at one false alarm per hour, while for keyword localization the two-stage approach achieves a mean intersection-over-union above 0.95, clearly better than the one-stage RPN method.
KW - Keyword spotting
KW - Multi-scale
KW - Temporal convolution
KW - Two-stage
KW - Wake-up word detection and localization
UR - http://www.scopus.com/inward/record.url?scp=85126617293&partnerID=8YFLogxK
U2 - 10.1016/j.neunet.2022.03.003
DO - 10.1016/j.neunet.2022.03.003
M3 - Article
C2 - 35303660
AN - SCOPUS:85126617293
SN - 0893-6080
VL - 150
SP - 28
EP - 42
JO - Neural Networks
JF - Neural Networks
ER -