Unsupervised Bottleneck features for low-resource query-by-example spoken term detection

Hongjie Chen; Cheung Chi Leung; Lei Xie; Bin Ma; Haizhou Li

doi:10.21437/Interspeech.2016-313

Unsupervised Bottleneck features for low-resource query-by-example spoken term detection

Hongjie Chen, Cheung Chi Leung, Lei Xie, Bin Ma, Haizhou Li

School of Computer Science

Research output: Contribution to journal › Conference article › peer-review

40 Scopus citations

Abstract

We propose a framework which ports Dirichlet Gaussian mixture model (DPGMM) based labels to deep neural network (DNN). The DNN trained using the unsupervised labels is used to extract a low-dimensional unsupervised speech representation, named as unsupervised bottleneck features (uBNFs), which capture considerable information for sound cluster discrimination. We investigate the performance of uBNF in queryby-example spoken term detection (QbE-STD) on the TIMIT English speech corpus. Our uBNF performs comparably with the cross-lingual bottleneck features (BNFs) extracted from a DNN trained using 171 hours of transcribed telephone speech in another language (Mandarin Chinese). With the score fusion of uBNFs and cross-lingual BNFs, we gain about 10% relative improvement in terms of mean average precision (MAP) comparing with the cross-lingual BNFs. We also study the performance of the framework with different input features and different lengths of temporal context.

Original language	English
Pages (from-to)	923-927
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	08-12-September-2016
DOIs	https://doi.org/10.21437/Interspeech.2016-313
State	Published - 2016
Event	17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 - San Francisco, United States Duration: 8 Sep 2016 → 16 Sep 2016

Keywords

Bottleneck feature
Dirichlet process Gaussian mixture model
Low-resource speech processing
Spoken term detection
Unsupervised feature learning

Access to Document

10.21437/Interspeech.2016-313

Cite this

@article{e27098812def441c8997ff18c27ceae7,

title = "Unsupervised Bottleneck features for low-resource query-by-example spoken term detection",

abstract = "We propose a framework which ports Dirichlet Gaussian mixture model (DPGMM) based labels to deep neural network (DNN). The DNN trained using the unsupervised labels is used to extract a low-dimensional unsupervised speech representation, named as unsupervised bottleneck features (uBNFs), which capture considerable information for sound cluster discrimination. We investigate the performance of uBNF in queryby-example spoken term detection (QbE-STD) on the TIMIT English speech corpus. Our uBNF performs comparably with the cross-lingual bottleneck features (BNFs) extracted from a DNN trained using 171 hours of transcribed telephone speech in another language (Mandarin Chinese). With the score fusion of uBNFs and cross-lingual BNFs, we gain about 10% relative improvement in terms of mean average precision (MAP) comparing with the cross-lingual BNFs. We also study the performance of the framework with different input features and different lengths of temporal context.",

keywords = "Bottleneck feature, Dirichlet process Gaussian mixture model, Low-resource speech processing, Spoken term detection, Unsupervised feature learning",

author = "Hongjie Chen and Leung, {Cheung Chi} and Lei Xie and Bin Ma and Haizhou Li",

note = "Publisher Copyright: Copyright {\textcopyright}2016 ISCA.; 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 ; Conference date: 08-09-2016 Through 16-09-2016",

year = "2016",

doi = "10.21437/Interspeech.2016-313",

language = "英语",

volume = "08-12-September-2016",

pages = "923--927",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

Unsupervised Bottleneck features for low-resource query-by-example spoken term detection. / Chen, Hongjie; Leung, Cheung Chi; Xie, Lei et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 08-12-September-2016, 2016, p. 923-927.

Research output: Contribution to journal › Conference article › peer-review

TY - JOUR

T1 - Unsupervised Bottleneck features for low-resource query-by-example spoken term detection

AU - Chen, Hongjie

AU - Leung, Cheung Chi

AU - Xie, Lei

AU - Ma, Bin

AU - Li, Haizhou

PY - 2016

Y1 - 2016

N2 - We propose a framework which ports Dirichlet Gaussian mixture model (DPGMM) based labels to deep neural network (DNN). The DNN trained using the unsupervised labels is used to extract a low-dimensional unsupervised speech representation, named as unsupervised bottleneck features (uBNFs), which capture considerable information for sound cluster discrimination. We investigate the performance of uBNF in queryby-example spoken term detection (QbE-STD) on the TIMIT English speech corpus. Our uBNF performs comparably with the cross-lingual bottleneck features (BNFs) extracted from a DNN trained using 171 hours of transcribed telephone speech in another language (Mandarin Chinese). With the score fusion of uBNFs and cross-lingual BNFs, we gain about 10% relative improvement in terms of mean average precision (MAP) comparing with the cross-lingual BNFs. We also study the performance of the framework with different input features and different lengths of temporal context.

AB - We propose a framework which ports Dirichlet Gaussian mixture model (DPGMM) based labels to deep neural network (DNN). The DNN trained using the unsupervised labels is used to extract a low-dimensional unsupervised speech representation, named as unsupervised bottleneck features (uBNFs), which capture considerable information for sound cluster discrimination. We investigate the performance of uBNF in queryby-example spoken term detection (QbE-STD) on the TIMIT English speech corpus. Our uBNF performs comparably with the cross-lingual bottleneck features (BNFs) extracted from a DNN trained using 171 hours of transcribed telephone speech in another language (Mandarin Chinese). With the score fusion of uBNFs and cross-lingual BNFs, we gain about 10% relative improvement in terms of mean average precision (MAP) comparing with the cross-lingual BNFs. We also study the performance of the framework with different input features and different lengths of temporal context.

KW - Bottleneck feature

KW - Dirichlet process Gaussian mixture model

KW - Low-resource speech processing

KW - Spoken term detection

KW - Unsupervised feature learning

UR - http://www.scopus.com/inward/record.url?scp=84994365860&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2016-313

DO - 10.21437/Interspeech.2016-313

M3 - 会议文章

AN - SCOPUS:84994365860

SN - 2308-457X

VL - 08-12-September-2016

SP - 923

EP - 927

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016

Y2 - 8 September 2016 through 16 September 2016

ER -

Unsupervised Bottleneck features for low-resource query-by-example spoken term detection

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this