TY - JOUR
T1 - Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection
AU - Chen, Hongjie
AU - Leung, Cheung-Chi
AU - Xie, Lei
AU - Ma, Bin
AU - Li, Haizhou
N1 - Publisher Copyright:
Copyright ©2016 ISCA.
PY - 2016
Y1 - 2016
AB - We propose a framework which ports Dirichlet process Gaussian mixture model (DPGMM) based labels to a deep neural network (DNN). The DNN trained using the unsupervised labels is used to extract a low-dimensional unsupervised speech representation, named unsupervised bottleneck features (uBNFs), which capture considerable information for sound cluster discrimination. We investigate the performance of uBNFs in query-by-example spoken term detection (QbE-STD) on the TIMIT English speech corpus. Our uBNFs perform comparably with the cross-lingual bottleneck features (BNFs) extracted from a DNN trained using 171 hours of transcribed telephone speech in another language (Mandarin Chinese). With the score fusion of uBNFs and cross-lingual BNFs, we gain about 10% relative improvement in terms of mean average precision (MAP) compared with the cross-lingual BNFs. We also study the performance of the framework with different input features and different lengths of temporal context.
KW - Bottleneck feature
KW - Dirichlet process Gaussian mixture model
KW - Low-resource speech processing
KW - Spoken term detection
KW - Unsupervised feature learning
UR - http://www.scopus.com/inward/record.url?scp=84994365860&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2016-313
DO - 10.21437/Interspeech.2016-313
M3 - Conference article
AN - SCOPUS:84994365860
SN - 2308-457X
VL - 08-12-September-2016
SP - 923
EP - 927
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016
Y2 - 8 September 2016 through 12 September 2016
ER -