TY - JOUR
T1 - Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection
AU - Chen, Hongjie
AU - Leung, Cheung-Chi
AU - Xie, Lei
AU - Ma, Bin
AU - Li, Haizhou
N1 - Publisher Copyright:
Copyright ©2016 ISCA.
PY - 2016
Y1 - 2016
AB - We propose a framework which ports Dirichlet process Gaussian mixture model (DPGMM) based labels to a deep neural network (DNN). The DNN trained using the unsupervised labels is used to extract a low-dimensional unsupervised speech representation, named unsupervised bottleneck features (uBNFs), which capture considerable information for sound cluster discrimination. We investigate the performance of uBNFs in query-by-example spoken term detection (QbE-STD) on the TIMIT English speech corpus. Our uBNFs perform comparably with the cross-lingual bottleneck features (BNFs) extracted from a DNN trained using 171 hours of transcribed telephone speech in another language (Mandarin Chinese). With the score fusion of uBNFs and cross-lingual BNFs, we gain about 10% relative improvement in terms of mean average precision (MAP) compared with the cross-lingual BNFs. We also study the performance of the framework with different input features and different lengths of temporal context.
KW - Bottleneck feature
KW - Dirichlet process Gaussian mixture model
KW - Low-resource speech processing
KW - Spoken term detection
KW - Unsupervised feature learning
UR - http://www.scopus.com/inward/record.url?scp=84994365860&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2016-313
DO - 10.21437/Interspeech.2016-313
M3 - Conference article
AN - SCOPUS:84994365860
SN - 2308-457X
VL - 08-12-September-2016
SP - 923
EP - 927
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016
Y2 - 8 September 2016 through 12 September 2016
ER -