Query-by-Example Speech Search Using Recurrent Neural Acoustic Word Embeddings with Temporal Context

Yougen Yuan, Cheung Chi Leung, Lei Xie, Hongjie Chen, Bin Ma

Research output: Contribution to journal › Article › peer-review

14 Scopus citations

Abstract

Acoustic word embeddings (AWEs) have become popular in low-resource query-by-example (QbE) speech search. They use vector distances to locate a spoken query in the search content, which requires far less computation than conventional dynamic time warping (DTW)-based approaches. AWE networks are usually trained on variable-length isolated spoken words, yet they are applied to fixed-length speech segments obtained by shifting an analysis window over the search content. This creates an obvious mismatch between how AWEs are learned and how they are applied to search content. To mitigate this mismatch, we propose to include temporal context information in spoken word pairs when learning recurrent neural AWEs. More specifically, the spoken word pairs are represented by multilingual bottleneck features (BNFs) and padded with the neighboring frames of the target spoken words to form fixed-length speech segment pairs. A deep bidirectional long short-term memory (BLSTM) network is then trained on these segment pairs with a triplet loss. Recurrent neural AWEs are obtained by concatenating the BLSTM backward and forward outputs. During the QbE speech search stage, both the spoken query and the search content are converted into recurrent neural AWEs, and cosine distances between them are measured to locate the spoken query. Experiments show that using temporal context is essential to alleviate the mismatch. The proposed recurrent neural AWEs trained with temporal context outperform previous state-of-the-art features with a 12.5% relative mean average precision (MAP) improvement on QbE speech search.
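The sketch below illustrates the pipeline described in the abstract: a deep BLSTM maps fixed-length BNF segments to embeddings (the concatenated forward and backward outputs), a margin-based triplet loss on cosine distances is used for training, and at search time the query embedding is compared against embeddings of sliding windows over the search content. This is a minimal PyTorch sketch, not the authors' released code; the feature dimension, hidden size, layer count, margin, and segment length are illustrative assumptions, and the exact form of the paper's triplet objective may differ.

```python
# Minimal sketch of a recurrent neural AWE model and QbE search by cosine distance.
# Assumptions (not from the paper): 42-dim BNFs, 3-layer BLSTM with 256 hidden units,
# 100-frame fixed-length segments, cosine-distance triplet loss with margin 0.4.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentAWE(nn.Module):
    """Deep BLSTM mapping a fixed-length BNF segment to an acoustic word embedding."""

    def __init__(self, feat_dim=42, hidden_dim=256, num_layers=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, segments):
        # segments: (batch, frames, feat_dim), already padded with temporal context
        outputs, _ = self.blstm(segments)
        hidden = outputs.size(-1) // 2
        # Embedding = forward output at the last frame + backward output at the first frame.
        fwd_last = outputs[:, -1, :hidden]
        bwd_first = outputs[:, 0, hidden:]
        return torch.cat([fwd_last, bwd_first], dim=-1)


def triplet_loss(anchor, positive, negative, margin=0.4):
    """Hinge loss on cosine distances: same-word pairs closer than
    different-word pairs by at least `margin` (one common formulation)."""
    pos_dist = 1.0 - F.cosine_similarity(anchor, positive)
    neg_dist = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(margin + pos_dist - neg_dist).mean()


if __name__ == "__main__":
    # Toy usage: embed a query and sliding windows of search content, rank by cosine similarity.
    model = RecurrentAWE()
    query = torch.randn(1, 100, 42)       # one fixed-length query segment (random stand-in)
    windows = torch.randn(200, 100, 42)   # sliding windows over the search content
    with torch.no_grad():
        q_emb = model(query)
        w_emb = model(windows)
        scores = F.cosine_similarity(q_emb.expand_as(w_emb), w_emb)
    print("best-matching window index:", scores.argmax().item())
```

At search time only the embedding distances need to be computed, which is the source of the speed advantage over frame-level DTW alignment mentioned in the abstract.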

Original language: English
Article number: 8721106
Pages (from-to): 67656-67665
Number of pages: 10
Journal: IEEE Access
Volume: 7
DOIs
State: Published - 2019

Keywords

  • Acoustic word embeddings
  • bidirectional long short-term memory network
  • query-by-example spoken term detection
  • spoken word pairs
  • temporal context
