Query-by-Example Speech Search Using Recurrent Neural Acoustic Word Embeddings with Temporal Context

Yougen Yuan, Cheung Chi Leung, Lei Xie, Hongjie Chen, Bin Ma

Research output: Contribution to journal › Article › peer-review

14 Scopus citations

Abstract

Acoustic word embeddings (AWEs) have become popular in low-resource query-by-example (QbE) speech search. They use vector distances to locate a spoken query in the search content, which requires far less computation than conventional dynamic time warping (DTW)-based approaches. AWE networks are usually trained on variable-length isolated spoken words, yet they are applied to fixed-length speech segments obtained by shifting an analysis window over the search content. This creates an obvious mismatch between how AWEs are learned and how they are applied to search content. To mitigate this mismatch, we propose to include temporal context information in spoken word pairs when learning recurrent neural AWEs. More specifically, the spoken word pairs are represented by multilingual bottleneck features (BNFs) and padded with the neighboring frames of the target spoken words to form fixed-length speech segment pairs. A deep bidirectional long short-term memory (BLSTM) network is then trained on these segment pairs with a triplet loss. Recurrent neural AWEs are obtained by concatenating the BLSTM backward and forward outputs. During the QbE speech search stage, both the spoken query and the search content are converted into recurrent neural AWEs, and cosine distances between them are measured to locate the spoken query. Experiments show that using temporal context is essential to alleviate the mismatch. The proposed recurrent neural AWEs trained with temporal context outperform previous state-of-the-art features with a 12.5% relative mean average precision (MAP) improvement on QbE speech search.
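The sketch below illustrates the pipeline described in the abstract: a deep BLSTM maps fixed-length BNF segments to embeddings (the concatenated forward and backward outputs), a margin-based triplet loss on cosine distances is used for training, and at search time the query embedding is compared against embeddings of sliding windows over the search content. This is a minimal PyTorch sketch, not the authors' released code; the feature dimension, hidden size, layer count, margin, and segment length are illustrative assumptions, and the exact form of the paper's triplet objective may differ.

```python
# Minimal sketch of a recurrent neural AWE model and QbE search by cosine distance.
# Assumptions (not from the paper): 42-dim BNFs, 3-layer BLSTM with 256 hidden units,
# 100-frame fixed-length segments, cosine-distance triplet loss with margin 0.4.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentAWE(nn.Module):
    """Deep BLSTM mapping a fixed-length BNF segment to an acoustic word embedding."""

    def __init__(self, feat_dim=42, hidden_dim=256, num_layers=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, segments):
        # segments: (batch, frames, feat_dim), already padded with temporal context
        outputs, _ = self.blstm(segments)
        hidden = outputs.size(-1) // 2
        # Embedding = forward output at the last frame + backward output at the first frame.
        fwd_last = outputs[:, -1, :hidden]
        bwd_first = outputs[:, 0, hidden:]
        return torch.cat([fwd_last, bwd_first], dim=-1)


def triplet_loss(anchor, positive, negative, margin=0.4):
    """Hinge loss on cosine distances: same-word pairs closer than
    different-word pairs by at least `margin` (one common formulation)."""
    pos_dist = 1.0 - F.cosine_similarity(anchor, positive)
    neg_dist = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(margin + pos_dist - neg_dist).mean()


if __name__ == "__main__":
    # Toy usage: embed a query and sliding windows of search content, rank by cosine similarity.
    model = RecurrentAWE()
    query = torch.randn(1, 100, 42)       # one fixed-length query segment (random stand-in)
    windows = torch.randn(200, 100, 42)   # sliding windows over the search content
    with torch.no_grad():
        q_emb = model(query)
        w_emb = model(windows)
        scores = F.cosine_similarity(q_emb.expand_as(w_emb), w_emb)
    print("best-matching window index:", scores.argmax().item())
```

At search time only the embedding distances need to be computed, which is the source of the speed advantage over frame-level DTW alignment mentioned in the abstract.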

Original language: English
Article number: 8721106
Pages (from-to): 67656-67665
Number of pages: 10
Journal: IEEE Access
Volume: 7
DOIs
State: Published - 2019

Keywords

  • Acoustic word embeddings
  • bidirectional long short-term memory network
  • query-by-example spoken term detection
  • spoken word pairs
  • temporal context
