Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation

Yougen Yuan; Cheung Chi Leung; Lei Xie; Hongjie Chen; Bin Ma; Haizhou Li

doi:10.1109/ASRU.2017.8269010

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

16 Scopus citations

Abstract

We propose a framework to learn a frame-level speech representation in a scenario where no manual transcription is available. Our framework is based on pairwise learning using bottleneck features (BNFs). Initial frame-level features are extracted from a bottleneck-shaped multilingual deep neural network (DNN) which is trained with unsupervised phoneme-like labels. Word-like pairs are discovered in the untranscribed speech using the initial features, and frame alignment is performed on each word-like speech pair. The matching frame pairs are used as input-output to train another DNN with the mean square error (MSE) loss function. The final frame-level features are extracted from an internal hidden layer of MSE-based DNN. Our pairwise learned feature representation is evaluated on the ZeroSpeech 2017 challenge. The experiments show that pairwise learning improves phoneme discrimination in 10s and 120s test conditions. We find that it is important to use BNFs as initial features when pairwise learning is performed. With more word pairs obtained from the Switchboard corpus and its manual transcription, the phoneme discrimination of three languages in the evaluation data can further be improved despite data mismatch.
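The frame-alignment step described in the abstract can be illustrated with a small sketch. The abstract does not name the alignment algorithm, so dynamic time warping (DTW) with Euclidean frame distance is an assumption here, and the function names (`dtw_align`, `matched_frame_pairs`, `mse`) are hypothetical helpers for illustration, not the authors' code. The sketch aligns two word-like segments, collects the matched frames as (input, target) pairs, and shows the MSE objective that a second network would minimise over those pairs.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(seq_a, seq_b):
    """Align two frame sequences with DTW (assumed here; the abstract only
    says 'frame alignment'). Returns the warping path as (i, j) index pairs."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance in seq_a only
            (cost[i][j - 1], i, j - 1),          # advance in seq_b only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def matched_frame_pairs(seq_a, seq_b):
    """Turn an aligned word-like pair into (input frame, target frame) pairs."""
    return [(seq_a[i], seq_b[j]) for i, j in dtw_align(seq_a, seq_b)]


def mse(pred, target):
    """Mean square error between a predicted frame and its target frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Toy one-dimensional "features" standing in for bottleneck features.
word_a = [[0.0], [1.0], [2.0]]
word_b = [[0.0], [1.0], [1.0], [2.0]]
pairs = matched_frame_pairs(word_a, word_b)
```

In the framework the abstract describes, pairs like these would come from discovered word-like segments represented by BNFs, and a DNN trained to map each input frame to its target frame under the MSE loss would supply the final features from one of its hidden layers.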

Original language	English
Title of host publication	2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	734-739
Number of pages	6
ISBN (Electronic)	9781509047888
DOIs	https://doi.org/10.1109/ASRU.2017.8269010
State	Published - 2 Jul 2017
Event	2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration	16 Dec 2017 → 20 Dec 2017

Publication series

Name	2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Volume	2018-January

Conference

Conference	2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Country/Territory	Japan
City	Okinawa
Period	16/12/17 → 20/12/17

Keywords

bottleneck features
deep neural network (DNN)
feature representation
Pairwise learning
word-like speech pairs

Access to Document

10.1109/ASRU.2017.8269010

Cite this

Yuan, Y., Leung, C. C., Xie, L., Chen, H., Ma, B., & Li, H. (2017). Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings (pp. 734-739). (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings; Vol. 2018-January). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU.2017.8269010

Yuan, Yougen ; Leung, Cheung Chi ; Xie, Lei et al. / Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation. 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 734-739 (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings).

@inproceedings{f3cafddcbb54485c87dcc55e240fb9f2,
title = "Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation",
abstract = "We propose a framework to learn a frame-level speech representation in a scenario where no manual transcription is available. Our framework is based on pairwise learning using bottleneck features (BNFs). Initial frame-level features are extracted from a bottleneck-shaped multilingual deep neural network (DNN) which is trained with unsupervised phoneme-like labels. Word-like pairs are discovered in the untranscribed speech using the initial features, and frame alignment is performed on each word-like speech pair. The matching frame pairs are used as input-output to train another DNN with the mean square error (MSE) loss function. The final frame-level features are extracted from an internal hidden layer of MSE-based DNN. Our pairwise learned feature representation is evaluated on the ZeroSpeech 2017 challenge. The experiments show that pairwise learning improves phoneme discrimination in 10s and 120s test conditions. We find that it is important to use BNFs as initial features when pairwise learning is performed. With more word pairs obtained from the Switchboard corpus and its manual transcription, the phoneme discrimination of three languages in the evaluation data can further be improved despite data mismatch.",
keywords = "bottleneck features, deep neural network (DNN), feature representation, Pairwise learning, word-like speech pairs",
author = "Yougen Yuan and Leung, {Cheung Chi} and Lei Xie and Hongjie Chen and Bin Ma and Haizhou Li",
note = "Publisher Copyright: {\textcopyright} 2017 IEEE.; 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 ; Conference date: 16-12-2017 Through 20-12-2017",
year = "2017",
month = jul,
day = "2",
doi = "10.1109/ASRU.2017.8269010",
language = "English",
series = "2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "734--739",
booktitle = "2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings",
}

Yuan, Y, Leung, CC, Xie, L, Chen, H, Ma, B & Li, H 2017, Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation. in 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings. 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings, vol. 2018-January, Institute of Electrical and Electronics Engineers Inc., pp. 734-739, 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, 16/12/17. https://doi.org/10.1109/ASRU.2017.8269010

Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation. / Yuan, Yougen; Leung, Cheung Chi; Xie, Lei et al.
2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2017. p. 734-739 (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings; Vol. 2018-January).

TY - GEN
T1 - Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation
AU - Yuan, Yougen
AU - Leung, Cheung Chi
AU - Xie, Lei
AU - Chen, Hongjie
AU - Ma, Bin
AU - Li, Haizhou
PY - 2017/7/2
Y1 - 2017/7/2
AB - We propose a framework to learn a frame-level speech representation in a scenario where no manual transcription is available. Our framework is based on pairwise learning using bottleneck features (BNFs). Initial frame-level features are extracted from a bottleneck-shaped multilingual deep neural network (DNN) which is trained with unsupervised phoneme-like labels. Word-like pairs are discovered in the untranscribed speech using the initial features, and frame alignment is performed on each word-like speech pair. The matching frame pairs are used as input-output to train another DNN with the mean square error (MSE) loss function. The final frame-level features are extracted from an internal hidden layer of MSE-based DNN. Our pairwise learned feature representation is evaluated on the ZeroSpeech 2017 challenge. The experiments show that pairwise learning improves phoneme discrimination in 10s and 120s test conditions. We find that it is important to use BNFs as initial features when pairwise learning is performed. With more word pairs obtained from the Switchboard corpus and its manual transcription, the phoneme discrimination of three languages in the evaluation data can further be improved despite data mismatch.
KW - bottleneck features
KW - deep neural network (DNN)
KW - feature representation
KW - Pairwise learning
KW - word-like speech pairs
UR - http://www.scopus.com/inward/record.url?scp=85050570815&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2017.8269010
DO - 10.1109/ASRU.2017.8269010
M3 - Conference contribution
AN - SCOPUS:85050570815
T3 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
SP - 734
EP - 739
BT - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Y2 - 16 December 2017 through 20 December 2017
ER -

Yuan Y, Leung CC, Xie L, Chen H, Ma B, Li H. Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2017. p. 734-739. (2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings). doi: 10.1109/ASRU.2017.8269010
