Speaker normalization and novel robust speech feature based on Mellin transform

Jingdong Chen; Bo Xu; Taiyi Huang

CAS - Institute of Automation

Research output: Contribution to journal › Article › peer-review

Abstract

One major source of inter-speaker variability in speaker-independent (SI) speech recognition is the variation in vocal tract shape, especially vocal tract length (VTL), among individual speakers. If the vocal tract is modeled as a uniform tube of length L, then the formant frequencies of utterances of a given sound are inversely proportional to L. Since the VTL can vary from approximately 13 cm for females to over 18 cm for males, formant center frequencies can vary by as much as 25% among speakers. Because of this variability, state-of-the-art SI speech recognizers perform poorly for outlier speakers whose vocal tract shapes differ significantly from those of the speakers in the training set. To reduce the degradation in recognition performance caused by VTL variation among speakers, this paper investigates two methods. The first removes the variability through speaker normalization. The second extracts a new feature based on the Mellin transform (MT). Because the MT is scale invariant, the new feature is insensitive to VTL variation among speakers. Experiments show that both methods improve the performance of SI recognizers, and that the second approach is more effective than the first.
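The scale-invariance argument in the abstract can be illustrated numerically. The sketch below is not the feature extraction used in the paper, which the abstract does not describe. It assumes a toy spectral envelope (Gaussian bumps at made-up formant frequencies) and models a VTL change as a pure scaling of the frequency axis, consistent with the uniform-tube assumption. The names `envelope`, `mellin_magnitude`, and `alpha` are hypothetical. With s = jω, the Mellin transform M(s) = ∫ f(x) x^(s-1) dx becomes a Fourier integral in u = ln x, so scaling x only shifts u and leaves |M| unchanged.

```python
import numpy as np

def envelope(freq_hz):
    """Toy spectral envelope: Gaussian bumps at assumed formant frequencies (Hz)."""
    formants = (500.0, 1500.0, 2500.0)  # illustrative values, not from the paper
    width = 60.0                        # illustrative bandwidth parameter (Hz)
    return sum(np.exp(-0.5 * ((freq_hz - f) / width) ** 2) for f in formants)

def mellin_magnitude(func, omegas, x_min=50.0, x_max=8000.0, n=8192):
    """Approximate |M(j*omega)| for M(s) = integral of func(x) * x**(s - 1) dx.

    Substituting x = exp(u) turns the integral into a Fourier integral of
    func(exp(u)) over u, evaluated here as a Riemann sum on a uniform u grid.
    """
    u = np.linspace(np.log(x_min), np.log(x_max), n)
    du = u[1] - u[0]
    samples = func(np.exp(u))
    kernel = np.exp(1j * np.outer(omegas, u))
    return np.abs(kernel @ samples) * du

# Frequency-axis scale factor from the VTL figures quoted in the abstract
# (18 cm vs 13 cm); treating the whole envelope as scaled is an assumption.
alpha = 18.0 / 13.0
omegas = np.linspace(0.0, 40.0, 81)

mag_ref = mellin_magnitude(envelope, omegas)
mag_scaled = mellin_magnitude(lambda x: envelope(alpha * x), omegas)

print("max |difference| of Mellin magnitudes:", np.max(np.abs(mag_ref - mag_scaled)))
```

The two envelopes differ substantially on a linear frequency axis, yet their Mellin magnitudes agree to within numerical error, which is the property the abstract relies on.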

Original language	English
Pages (from-to)	478-484
Number of pages	7
Journal	Zidonghua Xuebao/Acta Automatica Sinica
Volume	26
Issue number	4
State	Published - Jul 2000
Externally published	Yes

Cite this

@article{a69e51cb031042b4aee490288762b899,

title = "Speaker normalization and novel robust speech feature based on Mellin transform",

abstract = "One major source of inter-speaker variability in speaker-independent (SI) speech recognition is the variation of the vocal tract shape, especially the vocal tract length (VTL) among individual speakers. If the model of the vocal tract is assumed to be a uniform tube with a length of L, then the formant frequencies of utterances of a given sound are inversely proportional to L. Since the VTL can vary from approximately 13 cm for females to over 18 cm for males, formant center frequencies can vary by as much as 25% among speakers. This source of variability results in state-of-the-art SI speech recognizers working poorly for outlier speakers whose vocal tract shapes differ significantly from those of speakers in the training set. In an effort to reduce the degradation in speech recognition performance caused by the variation of the VTL among speakers, two methods are investigated in this paper. One is to remove the variability with a technique of speaker normalization. Another is to extract a new feature based on the Mellin transform (MT). Because of the scale invariance property of the MT, the new feature is insensitive to the variation of VTL among different speakers. Experiments show that both methods can improve the performance of SI recognizers, while the latter approach is more effective than the former one.",

author = "Jingdong Chen and Bo Xu and Taiyi Huang",

year = "2000",

month = jul,

language = "English",

volume = "26",

pages = "478--484",

journal = "Zidonghua Xuebao/Acta Automatica Sinica",

issn = "0254-4156",

publisher = "Science Press",

number = "4",

}

TY - JOUR

T1 - Speaker normalization and novel robust speech feature based on Mellin transform

AU - Chen, Jingdong

AU - Xu, Bo

AU - Huang, Taiyi

PY - 2000/7

Y1 - 2000/7

N2 - One major source of inter-speaker variability in speaker-independent (SI) speech recognition is the variation of the vocal tract shape, especially the vocal tract length (VTL) among individual speakers. If the model of the vocal tract is assumed to be a uniform tube with a length of L, then the formant frequencies of utterances of a given sound are inversely proportional to L. Since the VTL can vary from approximately 13 cm for females to over 18 cm for males, formant center frequencies can vary by as much as 25% among speakers. This source of variability results in state-of-the-art SI speech recognizers working poorly for outlier speakers whose vocal tract shapes differ significantly from those of speakers in the training set. In an effort to reduce the degradation in speech recognition performance caused by the variation of the VTL among speakers, two methods are investigated in this paper. One is to remove the variability with a technique of speaker normalization. Another is to extract a new feature based on the Mellin transform (MT). Because of the scale invariance property of the MT, the new feature is insensitive to the variation of VTL among different speakers. Experiments show that both methods can improve the performance of SI recognizers, while the latter approach is more effective than the former one.

UR - http://www.scopus.com/inward/record.url?scp=0034216994&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:0034216994

SN - 0254-4156

VL - 26

SP - 478

EP - 484

JO - Zidonghua Xuebao/Acta Automatica Sinica

JF - Zidonghua Xuebao/Acta Automatica Sinica

IS - 4

ER -
