Speaker normalization and novel robust speech feature based on Mellin transform

Jingdong Chen; Bo Xu; Taiyi Huang

CAS - Institute of Automation

Research output: Contribution to journal › Article › peer-review

Abstract

One major source of inter-speaker variability in speaker-independent (SI) speech recognition is the variation of the vocal tract shape, especially the vocal tract length (VTL) among individual speakers. If the model of the vocal tract is assumed to be a uniform tube with a length of L, then the formant frequencies of utterances of a given sound are inversely proportional to L. Since the VTL can vary from approximately 13 cm for females to over 18 cm for males, formant center frequencies can vary by as much as 25% among speakers. This source of variability results in state-of-the-art SI speech recognizers working poorly for outlier speakers whose vocal tract shapes differ significantly from those of speakers in the training set. In an effort to reduce the degradation in speech recognition performance caused by the variation of the VTL among speakers, two methods are investigated in this paper. One is to remove the variability with a technique of speaker normalization. Another is to extract a new feature based on the Mellin transform (MT). Because of the scale invariance property of the MT, the new feature is insensitive to the variation of VTL among different speakers. Experiments show that both methods can improve the performance of SI recognizers, while the latter approach is more effective than the former one.
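The two quantitative ideas in the abstract can be checked with a short sketch. It is not the authors' feature extraction. It assumes a quarter-wavelength tube (closed at the glottis, open at the lips) with a nominal sound speed of 350 m/s, and it uses a toy envelope `t * exp(-t)` in place of a real speech spectrum. The names `tube_formants`, `envelope`, and `mellin_magnitude` are hypothetical. The first function shows formants scaling as 1/L. The second shows that the Mellin transform magnitude on the imaginary axis does not change when the argument is rescaled, because substituting t = exp(u) turns a scaling into a shift in u.

```python
import cmath
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed nominal value (about 350 m/s)


def tube_formants(length_cm, count=3):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, count + 1)]


def envelope(t):
    """Toy envelope for illustration only; not taken from the paper."""
    return t * math.exp(-t)


def mellin_magnitude(func, omega, u_min=-30.0, u_max=10.0, step=0.01):
    """Approximate |M(j*omega)|, where M(s) is the integral of func(t) * t**(s - 1) over t > 0.

    With t = exp(u) the integral becomes that of func(exp(u)) * exp(j*omega*u) du,
    evaluated here as a Riemann sum on a uniform grid in u.
    """
    n_steps = int(round((u_max - u_min) / step))
    total = 0j
    for k in range(n_steps + 1):
        u = u_min + k * step
        total += func(math.exp(u)) * cmath.exp(1j * omega * u)
    return abs(total * step)


if __name__ == "__main__":
    # Formants for the two vocal tract lengths quoted in the abstract.
    for length_cm in (13.0, 18.0):
        print(length_cm, [round(f, 1) for f in tube_formants(length_cm)])

    # Rescale the envelope by the same length ratio and compare magnitudes.
    alpha = 18.0 / 13.0

    def scaled(t):
        return envelope(alpha * t)

    for omega in (0.5, 1.0, 2.0, 4.0):
        print(omega, mellin_magnitude(envelope, omega), mellin_magnitude(scaled, omega))
```

Under these assumptions every formant of the 13 cm tube sits 18/13 times higher than the matching formant of the 18 cm tube, while the two magnitude columns printed for the envelope and its rescaled copy agree to within numerical error. Evaluating the transform off the imaginary axis, at s = c + jω, would add a constant factor of alpha to the power -c, which a normalization step could remove.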

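Original language	English
Pages (from-to)	478-484
Number of pages	7
Journal	Zidonghua Xuebao/Acta Automatica Sinica
Volume	26
Issue number	4
Publication status	Published - Jul 2000
Externally published	Yes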

Cite this

@article{a69e51cb031042b4aee490288762b899,

title = "Speaker normalization and novel robust speech feature based on Mellin transform",

author = "Jingdong Chen and Bo Xu and Taiyi Huang",

year = "2000",

month = jul,

language = "English",

volume = "26",

pages = "478--484",

journal = "Zidonghua Xuebao/Acta Automatica Sinica",

issn = "0254-4156",

publisher = "Science Press",

number = "4",

}

TY - JOUR

T1 - Speaker normalization and novel robust speech feature based on Mellin transform

AU - Chen, Jingdong

AU - Xu, Bo

AU - Huang, Taiyi

PY - 2000/7

Y1 - 2000/7

AB - One major source of inter-speaker variability in speaker-independent (SI) speech recognition is the variation of the vocal tract shape, especially the vocal tract length (VTL) among individual speakers. If the model of the vocal tract is assumed to be a uniform tube with a length of L, then the formant frequencies of utterances of a given sound are inversely proportional to L. Since the VTL can vary from approximately 13 cm for females to over 18 cm for males, formant center frequencies can vary by as much as 25% among speakers. This source of variability results in state-of-the-art SI speech recognizers working poorly for outlier speakers whose vocal tract shapes differ significantly from those of speakers in the training set. In an effort to reduce the degradation in speech recognition performance caused by the variation of the VTL among speakers, two methods are investigated in this paper. One is to remove the variability with a technique of speaker normalization. Another is to extract a new feature based on the Mellin transform (MT). Because of the scale invariance property of the MT, the new feature is insensitive to the variation of VTL among different speakers. Experiments show that both methods can improve the performance of SI recognizers, while the latter approach is more effective than the former one.

UR - http://www.scopus.com/inward/record.url?scp=0034216994&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:0034216994

SN - 0254-4156

VL - 26

SP - 478

EP - 484

JO - Zidonghua Xuebao/Acta Automatica Sinica

JF - Zidonghua Xuebao/Acta Automatica Sinica

IS - 4

ER -
