Abstract
One major source of inter-speaker variability in speaker-independent (SI) speech recognition is variation in vocal tract shape, especially vocal tract length (VTL), among individual speakers. If the vocal tract is modeled as a uniform tube of length L, the formant frequencies of a given sound are inversely proportional to L. Since the VTL can vary from approximately 13 cm for females to over 18 cm for males, formant center frequencies can vary by as much as 25% among speakers. Because of this variability, state-of-the-art SI speech recognizers perform poorly on outlier speakers whose vocal tract shapes differ significantly from those of the speakers in the training set. This paper investigates two methods for reducing the degradation in recognition performance caused by VTL variation among speakers. The first removes the variability through speaker normalization. The second extracts a new feature based on the Mellin transform (MT). Because of the scale-invariance property of the MT, the new feature is insensitive to VTL variation among speakers. Experiments show that both methods improve the performance of SI recognizers, and that the second is more effective than the first.
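The inverse relation between formant frequency and tube length can be illustrated with a short sketch. It is not taken from the paper: it assumes the textbook quarter-wavelength resonator (a uniform tube closed at the glottis and open at the lips), for which the n-th resonance is F_n = (2n - 1)c / (4L). The speed of sound `c` and the function name `tube_formants` are assumptions chosen for illustration.

```python
# Assumed speed of sound, about 350 m/s, expressed in cm/s.
SPEED_OF_SOUND_CM_PER_S = 35000.0


def tube_formants(length_cm, count=3, c=SPEED_OF_SOUND_CM_PER_S):
    """Return the first `count` resonances (Hz) of a uniform tube.

    Assumes a tube closed at one end and open at the other, so the
    n-th resonance is (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, count + 1)]


if __name__ == "__main__":
    # Tube lengths spanning the range quoted in the abstract.
    for length in (13.0, 17.5, 18.0):
        freqs = ", ".join(f"{f:.0f} Hz" for f in tube_formants(length))
        print(f"L = {length:4.1f} cm: {freqs}")
```

Under this simplified model, changing L rescales every formant by the same factor, so a difference in VTL appears as a uniform scaling of the frequency axis. That scaling is the effect that speaker normalization and a scale-invariant feature are each meant to counteract.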
Original language | English |
---|---|
Pages (from-to) | 478-484 |
Number of pages | 7 |
Journal | Zidonghua Xuebao/Acta Automatica Sinica |
Volume | 26 |
Issue number | 4 |
State | Published - Jul 2000 |
Externally published | Yes |