Abstract
UniSpeech has achieved superior performance in cross-lingual automatic speech recognition (ASR) by explicitly aligning latent representations to phoneme units using multi-task self-supervised learning. While the learned representations transfer well from high-resource to low-resource languages, predicting words directly from these phonetic representations in downstream ASR is challenging. In this paper, we propose TranUSR, a two-stage model comprising a pre-trained UniData2vec and a phoneme-to-word Transcoder. Different from UniSpeech, UniData2vec replaces the quantized discrete representations with continuous and contextual representations from a teacher model for phonetically-aware pre-training. Then, Transcoder learns to translate phonemes to words with the aid of extra texts, enabling direct word generation. Experiments on Common Voice show that UniData2vec reduces PER by 5.3% compared to UniSpeech, while Transcoder yields a 14.4% WER reduction compared to grapheme fine-tuning.
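The abstract's key pre-training change, replacing UniSpeech's quantized discrete targets with continuous, contextual representations from a teacher model, follows the data2vec recipe, where the teacher's weights track the student's by exponential moving average (EMA) and the student regresses onto the teacher's outputs. Below is a minimal, illustrative sketch of those two ingredients in NumPy; the function names and the simple L1 regression loss are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def ema_update(teacher_w, student_w, decay=0.99):
    """Move teacher weights toward student weights by exponential
    moving average; the teacher is never updated by gradients."""
    return decay * teacher_w + (1.0 - decay) * student_w

def regression_loss(student_repr, teacher_repr):
    """Regress student outputs onto continuous teacher targets (here a
    plain mean-absolute-error), in place of UniSpeech's quantized
    discrete targets. Purely illustrative, not the paper's exact loss."""
    return float(np.mean(np.abs(student_repr - teacher_repr)))

# Toy usage: with a fixed student, repeated EMA steps pull the teacher
# arbitrarily close to the student's weights.
rng = np.random.default_rng(0)
student = rng.normal(size=4)
teacher = np.zeros(4)
for _ in range(1000):
    teacher = ema_update(teacher, student, decay=0.99)
```

In the full model, the same student is additionally supervised with a phoneme-level objective during pre-training (the "phonetically-aware" part), and the Transcoder then maps the resulting phoneme sequences to words.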
| Original language | English |
|---|---|
| Pages (from-to) | 216-220 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Volume | 2023-August |
| DOI | |
| Publication status | Published - 2023 |
| Event | 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland. Duration: 20 Aug 2023 → 24 Aug 2023 |
Fingerprint

Explore the research topics of 'TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition'.