Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASR

  • Longhao Li
  • Yangze Li
  • Hongfei Xue
  • Jie Liu
  • Shuai Fang
  • Kai Wang
  • Lei Xie

Research output: Contribution to journal › Conference article › peer-review

Abstract

CTC-based streaming ASR has gained significant attention in real-world applications but faces two main challenges: accuracy degradation with small chunks and token emission latency. To mitigate these challenges, we propose Delayed-KD, which applies delayed knowledge distillation on CTC posterior probabilities from a non-streaming teacher to a streaming student model. Specifically, with a tiny chunk size, we introduce a Temporal Alignment Buffer (TAB) that defines a relative delay range with respect to the non-streaming teacher model, aligning CTC outputs and mitigating non-blank token mismatches. Additionally, TAB enables fine-grained control over token emission delay. Experiments on the 178-hour AISHELL-1 and 10,000-hour WenetSpeech Mandarin datasets show the consistent superiority of Delayed-KD. Impressively, Delayed-KD at 40 ms latency achieves a character error rate (CER) of 5.42% on AISHELL-1, comparable to the competitive U2++ model running at 320 ms latency.
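For intuition only: the abstract describes frame-level distillation of CTC posteriors in which the streaming student is allowed to lag the non-streaming teacher by a bounded delay. The sketch below is one plausible reading of that idea, not the paper's actual formulation; `delayed_kd_loss` and `tab_range` are hypothetical names, and selecting the best-aligned delay inside the buffer is an assumption.

```python
import torch
import torch.nn.functional as F

def delayed_kd_loss(student_logits, teacher_logits, tab_range=(0, 4)):
    """Illustrative delayed-KD loss between CTC posteriors.

    student_logits: (T, V) frame-level logits from the streaming student.
    teacher_logits: (T, V) frame-level logits from the non-streaming teacher.
    tab_range: (min_delay, max_delay) in frames, standing in for the
        Temporal Alignment Buffer; the student may emit up to max_delay
        frames later than the teacher.

    NOTE: the min-over-delays selection below is an assumption made for
    this sketch, not the paper's exact loss.
    """
    T = student_logits.size(0)
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)

    losses = []
    for d in range(tab_range[0], tab_range[1] + 1):
        n = T - d
        if n <= 0:
            continue
        # Compare student frame t with teacher frame t - d, i.e. allow
        # the student's emissions to arrive d frames late.
        kl = F.kl_div(log_p_student[d:], p_teacher[:n], reduction="batchmean")
        losses.append(kl)
    # Use the delay inside the buffer that best aligns the two posteriors.
    return torch.stack(losses).min()
```

Under this reading, widening `tab_range` trades token emission delay against how strictly the student must mirror the teacher's timing, which matches the abstract's claim that TAB gives fine-grained control over emission latency.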

Original language: English
Pages (from-to): 4413-4417
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2025
Event: 26th Interspeech Conference 2025, Rotterdam, Netherlands
Duration: 17 Aug 2025 – 21 Aug 2025

Keywords

  • knowledge distillation
  • streaming speech recognition
  • Temporal Alignment Buffer
  • token emission delay
