Abstract
Contextual information is crucial for automatic speech recognition (ASR), and using it effectively can improve the accuracy of ASR systems. To improve the model's ability to capture this information, we propose a novel ASR framework called SEQ-former, emphasizing simplicity, efficiency, and quickness. We incorporate a Prediction Decoder Network and a Shared Prediction Decoder Network to enhance contextual capabilities. To further increase efficiency, we use intermediate CTC and CTC Spike Reduce Methods to guide attention masks and reduce redundant peaks. Our approach demonstrates state-of-the-art performance on the AiShell-1 dataset, improves decoding efficiency, and delivers competitive results on LibriSpeech. Additionally, it yields a 6.3% improvement over Efficient Conformer on 11,000 hours of private data.
| Original language | English |
|---|---|
| Pages (from-to) | 212-216 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| State | Published - 2024 |
| Event | 25th Interspeech Conference 2024 - Kos Island, Greece; Duration: 1 Sep 2024 → 5 Sep 2024 |
Keywords
- Blank-regularized CTC
- Prediction Decoder Network
- SEQ-former
- contextual information
- speech recognition
Fingerprint
Dive into the research topics of 'SEQ-former: A context-enhanced and efficient automatic speech recognition framework'. Together they form a unique fingerprint.