TY - JOUR
T1 - SEQ-former
T2 - 25th Interspeech Conference 2024
AU - Meng, Qinglin
AU - Liu, Min
AU - Huang, Kaixun
AU - Wei, Kun
AU - Xie, Lei
AU - Quan, Zongfeng
AU - Deng, Weihong
AU - Lu, Quan
AU - Jiang, Ning
AU - Zhao, Guoqing
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
AB - Contextual information is crucial for automatic speech recognition (ASR). Effective utilization of contextual information can improve the accuracy of ASR systems. To improve the model's ability to capture this information, we propose a novel ASR framework called SEQ-former, emphasizing simplicity, efficiency, and quickness. We incorporate a Prediction Decoder Network and a Shared Prediction Decoder Network to enhance contextual capabilities. To further increase efficiency, we use intermediate CTC and CTC Spike Reduce Methods to guide attention masks and reduce redundant peaks. Our approach demonstrates state-of-the-art performance on the AiShell-1 dataset, improves decoding efficiency, and delivers competitive results on LibriSpeech. Additionally, it achieves a 6.3% improvement over Efficient Conformer on 11,000 hours of private data.
KW - Blank-regularized CTC
KW - Prediction Decoder Network
KW - SEQ-former
KW - contextual information
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85214804346&partnerID=8YFLogxK
DO - 10.21437/Interspeech.2024-243
M3 - Conference article
AN - SCOPUS:85214804346
SN - 2308-457X
SP - 212
EP - 216
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 1 September 2024 through 5 September 2024
ER -