SEQ-former: A context-enhanced and efficient automatic speech recognition framework

Qinglin Meng, Min Liu, Kaixun Huang, Kun Wei, Lei Xie, Zongfeng Quan, Weihong Deng, Quan Lu, Ning Jiang, Guoqing Zhao

Research output: Contribution to journal › Conference article › peer-review

Abstract

Contextual information is crucial for automatic speech recognition (ASR): using it effectively can improve recognition accuracy. To strengthen the model's ability to capture this information, we propose a novel ASR framework called SEQ-former, emphasizing simplicity, efficiency, and quickness. We incorporate a Prediction Decoder Network and a Shared Prediction Decoder Network to enhance contextual modeling. To further increase efficiency, we use intermediate CTC and a CTC Spike Reduce method to guide attention masks and reduce redundant peaks. Our approach achieves state-of-the-art performance on the AiShell-1 dataset, improves decoding efficiency, and delivers competitive results on LibriSpeech; it also yields a 6.3% improvement over Efficient Conformer on 11,000 hours of private data.
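As a rough illustration of the intermediate CTC idea mentioned in the abstract, the sketch below combines the final CTC loss with an auxiliary CTC loss computed from an intermediate encoder layer. This is a minimal PyTorch sketch, not the authors' implementation; the 0.3 weighting and the use of a single intermediate layer are assumptions made for illustration only.

  import torch
  import torch.nn.functional as F

  def ctc_with_intermediate_loss(final_logits, inter_logits, targets,
                                 input_lengths, target_lengths,
                                 inter_weight=0.3):
      # final_logits / inter_logits: raw encoder outputs of shape (T, N, C),
      # where inter_logits come from a projection on an intermediate layer.
      # Both are converted to log-probabilities before the CTC loss.
      final_loss = F.ctc_loss(final_logits.log_softmax(-1), targets,
                              input_lengths, target_lengths, blank=0)
      inter_loss = F.ctc_loss(inter_logits.log_softmax(-1), targets,
                              input_lengths, target_lengths, blank=0)
      # Interpolate the two losses; inter_weight is an illustrative value,
      # not a setting reported in the paper.
      return (1 - inter_weight) * final_loss + inter_weight * inter_loss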

Original language: English
Pages (from-to): 212-216
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 - 5 Sep 2024

Keywords

  • Blank-regularized CTC
  • Prediction Decoder Network
  • SEQ-former
  • contextual information
  • speech recognition
