Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

Peikun Chen; Sining Sun; Changhao Shan; Qing Yang; Lei Xie

doi:10.21437/Interspeech.2024-1853

School of Computer Science

Research output: Contribution to journal › Conference article › Peer-reviewed

2 citations (Scopus)

Abstract

Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2 datasets demonstrate the competitive performance of our streaming approach with non-streaming decoder-only counterparts. The code we used for this work can be found here.

Original language	English
Pages (from-to)	4468-4472
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOI	https://doi.org/10.21437/Interspeech.2024-1853
Publication status	Published - 2024
Event	25th Interspeech Conference 2024 - Kos Island, Greece. Duration: 1 Sep 2024 → 5 Sep 2024

Cite this

@article{3367b89b56504b6daf37d9155b9b48b9,

title = "Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study",

abstract = "Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2 datasets demonstrate the competitive performance of our streaming approach with non-streaming decoder-only counterparts. The code we used for this work can be found here.",

keywords = "decoder-only Transformer, discrete-token, streaming automatic speech recognition",

author = "Peikun Chen and Sining Sun and Changhao Shan and Qing Yang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2024 International Speech Communication Association. All rights reserved.; 25th Interspeech Conference 2024; Conference date: 01-09-2024 through 05-09-2024",

year = "2024",

doi = "10.21437/Interspeech.2024-1853",

language = "English",

pages = "4468--4472",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

T2 - 25th Interspeech Conference 2024

AU - Chen, Peikun

AU - Sun, Sining

AU - Shan, Changhao

AU - Yang, Qing

AU - Xie, Lei

PY - 2024

Y1 - 2024

AB - Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2 datasets demonstrate the competitive performance of our streaming approach with non-streaming decoder-only counterparts. The code we used for this work can be found here.

KW - decoder-only Transformer

KW - discrete-token

KW - streaming automatic speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85202367219&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2024-1853

DO - 10.21437/Interspeech.2024-1853

M3 - Conference article

AN - SCOPUS:85202367219

SN - 2308-457X

SP - 4468

EP - 4472

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Y2 - 1 September 2024 through 5 September 2024

ER -
