SA-Paraformer: Non-Autoregressive End-To-End Speaker-Attributed ASR

Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract

Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although able to achieve state-of-the-art (SOTA) performance, most studies rely on an autoregressive (AR) decoder, which generates tokens one by one and thus incurs a large real-time factor (RTF). To speed up inference, we introduce the recently proposed non-autoregressive model Paraformer as the acoustic model in the SA-ASR framework. Paraformer uses a single-step decoder to enable parallel token generation, achieving performance comparable to SOTA AR transformer models. In addition, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's acoustic modeling ability. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 of the RTF of the SOTA joint AR SA-ASR model.
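For readers unfamiliar with the inter-CTC idea mentioned in the abstract, the sketch below shows the general technique: an auxiliary CTC loss is attached to an intermediate encoder layer and summed with the final-layer CTC loss, encouraging lower layers to learn stronger acoustic representations. This is a minimal PyTorch illustration under assumed settings, not the authors' implementation; the layer sizes, the tap position, and the 0.7/0.3 loss weights are all hypothetical.

```python
import torch.nn as nn


class InterCTCEncoder(nn.Module):
    """Minimal sketch of a Transformer encoder with an auxiliary
    intermediate CTC branch (inter-CTC). All dimensions, the tap
    layer, and the loss weights are illustrative assumptions."""

    def __init__(self, dim=256, vocab_size=4000, n_layers=12, inter_layer=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.inter_layer = inter_layer
        # One projection shared by both CTC branches (a common choice).
        self.ctc_proj = nn.Linear(dim, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, targets, target_lens):
        # feats: (B, T, dim) acoustic features; targets: (B, S) label ids
        inter_logits = None
        x = feats
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i == self.inter_layer:
                inter_logits = self.ctc_proj(x)  # tap an intermediate layer
        final_logits = self.ctc_proj(x)

        def ctc(logits):
            # CTCLoss expects log-probs of shape (T, B, V).
            log_probs = logits.log_softmax(-1).transpose(0, 1)
            return self.ctc_loss(log_probs, targets, feat_lens, target_lens)

        # Total loss: weighted sum of final and intermediate CTC losses.
        loss = 0.7 * ctc(final_logits) + 0.3 * ctc(inter_logits)
        return loss, x
```

The intermediate loss acts as a deep-supervision regularizer: gradients reach the lower layers directly instead of only through the full encoder stack, which is why sharing the projection head between the two branches is a common design choice.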

Original language: English
Title of host publication: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350306897
DOI
Publication status: Published - 2023
Event: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 - Taipei, Taiwan, China
Duration: 16 Dec 2023 → 20 Dec 2023

Publication series

Name: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

Conference

Conference: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Country/Territory: Taiwan, China
City: Taipei
Period: 16/12/23 → 20/12/23
