SA-Paraformer: Non-Autoregressive End-To-End Speaker-Attributed ASR

Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

4 Scopus citations

Abstract

Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although they achieve state-of-the-art (SOTA) performance, most of these studies rely on an autoregressive (AR) decoder, which generates tokens one by one and results in a large real-time factor (RTF). To speed up inference, we introduce the recently proposed non-autoregressive model Paraformer as the acoustic model in the SA-ASR model. Paraformer uses a single-step decoder to enable parallel generation, obtaining performance comparable to the SOTA AR transformer models. In addition, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's acoustic modeling ability. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 of the RTF of the SOTA joint AR SA-ASR model.
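The two quantities the abstract reports, RTF and relative error-rate reduction, are simple ratios. The sketch below illustrates the arithmetic only: the function names are hypothetical and every number is made up for the example, so none of it comes from the paper's experiments or code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio that was decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Error reduction expressed as a fraction of the baseline error rate."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical timings for 60 s of audio (illustrative values, not measured).
ar_rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
nar_rtf = real_time_factor(processing_seconds=3.0, audio_seconds=60.0)
print(f"AR RTF={ar_rtf:.3f}, NAR RTF={nar_rtf:.3f}, ratio={nar_rtf / ar_rtf:.2f}")

# Hypothetical error rates in percent (illustrative values, not the paper's).
print(f"relative reduction={relative_reduction(40.0, 38.0):.1%}")
```

Note that a relative reduction is taken with respect to the baseline error rate, so it is generally larger than the absolute difference in percentage points.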

Original language: English
Title of host publication: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350306897
State: Published - 2023
Event: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 - Taipei, Taiwan, Province of China
Duration: 16 Dec 2023 - 20 Dec 2023

Publication series

Name: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

Conference

Conference: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Country/Territory: Taiwan, Province of China
City: Taipei
Period: 16/12/23 - 20/12/23

Keywords

  • AliMeeting
  • multi-speaker ASR
  • non-autoregressive
  • Speaker-attributed ASR
