BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

  • Yuhao Liang
  • Fan Yu
  • Yangze Li
  • Pengcheng Guo
  • Shiliang Zhang
  • Qian Chen
  • Lei Xie
  • Northwestern Polytechnical University Xian
  • Alibaba Group Holding Ltd.

Research output: Contribution to journal › Conference article › peer-review

9 Scopus citations

Abstract

The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and a boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. In addition to the typical character error rate (CER), we introduce the utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
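
As an illustration of the serialized target format described in the abstract, the following minimal sketch joins the transcripts of several utterances into one label sequence separated by a speaker-change token. The token name `<sc>`, the `Utterance` type, the helper names, and the ordering by start time are assumptions made for this example; the abstract states only that transcriptions are separated by a special token, and this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

SC_TOKEN = "<sc>"  # assumed name for the special speaker-change token


@dataclass
class Utterance:
    speaker: str   # speaker label (not emitted in the serialized target)
    start: float   # utterance start time in seconds
    text: str      # reference transcript


def serialize_sot(utterances: List[Utterance], sc_token: str = SC_TOKEN) -> str:
    """Build one serialized target string.

    Assumption: utterances are ordered by start time, and the
    speaker-change token is placed between consecutive utterances.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sc_token} ".join(u.text for u in ordered)


def split_sot(serialized: str, sc_token: str = SC_TOKEN) -> List[str]:
    """Recover per-utterance transcripts from a serialized string."""
    return [segment.strip() for segment in serialized.split(sc_token)]


if __name__ == "__main__":
    mixture = [
        Utterance("B", 1.2, "fine thanks"),
        Utterance("A", 0.0, "how are you"),
    ]
    target = serialize_sot(mixture)
    print(target)
    print(split_sot(target))
```

In this format, a misplaced or missing speaker-change token shifts characters into the wrong utterance, which is the kind of error that a metric sensitive to speaker change positions is intended to expose.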

Original language: English
Pages (from-to): 3487-3491
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
State: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023

Keywords

  • automatic speech recognition
  • multi-talker
  • multi-task learning
