TY - JOUR
T1 - BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
AU - Liang, Yuhao
AU - Yu, Fan
AU - Li, Yangze
AU - Guo, Pengcheng
AU - Zhang, Shiliang
AU - Chen, Qian
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.
PY - 2023
Y1 - 2023
AB - The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides typical character error rate (CER), we introduce utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
KW - automatic speech recognition
KW - multi-talker
KW - multi-task learning
UR - http://www.scopus.com/inward/record.url?scp=85171530428&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-1521
DO - 10.21437/Interspeech.2023-1521
M3 - Conference article
AN - SCOPUS:85171530428
SN - 2308-457X
VL - 2023-August
SP - 3487
EP - 3491
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 20 August 2023 through 24 August 2023
ER -