TY - JOUR
T1 - A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings
AU - Yu, Fan
AU - Du, Zhihao
AU - Zhang, Shiliang
AU - Lin, Yuxiao
AU - Xie, Lei
N1 - Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic receiving increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR model to recognize utterances. The speaker-attributed transcriptions are obtained by aligning the diarization results with the recognized hypotheses. However, because the two modules are independent, such an alignment strategy may suffer from erroneous timestamps that severely hinder performance. Therefore, we propose the second approach, WD-SOT, which addresses alignment errors by introducing a word-level diarization model and thus removes the dependency on timestamp alignment. To further mitigate the alignment issues, we propose the third approach, TS-ASR, which jointly trains a target-speaker separation module and an ASR module. By comparing various strategies for each SA-ASR approach, experimental results on a real meeting corpus, AliMeeting, reveal that the WD-SOT approach achieves a 10.7% relative reduction in average speaker-dependent character error rate (SD-CER) compared with the FD-SOT approach. In addition, the TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative reduction in average SD-CER.
KW - AliMeeting
KW - multi-speaker ASR
KW - rich transcription
KW - speaker-attributed
UR - http://www.scopus.com/inward/record.url?scp=85140048979&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-11210
DO - 10.21437/Interspeech.2022-11210
M3 - Conference article
AN - SCOPUS:85140048979
SN - 2308-457X
VL - 2022-September
SP - 560
EP - 564
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -