Multi-Speaker ASR combining non-autoregressive conformer CTC and conditional speaker chain

Pengcheng Guo; Xuankai Chang; Shinji Watanabe; Lei Xie

doi:10.21437/Interspeech.2021-2155

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Citations (Scopus)

Abstract

Non-autoregressive (NAR) models have achieved a large reduction in inference computation and results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches for sequence-to-multi-sequence problems, such as multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one by one using both the input mixture speech and previously estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total number of inference steps is restricted to the number of mixed speakers. In addition, we adopt the Conformer and incorporate an intermediate CTC loss to improve performance. Experiments on the WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase in latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including data with variable numbers of speakers, our model can even outperform the PIT-Conformer AR model with only 1/7 of the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets. All of our code is publicly available at https://github.com/pengchengguo/espnet/tree/conditionalmultispk.
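The inference loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' ESPnet implementation: `step_fn`, `toy_step`, the `BLANK` index, and the canned posteriors are hypothetical stand-ins, and decoding uses plain best-path (greedy) CTC for simplicity. The sketch only shows that one parallel CTC pass is run per speaker, with each pass conditioned on the features carried over from the previous one.

```python
from typing import Any, Callable, List, Tuple

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)

Posterior = List[List[float]]  # one score vector over the vocabulary per frame


def ctc_greedy_decode(posterior: Posterior, blank: int = BLANK) -> List[int]:
    """Best-path CTC decoding: per-frame argmax, merge repeats, drop blanks."""
    tokens: List[int] = []
    prev = None
    for frame in posterior:
        idx = max(range(len(frame)), key=frame.__getitem__)
        if idx != prev and idx != blank:
            tokens.append(idx)
        prev = idx
    return tokens


def conditional_chain_decode(
    mixture: Any,
    num_speakers: int,
    step_fn: Callable[[Any, Any], Tuple[Posterior, Any]],
) -> List[List[int]]:
    """Decode speakers one by one; each step is a single non-autoregressive pass.

    step_fn(mixture, condition) is a hypothetical model call that returns the
    CTC posterior for the current speaker and the conditional features that
    are passed on to the next step.
    """
    condition = None  # no previous speaker at the first step
    hypotheses: List[List[int]] = []
    for _ in range(num_speakers):  # number of steps == number of speakers
        posterior, condition = step_fn(mixture, condition)
        hypotheses.append(ctc_greedy_decode(posterior))
    return hypotheses


def toy_step(mixture: Any, condition: Any) -> Tuple[Posterior, Any]:
    """Stand-in for the encoder: returns canned posteriors for two speakers."""
    step = 0 if condition is None else condition + 1
    canned = [
        [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]],
        [[0.1, 0.1, 0.8], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]],
    ]
    return canned[step], step


hyps = conditional_chain_decode(mixture=[0.0, 0.1], num_speakers=2, step_fn=toy_step)
print(hyps)  # [[1, 2], [2, 2, 1]]
```

Because every token of a speaker's transcript is produced in one pass, the loop runs only as many times as there are speakers, which is the source of the latency advantage over token-by-token AR decoding mentioned in the abstract.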

Original language	English
Title of host publication	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher	International Speech Communication Association
Pages	1401-1405
Number of pages	5
ISBN (Electronic)	9781713836902
DOI	https://doi.org/10.21437/Interspeech.2021-2155
Publication status	Published - 2021
Event	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic; Duration: 30 Aug 2021 → 3 Sep 2021

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory	Czech Republic
City	Brno
Period	30/08/21 → 3/09/21

Cite this

Guo, P., Chang, X., Watanabe, S., & Xie, L. (2021). Multi-Speaker ASR combining non-autoregressive conformer CTC and conditional speaker chain. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 (pp. 1401-1405). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2021-2155

Guo, Pengcheng ; Chang, Xuankai ; Watanabe, Shinji et al. / Multi-Speaker ASR combining non-autoregressive conformer CTC and conditional speaker chain. 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 1401-1405 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH).

@inproceedings{3e37016872d54c53851244e0530009dc,

title = "Multi-Speaker ASR combining non-autoregressive conformer CTC and conditional speaker chain",

abstract = "Non-autoregressive (NAR) models have achieved a large reduction in inference computation and results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches for sequence-to-multi-sequence problems, such as multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one by one using both the input mixture speech and previously estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total number of inference steps is restricted to the number of mixed speakers. In addition, we adopt the Conformer and incorporate an intermediate CTC loss to improve performance. Experiments on the WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase in latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including data with variable numbers of speakers, our model can even outperform the PIT-Conformer AR model with only 1/7 of the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets. All of our code is publicly available at https://github.com/pengchengguo/espnet/tree/conditionalmultispk.",

keywords = "Conditional chain model, Multi-speaker speech recognition, Non-autoregressive",

author = "Pengcheng Guo and Xuankai Chang and Shinji Watanabe and Lei Xie",

note = "Publisher Copyright: Copyright {\textcopyright} 2021 ISCA.; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 ; Conference date: 30-08-2021 Through 03-09-2021",

year = "2021",

doi = "10.21437/Interspeech.2021-2155",

language = "English",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "1401--1405",

booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021",

}

Guo, P, Chang, X, Watanabe, S & Xie, L 2021, Multi-Speaker ASR combining non-autoregressive conformer CTC and conditional speaker chain. in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2, International Speech Communication Association, pp. 1401-1405, 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Brno, Czech Republic, 30/08/21. https://doi.org/10.21437/Interspeech.2021-2155

Multi-Speaker ASR combining non-autoregressive conformer CTC and conditional speaker chain. / Guo, Pengcheng; Chang, Xuankai; Watanabe, Shinji et al.
22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association, 2021. pp. 1401-1405 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2).

TY - GEN

T1 - Multi-Speaker ASR combining non-autoregressive conformer CTC and conditional speaker chain

AU - Guo, Pengcheng

AU - Chang, Xuankai

AU - Watanabe, Shinji

AU - Xie, Lei

PY - 2021

Y1 - 2021

AB - Non-autoregressive (NAR) models have achieved a large reduction in inference computation and results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches for sequence-to-multi-sequence problems, such as multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one by one using both the input mixture speech and previously estimated conditional speaker features. In each step, a NAR connectionist temporal classification (CTC) encoder is used to perform parallel computation. With this design, the total number of inference steps is restricted to the number of mixed speakers. In addition, we adopt the Conformer and incorporate an intermediate CTC loss to improve performance. Experiments on the WSJ0-Mix and LibriMix corpora show that our model outperforms other NAR models with only a slight increase in latency, achieving WERs of 22.3% and 24.9%, respectively. Moreover, by including data with variable numbers of speakers, our model can even outperform the PIT-Conformer AR model with only 1/7 of the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets. All of our code is publicly available at https://github.com/pengchengguo/espnet/tree/conditionalmultispk.

KW - Conditional chain model

KW - Multi-speaker speech recognition

KW - Non-autoregressive

UR - http://www.scopus.com/inward/record.url?scp=85119170486&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2021-2155

DO - 10.21437/Interspeech.2021-2155

M3 - Conference contribution

AN - SCOPUS:85119170486

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 1401

EP - 1405

BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

PB - International Speech Communication Association

T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021

Y2 - 30 August 2021 through 3 September 2021

ER -

Guo P, Chang X, Watanabe S, Xie L. Multi-Speaker ASR combining non-autoregressive conformer CTC and conditional speaker chain. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021. International Speech Communication Association. 2021. p. 1401-1405. (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH). doi: 10.21437/Interspeech.2021-2155
