Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR

Fan Yu; Haoneng Luo; Pengcheng Guo; Yuhao Liang; Zhuoyuan Yao; Lei Xie; Yingying Gao; Leijing Hou; Shilei Zhang

doi:10.1109/ASRU51503.2021.9688238

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Scopus citations

Abstract

Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been applied successfully to non-autoregressive (NAR) speech recognition, with competitive performance compared with other NAR methods. However, such an alignment learning strategy may suffer from erroneous acoustic boundary estimation, which severely hinders both convergence speed and system performance. In this paper, we propose a boundary and context aware training approach for CIF-based NAR models. First, connectionist temporal classification (CTC) spike information is used to guide the learning of acoustic boundaries in the CIF. In addition, a contextual decoder is introduced after the CIF decoder to capture the linguistic dependencies within a sentence. Finally, we adopt the recently proposed Conformer architecture to improve the capacity of acoustic modeling. Experiments on the open-source Mandarin AISHELL-1 corpus show that the proposed method achieves a comparable character error rate (CER) of 4.9% with only 1/24 of the latency of a state-of-the-art autoregressive (AR) Conformer model. Furthermore, when evaluated on an internal 7,500-hour Mandarin corpus, our model still outperforms other NAR methods and even matches the AR Conformer model on a challenging real-world noisy test set.
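As a rough illustration of the soft, monotonic alignment that the abstract refers to, the following is a minimal sketch of a generic CIF step in plain Python. It is not the authors' implementation: the function name `cif`, the firing threshold of 1.0, and the toy inputs are assumptions made for illustration, and the sketch omits the CTC-guided boundary learning and the contextual decoder described above.

```python
def cif(frames, weights, threshold=1.0):
    """Generic continuous integrate-and-fire sketch (illustrative only).

    frames:  list of encoder frame vectors (lists of floats).
    weights: one non-negative weight per frame, assumed <= threshold,
             so a frame can trigger at most one firing.
    Returns one integrated vector per firing, in monotonic order.
    """
    dim = len(frames[0])
    outputs = []
    acc_weight = 0.0          # weight accumulated since the last firing
    acc_vec = [0.0] * dim     # weighted sum of frames since the last firing

    for frame, w in zip(frames, weights):
        if acc_weight + w < threshold:
            # No boundary yet: keep integrating this frame.
            acc_weight += w
            acc_vec = [a + w * x for a, x in zip(acc_vec, frame)]
        else:
            # Boundary reached: spend only the weight needed to hit the
            # threshold, emit the integrated vector, and carry the rest
            # of this frame's weight over to the next output unit.
            used = threshold - acc_weight
            outputs.append([a + used * x for a, x in zip(acc_vec, frame)])
            rest = w - used
            acc_weight = rest
            acc_vec = [rest * x for x in frame]
    return outputs


# Toy input: four 1-dimensional frames whose weights sum to 2.0,
# so the sketch fires twice.
print(cif([[1.0], [3.0], [2.0], [4.0]], [0.5, 0.5, 0.25, 0.75]))  # [[2.0], [3.5]]
```

Because each output is emitted as soon as the accumulated weight crosses the threshold, the number of outputs tracks the total weight, which is why a poorly estimated boundary weight directly affects the predicted output length.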

Original language	English
Title of host publication	2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	328-334
Number of pages	7
ISBN (Electronic)	9781665437394
DOIs	https://doi.org/10.1109/ASRU51503.2021.9688238
State	Published - 2021
Event	2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Cartagena, Colombia; Duration: 13 Dec 2021 → 17 Dec 2021

Publication series

Name	2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings

Conference

Conference	2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021
Country/Territory	Colombia
City	Cartagena
Period	13/12/21 → 17/12/21

Keywords

continuous integrate-and-fire
end-to-end speech recognition
Non-autoregressive

Access to Document

10.1109/ASRU51503.2021.9688238

Cite this

Yu, F., Luo, H., Guo, P., Liang, Y., Yao, Z., Xie, L., Gao, Y., Hou, L., & Zhang, S. (2021). Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings (pp. 328-334). (2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU51503.2021.9688238

Yu, Fan ; Luo, Haoneng ; Guo, Pengcheng et al. / Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR. 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2021. pp. 328-334 (2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings).

@inproceedings{9144ccfc09a04fc2befb28943070d0ef,

title = "Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR",

abstract = "Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been applied successfully to non-autoregressive (NAR) speech recognition, with competitive performance compared with other NAR methods. However, such an alignment learning strategy may suffer from erroneous acoustic boundary estimation, which severely hinders both convergence speed and system performance. In this paper, we propose a boundary and context aware training approach for CIF-based NAR models. First, connectionist temporal classification (CTC) spike information is used to guide the learning of acoustic boundaries in the CIF. In addition, a contextual decoder is introduced after the CIF decoder to capture the linguistic dependencies within a sentence. Finally, we adopt the recently proposed Conformer architecture to improve the capacity of acoustic modeling. Experiments on the open-source Mandarin AISHELL-1 corpus show that the proposed method achieves a comparable character error rate (CER) of 4.9% with only 1/24 of the latency of a state-of-the-art autoregressive (AR) Conformer model. Furthermore, when evaluated on an internal 7,500-hour Mandarin corpus, our model still outperforms other NAR methods and even matches the AR Conformer model on a challenging real-world noisy test set.",

keywords = "continuous integrate-and-fire, end-to-end speech recognition, Non-autoregressive",

author = "Fan Yu and Haoneng Luo and Pengcheng Guo and Yuhao Liang and Zhuoyuan Yao and Lei Xie and Yingying Gao and Leijing Hou and Shilei Zhang",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE.; 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 ; Conference date: 13-12-2021 Through 17-12-2021",

year = "2021",

doi = "10.1109/ASRU51503.2021.9688238",

language = "English",

series = "2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "328--334",

booktitle = "2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings",

}

Yu, F, Luo, H, Guo, P, Liang, Y, Yao, Z, Xie, L, Gao, Y, Hou, L & Zhang, S 2021, Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR. in 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings. 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 328-334, 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, 13/12/21. https://doi.org/10.1109/ASRU51503.2021.9688238

Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR. / Yu, Fan; Luo, Haoneng; Guo, Pengcheng et al.
2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2021. p. 328-334 (2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings).

TY - GEN

T1 - Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR

AU - Yu, Fan

AU - Luo, Haoneng

AU - Guo, Pengcheng

AU - Liang, Yuhao

AU - Yao, Zhuoyuan

AU - Xie, Lei

AU - Gao, Yingying

AU - Hou, Leijing

AU - Zhang, Shilei

PY - 2021

Y1 - 2021

AB - Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been applied successfully to non-autoregressive (NAR) speech recognition, with competitive performance compared with other NAR methods. However, such an alignment learning strategy may suffer from erroneous acoustic boundary estimation, which severely hinders both convergence speed and system performance. In this paper, we propose a boundary and context aware training approach for CIF-based NAR models. First, connectionist temporal classification (CTC) spike information is used to guide the learning of acoustic boundaries in the CIF. In addition, a contextual decoder is introduced after the CIF decoder to capture the linguistic dependencies within a sentence. Finally, we adopt the recently proposed Conformer architecture to improve the capacity of acoustic modeling. Experiments on the open-source Mandarin AISHELL-1 corpus show that the proposed method achieves a comparable character error rate (CER) of 4.9% with only 1/24 of the latency of a state-of-the-art autoregressive (AR) Conformer model. Furthermore, when evaluated on an internal 7,500-hour Mandarin corpus, our model still outperforms other NAR methods and even matches the AR Conformer model on a challenging real-world noisy test set.

KW - continuous integrate-and-fire

KW - end-to-end speech recognition

KW - Non-autoregressive

UR - http://www.scopus.com/inward/record.url?scp=85126791354&partnerID=8YFLogxK

U2 - 10.1109/ASRU51503.2021.9688238

DO - 10.1109/ASRU51503.2021.9688238

M3 - Conference contribution

AN - SCOPUS:85126791354

T3 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings

SP - 328

EP - 334

BT - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021

Y2 - 13 December 2021 through 17 December 2021

ER -

Yu F, Luo H, Guo P, Liang Y, Yao Z, Xie L et al. Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2021. p. 328-334. (2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings). doi: 10.1109/ASRU51503.2021.9688238
