TY - JOUR
T1 - Adversarial regularization for attention based end-to-end robust speech recognition
AU - Sun, Sining
AU - Guo, Pengcheng
AU - Xie, Lei
AU - Hwang, Mei Yuh
N1 - Publisher Copyright:
© 2019 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
PY - 2019/11
Y1 - 2019/11
AB - End-to-end speech recognition, including attention-based approaches, has become an active and attractive research topic in recent years, and it now achieves performance comparable to that of the traditional speech recognition framework. Because end-to-end approaches integrate acoustic and linguistic information into a single model, perturbations at the acoustic level, such as acoustic noise, can easily propagate to the linguistic level. Improving the robustness of these end-to-end systems in real application environments is therefore crucial. In this paper, to make the attention-based end-to-end model more robust against noise, we regularize the objective function with adversarial training examples. In particular, two adversarial regularization techniques, the fast gradient-sign method and the local distributional smoothness method, are explored to improve noise robustness. Experiments on two publicly available Mandarin Chinese corpora, AISHELL-1 and AISHELL-2, show that adversarial regularization is an effective way to improve the noise robustness of our attention-based models. Specifically, we obtained an 18.4% relative character error rate (CER) reduction on the AISHELL-1 noisy test set. Even on the clean test set, we obtained a 16.7% relative improvement. As the training set grows and covers more environmental variety, our proposed methods remain effective, although the improvement shrinks. Training on the large AISHELL-2 training corpus and testing on the various AISHELL-2 test sets, we achieved a 7.0% to 12.2% relative error rate reduction. To our knowledge, this is the first successful application of adversarial regularization to sequence-to-sequence speech recognition systems.
KW - Adversarial training
KW - Attention
KW - Cross entropy
KW - Listen Attend and Spell
KW - Sequence-to-sequence
KW - Virtual adversarial training
UR - http://www.scopus.com/inward/record.url?scp=85075011136&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2019.2933146
DO - 10.1109/TASLP.2019.2933146
M3 - Article
AN - SCOPUS:85075011136
SN - 2329-9290
VL - 27
SP - 1826
EP - 1838
JO - IEEE/ACM Transactions on Audio Speech and Language Processing
JF - IEEE/ACM Transactions on Audio Speech and Language Processing
IS - 11
M1 - 3370726
ER -