TY - GEN
T1 - Exploring RNN-Transducer for Chinese speech recognition
AU - Wang, Senmao
AU - Zhou, Pan
AU - Chen, Wei
AU - Jia, Jia
AU - Xie, Lei
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - End-to-end approaches have drawn much attention recently for significantly simplifying the construction of an automatic speech recognition (ASR) system. RNN transducer (RNN-T) is one of the popular end-to-end methods. Previous studies have shown that RNN-T is difficult to train and that a very complex training process is needed to reach reasonable performance. In this paper, we explore RNN-T for a Chinese large vocabulary continuous speech recognition (LVCSR) task and aim to simplify the training process while maintaining performance. First, a new strategy of learning rate decay is proposed to accelerate model convergence. Second, we find that adding convolutional layers at the beginning of the network and using ordered data makes it possible to discard the pre-training process of the encoder without loss of performance. In addition, we design experiments to find a balance among GPU memory usage, training cycle, and model performance. Finally, we achieve a 16.9% character error rate (CER) on our test set, which is a 2% absolute improvement over a strong BLSTM CE system with a language model trained on the same text corpus.
AB - End-to-end approaches have drawn much attention recently for significantly simplifying the construction of an automatic speech recognition (ASR) system. RNN transducer (RNN-T) is one of the popular end-to-end methods. Previous studies have shown that RNN-T is difficult to train and that a very complex training process is needed to reach reasonable performance. In this paper, we explore RNN-T for a Chinese large vocabulary continuous speech recognition (LVCSR) task and aim to simplify the training process while maintaining performance. First, a new strategy of learning rate decay is proposed to accelerate model convergence. Second, we find that adding convolutional layers at the beginning of the network and using ordered data makes it possible to discard the pre-training process of the encoder without loss of performance. In addition, we design experiments to find a balance among GPU memory usage, training cycle, and model performance. Finally, we achieve a 16.9% character error rate (CER) on our test set, which is a 2% absolute improvement over a strong BLSTM CE system with a language model trained on the same text corpus.
KW - Automatic speech recognition
KW - End-to-end speech recognition
KW - RNN-Transducer
UR - https://www.scopus.com/pages/publications/85082389003
U2 - 10.1109/APSIPAASC47483.2019.9023133
DO - 10.1109/APSIPAASC47483.2019.9023133
M3 - Conference contribution
AN - SCOPUS:85082389003
T3 - 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019
SP - 1364
EP - 1369
BT - 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019
Y2 - 18 November 2019 through 21 November 2019
ER -