Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition

Yuan Yuan, Chunlin Tian, Xiaoqiang Lu

Research output: Contribution to journal › Article › peer-review

37 Citations (Scopus)

Abstract

Audio-visual speech recognition (AVSR) uses both the audio and video modalities for robust automatic speech recognition. Most deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalized, nonlinear mapping ability. However, these DNN models have two main disadvantages: 1) most of them address the AVSR problem while neglecting the fact that the frames are correlated; and 2) the features they learn are not reliable. This is because the joint representation learned by fusion fails to consider category-specific information, and the discriminative information is sparse, while noise, reverberation, irrelevant image objects, and background are redundant. To relieve these disadvantages, we propose the auxiliary loss multimodal GRU (alm-GRU) model, which comprises three parts: feature extraction, data augmentation, and fusion recognition. Feature extraction and data augmentation together form a complete and effective solution for processing raw video and for training, and they are a precondition for the core part that follows: fusion recognition using alm-GRU with a novel loss, an end-to-end network that combines fusion and recognition while also considering modal and temporal information. Experiments on the benchmark data sets show the superiority of our model and the necessity of the data augmentation and the generative component.
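The abstract does not state the form of the novel loss. As a rough illustration of the general auxiliary-loss idea only, and not the authors' formulation, the sketch below adds a weighted per-modality cross-entropy term to the cross-entropy of the fused prediction. The function names, the `aux_weight` parameter, and the choice of cross-entropy are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-likelihood of the true class index `label`."""
    return -math.log(softmax(logits)[label])


def combined_loss(fused_logits, audio_logits, video_logits, label, aux_weight=0.5):
    """Main loss on the fused prediction plus weighted auxiliary losses.

    Hypothetical formulation: each modality gets its own classification
    head, and its loss is added with weight `aux_weight` (an assumed
    hyperparameter, not taken from the paper).
    """
    main = cross_entropy(fused_logits, label)
    aux = cross_entropy(audio_logits, label) + cross_entropy(video_logits, label)
    return main + aux_weight * aux
```

In a scheme like this, the auxiliary terms push each modality's representation to be discriminative on its own, which is one common way to address a fused representation that ignores category-specific information.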

Original language: English
Pages (from-to): 5573-5583
Number of pages: 11
Journal: IEEE Access
Volume: 6
Publication status: Published - 1 Feb 2018
Externally published: Yes
