Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition

Yuan Yuan; Chunlin Tian; Xiaoqiang Lu

doi:10.1109/ACCESS.2018.2796118

Research output: Contribution to journal › Article › peer-review

37 Scopus citations

Abstract

Audio-visual speech recognition (AVSR) uses both audio and video modalities for robust automatic speech recognition. Deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalization and nonlinear mapping ability. However, these DNN models have two main disadvantages: 1) most models address AVSR while neglecting the fact that the frames are correlated; and 2) the features learned by these models are not reliable. This is because the joint representation learned by fusion fails to consider category-specific information, and the discriminative information is sparse, while noise, reverberation, irrelevant image objects, and background are redundant. To relieve these disadvantages, we propose the auxiliary loss multimodal GRU (alm-GRU) model, which comprises three parts: feature extraction, data augmentation, and fusion recognition. Feature extraction and data augmentation form a complete, effective solution for processing raw video and for training, and are a precondition for the core part: fusion recognition using alm-GRU equipped with a novel loss, an end-to-end network that combines fusion and recognition while also considering modal and temporal information. Experiments on benchmark data sets show the superiority of our model and the necessity of the data augmentation and generative component.

Original language	English
Pages (from-to)	5573-5583
Number of pages	11
Journal	IEEE Access
Volume	6
DOIs	https://doi.org/10.1109/ACCESS.2018.2796118
State	Published - 1 Feb 2018
Externally published	Yes

Keywords

Audio-visual systems
generative adversarial networks
recurrent neural networks

Cite this

@article{13a96b620ec54bd59f5ca998722afd6e,

title = "Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition",

abstract = "Audio-visual speech recognition (AVSR) uses both audio and video modalities for robust automatic speech recognition. Deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalization and nonlinear mapping ability. However, these DNN models have two main disadvantages: 1) most models address AVSR while neglecting the fact that the frames are correlated; and 2) the features learned by these models are not reliable. This is because the joint representation learned by fusion fails to consider category-specific information, and the discriminative information is sparse, while noise, reverberation, irrelevant image objects, and background are redundant. To relieve these disadvantages, we propose the auxiliary loss multimodal GRU (alm-GRU) model, which comprises three parts: feature extraction, data augmentation, and fusion recognition. Feature extraction and data augmentation form a complete, effective solution for processing raw video and for training, and are a precondition for the core part: fusion recognition using alm-GRU equipped with a novel loss, an end-to-end network that combines fusion and recognition while also considering modal and temporal information. Experiments on benchmark data sets show the superiority of our model and the necessity of the data augmentation and generative component.",

keywords = "Audio-visual systems, generative adversarial networks, recurrent neural networks",

author = "Yuan Yuan and Chunlin Tian and Xiaoqiang Lu",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2018",

month = feb,

day = "1",

doi = "10.1109/ACCESS.2018.2796118",

language = "English",

volume = "6",

pages = "5573--5583",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition

AU - Yuan, Yuan

AU - Tian, Chunlin

AU - Lu, Xiaoqiang

PY - 2018/2/1

Y1 - 2018/2/1

N2 - Audio-visual speech recognition (AVSR) uses both audio and video modalities for robust automatic speech recognition. Deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalization and nonlinear mapping ability. However, these DNN models have two main disadvantages: 1) most models address AVSR while neglecting the fact that the frames are correlated; and 2) the features learned by these models are not reliable. This is because the joint representation learned by fusion fails to consider category-specific information, and the discriminative information is sparse, while noise, reverberation, irrelevant image objects, and background are redundant. To relieve these disadvantages, we propose the auxiliary loss multimodal GRU (alm-GRU) model, which comprises three parts: feature extraction, data augmentation, and fusion recognition. Feature extraction and data augmentation form a complete, effective solution for processing raw video and for training, and are a precondition for the core part: fusion recognition using alm-GRU equipped with a novel loss, an end-to-end network that combines fusion and recognition while also considering modal and temporal information. Experiments on benchmark data sets show the superiority of our model and the necessity of the data augmentation and generative component.

AB - Audio-visual speech recognition (AVSR) uses both audio and video modalities for robust automatic speech recognition. Deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalization and nonlinear mapping ability. However, these DNN models have two main disadvantages: 1) most models address AVSR while neglecting the fact that the frames are correlated; and 2) the features learned by these models are not reliable. This is because the joint representation learned by fusion fails to consider category-specific information, and the discriminative information is sparse, while noise, reverberation, irrelevant image objects, and background are redundant. To relieve these disadvantages, we propose the auxiliary loss multimodal GRU (alm-GRU) model, which comprises three parts: feature extraction, data augmentation, and fusion recognition. Feature extraction and data augmentation form a complete, effective solution for processing raw video and for training, and are a precondition for the core part: fusion recognition using alm-GRU equipped with a novel loss, an end-to-end network that combines fusion and recognition while also considering modal and temporal information. Experiments on benchmark data sets show the superiority of our model and the necessity of the data augmentation and generative component.

KW - Audio-visual systems

KW - generative adversarial networks

KW - recurrent neural networks

UR - http://www.scopus.com/inward/record.url?scp=85041652303&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2018.2796118

DO - 10.1109/ACCESS.2018.2796118

M3 - Article

AN - SCOPUS:85041652303

SN - 2169-3536

VL - 6

SP - 5573

EP - 5583

JO - IEEE Access

JF - IEEE Access

ER -
