TY - GEN
T1 - Deep temporal architecture for audiovisual speech recognition
AU - Tian, Chunlin
AU - Yuan, Yuan
AU - Lu, Xiaoqiang
N1 - Publisher Copyright:
© Springer Nature Singapore Pte Ltd. 2017.
PY - 2017
Y1 - 2017
N2 - Audiovisual Speech Recognition (AVSR) is an application of multimodal machine learning related to speech recognition, lipreading systems, and video classification. Recent work has devoted increasing effort to Deep Neural Networks (DNNs) for AVSR, and DNN models such as the Multimodal Deep Autoencoder, the Multimodal Deep Belief Network, and the Multimodal Deep Boltzmann Machine perform well in experiments owing to their stronger generalization and nonlinear transformations. However, these DNN models have several disadvantages: (1) they mainly address modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connections among frames during modal fusion; (3) these models are not end-to-end structures. We propose a deep temporal architecture that performs not only classical modal fusion but also temporal modal fusion and temporal fusion. Furthermore, we study overfitting and learning from small sample sizes in AVSR and propose a set of useful training strategies. Experiments on three datasets (AVLetters, AVLetters2, and AVDigits) demonstrate the superiority of our model and the necessity of the training strategies.
AB - Audiovisual Speech Recognition (AVSR) is an application of multimodal machine learning related to speech recognition, lipreading systems, and video classification. Recent work has devoted increasing effort to Deep Neural Networks (DNNs) for AVSR, and DNN models such as the Multimodal Deep Autoencoder, the Multimodal Deep Belief Network, and the Multimodal Deep Boltzmann Machine perform well in experiments owing to their stronger generalization and nonlinear transformations. However, these DNN models have several disadvantages: (1) they mainly address modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connections among frames during modal fusion; (3) these models are not end-to-end structures. We propose a deep temporal architecture that performs not only classical modal fusion but also temporal modal fusion and temporal fusion. Furthermore, we study overfitting and learning from small sample sizes in AVSR and propose a set of useful training strategies. Experiments on three datasets (AVLetters, AVLetters2, and AVDigits) demonstrate the superiority of our model and the necessity of the training strategies.
KW - Audiovisual speech recognition
KW - Multimodal deep learning
UR - http://www.scopus.com/inward/record.url?scp=85037845041&partnerID=8YFLogxK
U2 - 10.1007/978-981-10-7299-4_54
DO - 10.1007/978-981-10-7299-4_54
M3 - Conference contribution
AN - SCOPUS:85037845041
SN - 9789811072987
T3 - Communications in Computer and Information Science
SP - 650
EP - 661
BT - Computer Vision - 2nd CCF Chinese Conference, CCCV 2017, Proceedings
A2 - Bai, Xiang
A2 - Hu, Qinghua
A2 - Wang, Liang
A2 - Liu, Qingshan
A2 - Yang, Jinfeng
A2 - Cheng, Ming-Ming
A2 - Meng, Deyu
PB - Springer Verlag
T2 - 2nd Chinese Conference on Computer Vision, CCCV 2017
Y2 - 11 October 2017 through 14 October 2017
ER -