Deep temporal architecture for audiovisual speech recognition

Chunlin Tian, Yuan Yuan, Xiaoqiang Lu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Audiovisual Speech Recognition (AVSR) is an application of multimodal machine learning related to speech recognition, lipreading systems, and video classification. Recent work has increasingly applied Deep Neural Networks (DNNs) to AVSR, and several DNN models, including the Multimodal Deep Autoencoder, Multimodal Deep Belief Network, and Multimodal Deep Boltzmann Machine, perform well in experiments owing to their better generalization and nonlinear transformations. However, these DNN models have several disadvantages: (1) they mainly address modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connections among frames during modal fusion; (3) they are not end-to-end structures. We propose a deep temporal architecture that incorporates not only classical modal fusion but also temporal modal fusion and temporal fusion. Furthermore, we study overfitting and learning from small sample sizes in AVSR, and propose a set of useful training strategies. Experiments on three datasets (AVLetters, AVLetters2, and AVDigits) show the superiority of our model and the necessity of the training strategies.
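The abstract's distinction between modal fusion (combining audio and video features within each frame) and temporal fusion (carrying information across frames) can be sketched as a toy pipeline. This is an illustrative NumPy sketch under assumed dimensions and random stand-in weights, not the paper's architecture: `encode`, the tanh recurrence, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper).
T, D_AUDIO, D_VIDEO, D_FUSED, N_CLASSES = 8, 26, 40, 32, 10

def encode(x, w):
    """Per-modality nonlinear encoder (a single tanh layer)."""
    return np.tanh(x @ w)

# Random weights stand in for learned parameters.
w_a = rng.normal(0, 0.1, (D_AUDIO, D_FUSED))
w_v = rng.normal(0, 0.1, (D_VIDEO, D_FUSED))
w_h = rng.normal(0, 0.1, (2 * D_FUSED, D_FUSED))  # modal fusion layer
w_r = rng.normal(0, 0.1, (D_FUSED, D_FUSED))      # temporal recurrence
w_o = rng.normal(0, 0.1, (D_FUSED, N_CLASSES))    # classifier head

audio = rng.normal(size=(T, D_AUDIO))  # e.g. per-frame acoustic features
video = rng.normal(size=(T, D_VIDEO))  # e.g. per-frame lip-region features

# 1) Modal fusion: join the two modalities frame by frame.
fused = np.tanh(
    np.concatenate([encode(audio, w_a), encode(video, w_v)], axis=1) @ w_h
)

# 2) Temporal fusion: a simple recurrence carries context across frames.
h = np.zeros(D_FUSED)
for t in range(T):
    h = np.tanh(fused[t] + h @ w_r)

# 3) Classification from the final temporal state (softmax over classes).
logits = h @ w_o
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)
```

Because every stage is a differentiable function of the inputs, the whole chain could in principle be trained end-to-end, which is the property the abstract says earlier DNN fusion models lack.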

Original language: English
Title of host publication: Computer Vision - 2nd CCF Chinese Conference, CCCV 2017, Proceedings
Editors: Xiang Bai, Qinghua Hu, Liang Wang, Qingshan Liu, Jinfeng Yang, Ming-Ming Cheng, Deyu Meng
Publisher: Springer Verlag
Pages: 650-661
Number of pages: 12
ISBN (Print): 9789811072987
DOI
Publication status: Published - 2017
Externally published: Yes
Event: 2nd Chinese Conference on Computer Vision, CCCV 2017 - Tianjin, China
Duration: 11 Oct 2017 → 14 Oct 2017

Publication series

Name: Communications in Computer and Information Science
Volume: 771
ISSN (Print): 1865-0929

Conference

Conference: 2nd Chinese Conference on Computer Vision, CCCV 2017
Country/Territory: China
City: Tianjin
Period: 11/10/17 → 14/10/17
