Deep temporal architecture for audiovisual speech recognition

Chunlin Tian, Yuan Yuan, Xiaoqiang Lu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Scopus citations

Abstract

Audiovisual Speech Recognition (AVSR) is an application of multimodal machine learning related to speech recognition, lipreading, and video classification. Recent work has increasingly applied Deep Neural Networks (DNNs) to AVSR, and models such as the Multimodal Deep Autoencoder, Multimodal Deep Belief Network, and Multimodal Deep Boltzmann Machine perform well in experiments owing to their better generalization and nonlinear transformation. However, these DNN models have several disadvantages: (1) they mainly address modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connections among frames during modal fusion; (3) the models are not end-to-end structures. We propose a deep temporal architecture that provides not only classical modal fusion but also temporal modal fusion and temporal fusion. We also study overfitting and learning from small samples in AVSR, and propose a set of useful training strategies. Experiments on three datasets (AVLetters, AVLetters2, and AVDigits) show the superiority of our model and the necessity of the training strategies. Finally, we conclude the work.

Original language: English
Title of host publication: Computer Vision - 2nd CCF Chinese Conference, CCCV 2017, Proceedings
Editors: Xiang Bai, Qinghua Hu, Liang Wang, Qingshan Liu, Jinfeng Yang, Ming-Ming Cheng, Deyu Meng
Publisher: Springer Verlag
Pages: 650-661
Number of pages: 12
ISBN (Print): 9789811072987
State: Published - 2017
Externally published: Yes
Event: 2nd Chinese Conference on Computer Vision, CCCV 2017 - Tianjin, China
Duration: 11 Oct 2017 to 14 Oct 2017

Publication series

Name: Communications in Computer and Information Science
Volume: 771
ISSN (Print): 1865-0929

Conference

Conference: 2nd Chinese Conference on Computer Vision, CCCV 2017
Country/Territory: China
City: Tianjin
Period: 11/10/17 to 14/10/17

Keywords

  • Audiovisual speech recognition
  • Multimodal deep learning
