Robust Audio-visual Speech Recognition Using Bimodal DFSMN with Multi-condition Training and Dropout Regularization

Shiliang Zhang, Ming Lei, Bin Ma, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

36 Scopus citations

Abstract

Audio-visual speech recognition (AVSR) is considered one of the potential solutions for robust speech recognition, especially in noisy environments. Compared to audio-only speech recognition, the major issues of AVSR include the lack of publicly available audio-visual corpora and the need for robust fusion of speech and visual information. In this work, based on the recently released NTCD-TIMIT audio-visual corpus, we address the challenges of AVSR in three respects: 1) optimal integration of acoustic and visual information; 2) robust performance with multi-condition training; 3) robust modeling against missing visual information during decoding. We propose a bimodal-DFSMN to jointly learn feature fusion and acoustic modeling, and use a per-frame dropout approach to make the AVSR system more robust to a missing visual modality. In the experiments, we construct two setups based on the NTCD-TIMIT corpus, consisting of 5 hours of clean training data and 150 hours of multi-condition training data, respectively. As a result, we achieve a phone error rate of 12.6% on the clean test set and an average phone error rate of 26.2% over all test sets (clean, various SNRs, various noise types), both of which dramatically improve on the baseline performance for the NTCD-TIMIT task.
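The per-frame dropout idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `per_frame_visual_dropout`, the `drop_prob` parameter, and the choice to zero out dropped frames are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def per_frame_visual_dropout(visual, drop_prob=0.5, rng=None):
    """Zero the visual feature vector of randomly selected frames.

    Hypothetical helper, not taken from the paper.
    visual:    array of shape (num_frames, visual_dim)
    drop_prob: assumed probability of dropping each frame independently
    Returns the masked features and the boolean keep-mask (True = kept).
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng

    # One independent draw per frame; the whole frame is kept or dropped.
    keep = rng.random(visual.shape[0]) >= drop_prob

    # Broadcast the per-frame mask across the feature dimension.
    masked = visual * keep[:, None]
    return masked, keep
```

Under these assumptions, the mask would be applied to the visual stream only during training, so that the model sees frames with and without visual input and is less dependent on the visual modality being present at decoding time.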

Original language: English
Title of host publication: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6570-6574
Number of pages: 5
ISBN (Electronic): 9781479981311
State: Published - May 2019
Event: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 12 May 2019 to 17 May 2019

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2019-May
ISSN (Print): 1520-6149

Conference

Conference: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country/Territory: United Kingdom
City: Brighton
Period: 12/05/19 to 17/05/19

Keywords

  • Audio-visual speech recognition
  • bimodal DF-SMN
  • dropout
  • multi-condition training
  • robust speech recognition
