Deep audio-visual system for closed-set word-level speech recognition

Yougen Yuan, Wei Tang, Minhao Fan, Yue Cao, Peng Zhang, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Scopus citations

Abstract

Audio-visual understanding is usually challenged by the difficulty of bridging complementary audio and visual information. Motivated by recent audio-visual studies, this study proposes a closed-set word-level speech recognition scheme for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge. To initialize the respective audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained. With two fully connected layers added on top of the concatenated encoder outputs for audio-visual joint training, the proposed scheme won first place with a relative word accuracy improvement of 7.9% over the audio-only system. Experiments on the LRW-1000 dataset demonstrate that the proposed joint training scheme improves recognition performance on relatively short-duration samples, revealing the complementarity of the two modalities.
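The fusion step described in the abstract (concatenating the encoder outputs and passing them through two fully connected layers) can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding sizes, hidden width, number of word classes, ReLU activation, softmax output, and random weights below are all assumptions chosen for a toy example, and random vectors stand in for the outputs of the trained audio and visual encoders.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU (an assumed activation; the abstract does not name one)."""
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of scores."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Random weights and zero bias; a placeholder for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(audio_emb, visual_emb, fc1, fc2):
    """Concatenate the two encoder outputs, then apply two fully connected layers."""
    joint = audio_emb + visual_emb  # list concatenation of the two embeddings
    hidden = relu(linear(joint, *fc1))
    return softmax(linear(hidden, *fc2))


# Toy sizes (assumptions, far smaller than a real closed-set vocabulary).
AUDIO_DIM, VISUAL_DIM, HIDDEN_DIM, NUM_WORDS = 8, 8, 16, 10

rng = random.Random(0)
fc1 = init_layer(rng, HIDDEN_DIM, AUDIO_DIM + VISUAL_DIM)
fc2 = init_layer(rng, NUM_WORDS, HIDDEN_DIM)

# Random vectors standing in for the audio and visual encoder outputs.
audio_emb = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
visual_emb = [rng.gauss(0, 1) for _ in range(VISUAL_DIM)]

probs = fuse_and_classify(audio_emb, visual_emb, fc1, fc2)
predicted_word = max(range(NUM_WORDS), key=probs.__getitem__)
```

In a closed-set setting the output layer has one unit per word in the fixed vocabulary, so `predicted_word` is simply the index of the most probable class.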

Original language: English
Title of host publication: ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction
Editors: Wen Gao, Helen Mei Ling Meng, Matthew Turk, Susan R. Fussell, Bjorn Schuller, Yale Song, Kai Yu
Publisher: Association for Computing Machinery, Inc
Pages: 540-545
Number of pages: 6
ISBN (Electronic): 9781450368605
DOIs
State: Published - 14 Oct 2019
Event: 21st ACM International Conference on Multimodal Interaction, ICMI 2019 - Suzhou, China
Duration: 14 Oct 2019 to 18 Oct 2019

Publication series

Name: ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction

Conference

Conference: 21st ACM International Conference on Multimodal Interaction, ICMI 2019
Country/Territory: China
City: Suzhou
Period: 14/10/19 to 18/10/19

Keywords

  • Audio-visual
  • Convolutional neural network
  • Long short-term memory
  • Multi-modal
