Deep audio-visual system for closed-set word-level speech recognition

Yougen Yuan, Wei Tang, Minhao Fan, Yue Cao, Peng Zhang, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Audio-visual understanding is often challenged by the gap in bridging complementary audio and visual information. Motivated by recent audio-visual studies, a closed-set word-level speech recognition scheme is proposed for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge in this study. To initialize the audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained for each modality. With two fully connected layers on top of the concatenated encoder outputs for audio-visual joint training, the proposed scheme won first place with a relative word-accuracy improvement of 7.9% over the audio-only system. Experiments on the LRW-1000 dataset demonstrate that the proposed joint training scheme, by incorporating both modalities, is particularly effective at improving recognition of short-duration samples, revealing the audio-visual complementarity.
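The fusion scheme described in the abstract (modality-specific encoders, concatenated encoder outputs, two fully connected layers producing word posteriors) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all layer sizes, feature dimensions, and the simple additive attention pooling are assumptions; the paper does not specify them here.

```python
import torch
import torch.nn as nn

class AVWordClassifier(nn.Module):
    """Sketch of the abstract's fusion scheme: a 3D-CNN visual front-end,
    Bi-LSTM encoders per modality, attention pooling over time, then two
    fully connected layers on the concatenated encoder outputs."""

    def __init__(self, num_words=1000, hidden=256):
        super().__init__()
        # Visual front-end: 3D CNN over grayscale lip-region frames (C, T, H, W).
        self.visual_cnn = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space
        )
        # Bi-LSTM encoders; 40-dim audio features (e.g. filterbanks) assumed.
        self.visual_lstm = nn.LSTM(32, hidden, bidirectional=True, batch_first=True)
        self.audio_lstm = nn.LSTM(40, hidden, bidirectional=True, batch_first=True)
        # Additive attention to pool each encoder's outputs over time.
        self.att_v = nn.Linear(2 * hidden, 1)
        self.att_a = nn.Linear(2 * hidden, 1)
        # Two fully connected layers on the concatenated encoder outputs.
        self.fusion = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_words),
        )

    def attend(self, h, att):
        w = torch.softmax(att(h), dim=1)   # (B, T, 1) attention weights
        return (w * h).sum(dim=1)          # (B, 2*hidden) pooled encoding

    def forward(self, video, audio):
        # video: (B, 1, T, H, W) lip frames; audio: (B, T', 40) features
        v = self.visual_cnn(video)                     # (B, 32, T, 1, 1)
        v = v.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        hv, _ = self.visual_lstm(v)
        ha, _ = self.audio_lstm(audio)
        fused = torch.cat([self.attend(hv, self.att_v),
                           self.attend(ha, self.att_a)], dim=-1)
        return self.fusion(fused)          # (B, num_words) word logits

# Closed-set word classification: 29 video frames, 100 audio frames.
logits = AVWordClassifier()(torch.randn(2, 1, 29, 48, 48),
                            torch.randn(2, 100, 40))
print(tuple(logits.shape))  # (2, 1000)
```

Joint training would optimize a cross-entropy loss over the closed word set; the per-modality encoders can first be pretrained on single-modality recognition, matching the initialization strategy the abstract describes.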

Original language: English
Title of host publication: ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction
Editors: Wen Gao, Helen Mei Ling Meng, Matthew Turk, Susan R. Fussell, Bjorn Schuller, Yale Song, Kai Yu
Publisher: Association for Computing Machinery, Inc
Pages: 540-545
Number of pages: 6
ISBN (Electronic): 9781450368605
DOI
Publication status: Published - 14 Oct 2019
Event: 21st ACM International Conference on Multimodal Interaction, ICMI 2019 - Suzhou, China
Duration: 14 Oct 2019 - 18 Oct 2019

Publication series

Name: ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction

Conference

Conference: 21st ACM International Conference on Multimodal Interaction, ICMI 2019
Country/Territory: China
City: Suzhou
Period: 14/10/19 - 18/10/19
