Deep audio-visual system for closed-set word-level speech recognition

Yougen Yuan; Wei Tang; Minhao Fan; Yue Cao; Peng Zhang; Lei Xie

doi:10.1145/3340555.3356102

School of Computer Science

Northwestern Polytechnical University Xian

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Audio-visual understanding is often limited by the difficulty of bridging the complementary information carried by the audio and visual streams. Motivated by recent audio-visual studies, this work proposes a closed-set word-level speech recognition scheme for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge. To initialize the audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained. For audio-visual joint training, two fully connected layers are added on top of the concatenated encoder outputs; the resulting scheme won first place, with a relative word accuracy improvement of 7.9% over the audio-only system. Experiments on the LRW-1000 dataset show that joint audio-visual training improves recognition of relatively short-duration samples, which points to the complementarity of the two modalities.
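The fusion step described in the abstract (concatenated encoder outputs followed by two fully connected layers) can be sketched as below. This is a minimal illustration only, not the authors' implementation: the embedding sizes, the hidden width, the ReLU activation, the softmax output, the random weights, and every function name are assumptions made for the example, and the two encoders are stood in for by random vectors.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU (the activation is an assumption)."""
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over the closed-set vocabulary."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero bias for an n_in -> n_out layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(audio_emb, visual_emb, fc1, fc2):
    """Concatenate the two encoder outputs, then apply two FC layers."""
    joint = list(audio_emb) + list(visual_emb)  # concatenation fusion
    hidden = relu(linear(joint, *fc1))          # first fully connected layer
    return softmax(linear(hidden, *fc2))        # second layer -> word posteriors


# Hypothetical sizes, chosen only to keep the example small.
AUDIO_DIM, VISUAL_DIM, HIDDEN_DIM, NUM_WORDS = 8, 8, 16, 5

rng = random.Random(0)
fc1 = init_layer(rng, AUDIO_DIM + VISUAL_DIM, HIDDEN_DIM)
fc2 = init_layer(rng, HIDDEN_DIM, NUM_WORDS)

# Stand-ins for the outputs of the pre-trained audio and visual encoders.
audio_emb = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
visual_emb = [rng.gauss(0, 1) for _ in range(VISUAL_DIM)]

probs = fuse_and_classify(audio_emb, visual_emb, fc1, fc2)
predicted = max(range(NUM_WORDS), key=probs.__getitem__)
print(f"predicted word index: {predicted}")
```

In a real system the two embeddings would come from the trained encoders and the fusion layers would be learned jointly; the sketch shows only the data flow through the fusion head.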

Original language	English
Title of host publication	ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction
Editors	Wen Gao, Helen Mei Ling Meng, Matthew Turk, Susan R. Fussell, Bjorn Schuller, Yale Song, Kai Yu
Publisher	Association for Computing Machinery, Inc
Pages	540-545
Number of pages	6
ISBN (Electronic)	9781450368605
DOIs	https://doi.org/10.1145/3340555.3356102
State	Published - 14 Oct 2019
Event	21st ACM International Conference on Multimodal Interaction, ICMI 2019 - Suzhou, China Duration: 14 Oct 2019 → 18 Oct 2019

Publication series

Name	ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction

Conference

Conference	21st ACM International Conference on Multimodal Interaction, ICMI 2019
Country/Territory	China
City	Suzhou
Period	14/10/19 → 18/10/19

Keywords

Audio-visual
Convolutional neural network
Long short-term memory
Multi-model

Access to Document

10.1145/3340555.3356102

Cite this

Yuan, Y., Tang, W., Fan, M., Cao, Y., Zhang, P., & Xie, L. (2019). Deep audio-visual system for closed-set word-level speech recognition. In W. Gao, H. M. L. Meng, M. Turk, S. R. Fussell, B. Schuller, Y. Song, & K. Yu (Eds.), ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction (pp. 540-545). (ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction). Association for Computing Machinery, Inc. https://doi.org/10.1145/3340555.3356102

Yuan, Yougen ; Tang, Wei ; Fan, Minhao et al. / Deep audio-visual system for closed-set word-level speech recognition. ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction. editor / Wen Gao ; Helen Mei Ling Meng ; Matthew Turk ; Susan R. Fussell ; Bjorn Schuller ; Yale Song ; Kai Yu. Association for Computing Machinery, Inc, 2019. pp. 540-545 (ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction).

@inproceedings{92c8ea8d35484bc49f762c4b750c629f,

title = "Deep audio-visual system for closed-set word-level speech recognition",

abstract = "Audio-visual understanding is often limited by the difficulty of bridging the complementary information carried by the audio and visual streams. Motivated by recent audio-visual studies, this work proposes a closed-set word-level speech recognition scheme for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge. To initialize the audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained. For audio-visual joint training, two fully connected layers are added on top of the concatenated encoder outputs; the resulting scheme won first place, with a relative word accuracy improvement of 7.9\% over the audio-only system. Experiments on the LRW-1000 dataset show that joint audio-visual training improves recognition of relatively short-duration samples, which points to the complementarity of the two modalities.",

keywords = "Audio-visual, Convolutional neural network, Long short-term memory, Multi-model",

author = "Yougen Yuan and Wei Tang and Minhao Fan and Yue Cao and Peng Zhang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2019 Association for Computing Machinery.; 21st ACM International Conference on Multimodal Interaction, ICMI 2019 ; Conference date: 14-10-2019 Through 18-10-2019",

year = "2019",

month = oct,

day = "14",

doi = "10.1145/3340555.3356102",

language = "English",

series = "ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction",

publisher = "Association for Computing Machinery, Inc",

pages = "540--545",

editor = "Wen Gao and Meng, {Helen Mei Ling} and Matthew Turk and Fussell, {Susan R.} and Bjorn Schuller and Yale Song and Kai Yu",

booktitle = "ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction",

}

Yuan, Y, Tang, W, Fan, M, Cao, Y, Zhang, P & Xie, L 2019, Deep audio-visual system for closed-set word-level speech recognition. in W Gao, HML Meng, M Turk, SR Fussell, B Schuller, Y Song & K Yu (eds), ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction. ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction, Association for Computing Machinery, Inc, pp. 540-545, 21st ACM International Conference on Multimodal Interaction, ICMI 2019, Suzhou, China, 14/10/19. https://doi.org/10.1145/3340555.3356102

Deep audio-visual system for closed-set word-level speech recognition. / Yuan, Yougen; Tang, Wei; Fan, Minhao et al.
ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction. ed. / Wen Gao; Helen Mei Ling Meng; Matthew Turk; Susan R. Fussell; Bjorn Schuller; Yale Song; Kai Yu. Association for Computing Machinery, Inc, 2019. p. 540-545 (ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction).

TY - GEN

T1 - Deep audio-visual system for closed-set word-level speech recognition

AU - Yuan, Yougen

AU - Tang, Wei

AU - Fan, Minhao

AU - Cao, Yue

AU - Zhang, Peng

AU - Xie, Lei

PY - 2019/10/14

Y1 - 2019/10/14

N2 - Audio-visual understanding is often limited by the difficulty of bridging the complementary information carried by the audio and visual streams. Motivated by recent audio-visual studies, this work proposes a closed-set word-level speech recognition scheme for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge. To initialize the audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained. For audio-visual joint training, two fully connected layers are added on top of the concatenated encoder outputs; the resulting scheme won first place, with a relative word accuracy improvement of 7.9% over the audio-only system. Experiments on the LRW-1000 dataset show that joint audio-visual training improves recognition of relatively short-duration samples, which points to the complementarity of the two modalities.

AB - Audio-visual understanding is often limited by the difficulty of bridging the complementary information carried by the audio and visual streams. Motivated by recent audio-visual studies, this work proposes a closed-set word-level speech recognition scheme for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge. To initialize the audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained. For audio-visual joint training, two fully connected layers are added on top of the concatenated encoder outputs; the resulting scheme won first place, with a relative word accuracy improvement of 7.9% over the audio-only system. Experiments on the LRW-1000 dataset show that joint audio-visual training improves recognition of relatively short-duration samples, which points to the complementarity of the two modalities.

KW - Audio-visual

KW - Convolutional neural network

KW - Long short-term memory

KW - Multi-model

UR - http://www.scopus.com/inward/record.url?scp=85074923548&partnerID=8YFLogxK

U2 - 10.1145/3340555.3356102

DO - 10.1145/3340555.3356102

M3 - Conference contribution

AN - SCOPUS:85074923548

T3 - ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction

SP - 540

EP - 545

BT - ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction

A2 - Gao, Wen

A2 - Meng, Helen Mei Ling

A2 - Turk, Matthew

A2 - Fussell, Susan R.

A2 - Schuller, Bjorn

A2 - Song, Yale

A2 - Yu, Kai

PB - Association for Computing Machinery, Inc

T2 - 21st ACM International Conference on Multimodal Interaction, ICMI 2019

Y2 - 14 October 2019 through 18 October 2019

ER -

Yuan Y, Tang W, Fan M, Cao Y, Zhang P, Xie L. Deep audio-visual system for closed-set word-level speech recognition. In Gao W, Meng HML, Turk M, Fussell SR, Schuller B, Song Y, Yu K, editors, ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction. Association for Computing Machinery, Inc. 2019. p. 540-545. (ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction). doi: 10.1145/3340555.3356102
