Deep audio-visual system for closed-set word-level speech recognition

Yougen Yuan, Wei Tang, Minhao Fan, Yue Cao, Peng Zhang, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Audio-visual understanding is often challenged by the gap in bridging complementary audio and visual information. Motivated by recent audio-visual studies, a closed-set word-level speech recognition scheme is proposed for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge in this study. To initialize the audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained for each modality. With two fully connected layers on top of the concatenated encoder outputs for audio-visual joint training, the proposed scheme won first place with a relative word-accuracy improvement of 7.9% over the audio-only system. Experiments on the LRW-1000 dataset demonstrate that the proposed joint training scheme, by incorporating both modalities, is particularly effective at improving recognition of short-duration samples, revealing the audio-visual complementarity.
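The fusion scheme described in the abstract (modality-specific encoders, concatenated encoder outputs, two fully connected layers producing word posteriors) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all layer sizes, feature dimensions, and the simple additive attention pooling are assumptions; the paper does not specify them here.

```python
import torch
import torch.nn as nn

class AVWordClassifier(nn.Module):
    """Sketch of the abstract's fusion scheme: a 3D-CNN visual front-end,
    Bi-LSTM encoders per modality, attention pooling over time, then two
    fully connected layers on the concatenated encoder outputs."""

    def __init__(self, num_words=1000, hidden=256):
        super().__init__()
        # Visual front-end: 3D CNN over grayscale lip-region frames (C, T, H, W).
        self.visual_cnn = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space
        )
        # Bi-LSTM encoders; 40-dim audio features (e.g. filterbanks) assumed.
        self.visual_lstm = nn.LSTM(32, hidden, bidirectional=True, batch_first=True)
        self.audio_lstm = nn.LSTM(40, hidden, bidirectional=True, batch_first=True)
        # Additive attention to pool each encoder's outputs over time.
        self.att_v = nn.Linear(2 * hidden, 1)
        self.att_a = nn.Linear(2 * hidden, 1)
        # Two fully connected layers on the concatenated encoder outputs.
        self.fusion = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_words),
        )

    def attend(self, h, att):
        w = torch.softmax(att(h), dim=1)   # (B, T, 1) attention weights
        return (w * h).sum(dim=1)          # (B, 2*hidden) pooled encoding

    def forward(self, video, audio):
        # video: (B, 1, T, H, W) lip frames; audio: (B, T', 40) features
        v = self.visual_cnn(video)                     # (B, 32, T, 1, 1)
        v = v.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        hv, _ = self.visual_lstm(v)
        ha, _ = self.audio_lstm(audio)
        fused = torch.cat([self.attend(hv, self.att_v),
                           self.attend(ha, self.att_a)], dim=-1)
        return self.fusion(fused)          # (B, num_words) word logits

# Closed-set word classification: 29 video frames, 100 audio frames.
logits = AVWordClassifier()(torch.randn(2, 1, 29, 48, 48),
                            torch.randn(2, 100, 40))
print(tuple(logits.shape))  # (2, 1000)
```

Joint training would optimize a cross-entropy loss over the closed word set; the per-modality encoders can first be pretrained on single-modality recognition, matching the initialization strategy the abstract describes.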

Original language: English
Title of host publication: ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction
Editors: Wen Gao, Helen Mei Ling Meng, Matthew Turk, Susan R. Fussell, Bjorn Schuller, Yale Song, Kai Yu
Publisher: Association for Computing Machinery, Inc
Pages: 540-545
Number of pages: 6
ISBN (Electronic): 9781450368605
DOI
Publication status: Published - 14 Oct 2019
Event: 21st ACM International Conference on Multimodal Interaction, ICMI 2019 - Suzhou, China
Duration: 14 Oct 2019 - 18 Oct 2019

Publication series

Name: ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction

Conference

Conference: 21st ACM International Conference on Multimodal Interaction, ICMI 2019
Country/Territory: China
City: Suzhou
Period: 14/10/19 - 18/10/19
