Deep temporal architecture for audiovisual speech recognition

Chunlin Tian, Yuan Yuan, Xiaoqiang Lu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Audiovisual Speech Recognition (AVSR) is an application of multimodal machine learning related to speech recognition, lipreading systems, and video classification. Recent work has increasingly applied Deep Neural Networks (DNNs) to AVSR, and several DNN models, including the Multimodal Deep Autoencoder, Multimodal Deep Belief Network, and Multimodal Deep Boltzmann Machine, perform well in experiments owing to their better generalization and nonlinear transformations. However, these DNN models have several disadvantages: (1) they mainly address modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connections among frames during modal fusion; (3) they are not end-to-end structures. We propose a deep temporal architecture that incorporates not only classical modal fusion but also temporal modal fusion and temporal fusion. Furthermore, we study overfitting and learning from small sample sizes in AVSR, and propose a set of useful training strategies. Experiments on three datasets (AVLetters, AVLetters2, and AVDigits) show the superiority of our model and the necessity of the training strategies.
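The abstract's distinction between modal fusion (combining audio and video features within each frame) and temporal fusion (carrying information across frames) can be sketched as a toy pipeline. This is an illustrative NumPy sketch under assumed dimensions and random stand-in weights, not the paper's architecture: `encode`, the tanh recurrence, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper).
T, D_AUDIO, D_VIDEO, D_FUSED, N_CLASSES = 8, 26, 40, 32, 10

def encode(x, w):
    """Per-modality nonlinear encoder (a single tanh layer)."""
    return np.tanh(x @ w)

# Random weights stand in for learned parameters.
w_a = rng.normal(0, 0.1, (D_AUDIO, D_FUSED))
w_v = rng.normal(0, 0.1, (D_VIDEO, D_FUSED))
w_h = rng.normal(0, 0.1, (2 * D_FUSED, D_FUSED))  # modal fusion layer
w_r = rng.normal(0, 0.1, (D_FUSED, D_FUSED))      # temporal recurrence
w_o = rng.normal(0, 0.1, (D_FUSED, N_CLASSES))    # classifier head

audio = rng.normal(size=(T, D_AUDIO))  # e.g. per-frame acoustic features
video = rng.normal(size=(T, D_VIDEO))  # e.g. per-frame lip-region features

# 1) Modal fusion: join the two modalities frame by frame.
fused = np.tanh(
    np.concatenate([encode(audio, w_a), encode(video, w_v)], axis=1) @ w_h
)

# 2) Temporal fusion: a simple recurrence carries context across frames.
h = np.zeros(D_FUSED)
for t in range(T):
    h = np.tanh(fused[t] + h @ w_r)

# 3) Classification from the final temporal state (softmax over classes).
logits = h @ w_o
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)
```

Because every stage is a differentiable function of the inputs, the whole chain could in principle be trained end-to-end, which is the property the abstract says earlier DNN fusion models lack.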

Original language: English
Title of host publication: Computer Vision - 2nd CCF Chinese Conference, CCCV 2017, Proceedings
Editors: Xiang Bai, Qinghua Hu, Liang Wang, Qingshan Liu, Jinfeng Yang, Ming-Ming Cheng, Deyu Meng
Publisher: Springer Verlag
Pages: 650-661
Number of pages: 12
ISBN (Print): 9789811072987
DOI
Publication status: Published - 2017
Externally published: Yes
Event: 2nd Chinese Conference on Computer Vision, CCCV 2017 - Tianjin, China
Duration: 11 Oct 2017 → 14 Oct 2017

Publication series

Name: Communications in Computer and Information Science
Volume: 771
ISSN (Print): 1865-0929

Conference

Conference: 2nd Chinese Conference on Computer Vision, CCCV 2017
Country/Territory: China
City: Tianjin
Period: 11/10/17 → 14/10/17
