TY - GEN
T1 - Deep temporal architecture for audiovisual speech recognition
AU - Tian, Chunlin
AU - Yuan, Yuan
AU - Lu, Xiaoqiang
N1 - Publisher Copyright:
© Springer Nature Singapore Pte Ltd. 2017.
PY - 2017
Y1 - 2017
N2 - Audiovisual Speech Recognition (AVSR) is an application of multimodal machine learning related to speech recognition, lipreading systems, and video classification. Recent work has devoted increasing effort to Deep Neural Networks (DNNs) for AVSR, and DNN models such as the Multimodal Deep Autoencoder, the Multimodal Deep Belief Network, and the Multimodal Deep Boltzmann Machine perform well in experiments owing to their stronger generalization and nonlinear transformations. However, these DNN models have several disadvantages: (1) they mainly address modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connections among frames during modal fusion; (3) these models are not end-to-end structures. We propose a deep temporal architecture that performs not only classical modal fusion but also temporal modal fusion and temporal fusion. Furthermore, we study overfitting and learning from small sample sizes in AVSR and propose a set of useful training strategies. Experiments on three datasets (AVLetters, AVLetters2, and AVDigits) demonstrate the superiority of our model and the necessity of the training strategies.
AB - Audiovisual Speech Recognition (AVSR) is an application of multimodal machine learning related to speech recognition, lipreading systems, and video classification. Recent work has devoted increasing effort to Deep Neural Networks (DNNs) for AVSR, and DNN models such as the Multimodal Deep Autoencoder, the Multimodal Deep Belief Network, and the Multimodal Deep Boltzmann Machine perform well in experiments owing to their stronger generalization and nonlinear transformations. However, these DNN models have several disadvantages: (1) they mainly address modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connections among frames during modal fusion; (3) these models are not end-to-end structures. We propose a deep temporal architecture that performs not only classical modal fusion but also temporal modal fusion and temporal fusion. Furthermore, we study overfitting and learning from small sample sizes in AVSR and propose a set of useful training strategies. Experiments on three datasets (AVLetters, AVLetters2, and AVDigits) demonstrate the superiority of our model and the necessity of the training strategies.
KW - Audiovisual speech recognition
KW - Multimodal deep learning
UR - http://www.scopus.com/inward/record.url?scp=85037845041&partnerID=8YFLogxK
U2 - 10.1007/978-981-10-7299-4_54
DO - 10.1007/978-981-10-7299-4_54
M3 - Conference contribution
AN - SCOPUS:85037845041
SN - 9789811072987
T3 - Communications in Computer and Information Science
SP - 650
EP - 661
BT - Computer Vision - 2nd CCF Chinese Conference, CCCV 2017, Proceedings
A2 - Bai, Xiang
A2 - Hu, Qinghua
A2 - Wang, Liang
A2 - Liu, Qingshan
A2 - Yang, Jinfeng
A2 - Cheng, Ming-Ming
A2 - Meng, Deyu
PB - Springer Verlag
T2 - 2nd Chinese Conference on Computer Vision, CCCV 2017
Y2 - 11 October 2017 through 14 October 2017
ER -