Abstract
Lip-reading is the process of interpreting speech by visually analysing lip movements. Recent research in this area has shifted from recognising isolated words to lip-reading sentences in the wild. This paper uses phonemes as the classification schema for lip-reading sentences, both to explore an alternative schema and to enhance system performance. Different classification schemas have been investigated, including character-based and viseme-based schemas. The visual front-end of the system consists of a spatio-temporal (3D) convolution followed by a 2D ResNet. The phoneme recognition model is a Transformer with multi-headed attention, and a Recurrent Neural Network is used as the language model. The proposed system has been evaluated on the BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. Compared with state-of-the-art approaches to lip-reading sentences, the proposed system achieves a 10% lower word error rate on average under varying illumination ratios.
| Original language | English |
|---|---|
| Pages (from-to) | 129-138 |
| Number of pages | 10 |
| Journal | CAAI Transactions on Intelligence Technology |
| Volume | 8 |
| Issue | 1 |
| DOI | |
| Publication status | Published - March 2023 |