TY - JOUR
T1 - An effective conversion of visemes to words for high-performance automatic lipreading
AU - Fenghour, Souheil
AU - Chen, Daqing
AU - Guo, Kun
AU - Li, Bo
AU - Xiao, Perry
N1 - Publisher Copyright:
© 2021 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2021/12/1
Y1 - 2021/12/1
AB - As an alternative to word- and character-based classification, viseme-based lipreading systems have demonstrated promising performance in decoding videos of people uttering entire sentences. However, the overall performance of such systems is significantly affected by the efficiency of the viseme-to-word conversion performed during the lipreading process. As shown in the literature, this conversion has become a bottleneck: a system's performance can drop dramatically from a high viseme classification accuracy (e.g., over 90%) to a comparatively low word classification accuracy (e.g., just over 60%). The underlying cause is that roughly half of the words in the English language are homophemes, i.e., words that are visually indistinguishable on the lips, so a single sequence of visemes can map to multiple words, e.g., “time” and “some”. In this paper, aiming to tackle this issue, a deep learning network model with an attention-based Gated Recurrent Unit (GRU) is proposed for efficient viseme-to-word conversion and compared against three other approaches. The proposed approach features strong robustness, high efficiency, and short execution time, and it has been verified through analysis and practical experiments on predicting sentences from the benchmark LRS2 and LRS3 datasets. The main contributions of the paper are as follows: (1) a model is developed that is effective in converting visemes to words, discriminates between homopheme words, and is robust to incorrectly classified visemes; (2) the proposed model uses few parameters and, therefore, requires little overhead and time to train and execute; and (3) an improved performance in predicting spoken sentences from the LRS2 dataset, with an attained word accuracy rate of 79.6%, a 15.0% improvement over state-of-the-art approaches.
KW - Augmentation
KW - Deep learning
KW - Gated recurrent unit
KW - Lip reading
KW - Neural networks
KW - Recurrent neural networks
KW - Robustness
KW - Speech recognition
KW - Visemes
UR - http://www.scopus.com/inward/record.url?scp=85119718685&partnerID=8YFLogxK
U2 - 10.3390/s21237890
DO - 10.3390/s21237890
M3 - Article
C2 - 34883888
AN - SCOPUS:85119718685
SN - 1424-8220
VL - 21
JO - Sensors
JF - Sensors
IS - 23
M1 - 7890
ER -