Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

  • He Wang
  • Pengcheng Guo
  • Xucheng Wan
  • Huan Zhou
  • Lei Xie

  • Northwestern Polytechnical University Xian
  • Huawei Technologies Co., Ltd.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches only use a single visual encoder to model input videos of a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multi-encoder. Specifically, we first introduce a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face and propose an Enhanced ResNet3D visual front-end (VFE) to extract lip features at different scales. For the multi-encoder, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video data scales and encoders on ALR system performance and fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Finally, our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set.
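
The abstract reports its result as a character error rate (CER) but this record does not spell out how it is computed. As a minimal sketch, assuming the conventional definition (character-level Levenshtein distance between reference and hypothesis, divided by the reference length), CER could be calculated as below. The function names `edit_distance` and `cer` are illustrative and are not taken from the paper or the challenge toolkit.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Assumed CER definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("abcd", "abxd")` gives 0.25: one substituted character out of four reference characters.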

Original language: English
Title of host publication: 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350379815
State: Published - 2024
Event: 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024 - Niagara Falls, Canada
Duration: 15 Jul 2024 to 19 Jul 2024

Publication series

Name: 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024

Conference

Conference: 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024
Country/Territory: Canada
City: Niagara Falls
Period: 15/07/24 to 19/07/24

Keywords

  • Branchformer
  • E-Branchformer
  • Lip Reading
  • Visual Speech Recognition
