TY - GEN
T1 - Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder
AU - Wang, He
AU - Guo, Pengcheng
AU - Wan, Xucheng
AU - Zhou, Huan
AU - Xie, Lei
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches only use a single visual encoder to model input videos of a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multi-encoder. Specifically, we first introduce a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face and propose an Enhanced ResNet3D visual front-end (VFE) to extract lip features at different scales. For the multi-encoder, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video data scales and encoders on ALR system performance and fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Finally, our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set.
AB - Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches only use a single visual encoder to model input videos of a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multi-encoder. Specifically, we first introduce a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face and propose an Enhanced ResNet3D visual front-end (VFE) to extract lip features at different scales. For the multi-encoder, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video data scales and encoders on ALR system performance and fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Finally, our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set.
KW - Branchformer
KW - E-Branchformer
KW - Lip Reading
KW - Visual Speech Recognition
UR - https://www.scopus.com/pages/publications/85203393202
U2 - 10.1109/ICMEW63481.2024.10645400
DO - 10.1109/ICMEW63481.2024.10645400
M3 - Conference contribution
AN - SCOPUS:85203393202
T3 - 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024
BT - 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2024
Y2 - 15 July 2024 through 19 July 2024
ER -