Audio-Driven Talking Face Generation with Segmented Static Facial References for Customized Health Device Interactions

Zige Wang, Yashuai Wang, Tianyu Liu, Peng Zhang, Lei Xie, Yangming Guo

Research output: Contribution to journal › Article › peer-review

Abstract

In a variety of human-machine interaction (HMI) applications, high-level techniques based on audio-driven talking face generation are often challenged by temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs. However, temporal disturbances introduced during the inference phase continue to limit generative performance. Inspired by the intrinsic connection between segmented static facial images and stable appearance representations, this study proposes two strategies, Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR), to improve performance during the inference stage. MTS segments the input video into several clips, effectively reducing the complexity of the inference process, while SFR uses static facial references to mitigate the temporal noise introduced by dynamic sequences, thereby improving the quality of the generated outputs. Extensive experiments on the LRS2 and VoxCeleb2 datasets demonstrate that the proposed strategies significantly enhance inference performance in terms of the LSE-C and LSE-D metrics, without altering the network architecture or training strategy. To validate effectiveness in realistic application scenarios, the proposed solution has also been deployed on healthcare devices.
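As a rough illustration of the two inference-time strategies described above, the sketch below splits an input frame sequence into fixed-length clips (MTS) and drives each clip with a single static reference face rather than the full dynamic sequence (SFR). All names (`segment_clips`, `infer_with_static_reference`, the `generator` callable, the choice of the first frame as reference, and the clip length of 25) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of inference-time MTS + SFR; identifiers are illustrative only.
from typing import Callable, List, Sequence


def segment_clips(items: Sequence, clip_len: int) -> List[Sequence]:
    """Manual Temporal Segmentation (MTS): split a sequence into fixed-length clips."""
    return [items[i:i + clip_len] for i in range(0, len(items), clip_len)]


def infer_with_static_reference(
    frames: Sequence,
    audio_chunks: Sequence,
    generator: Callable,  # assumed talking-face generator: (reference_face, audio_chunk) -> frame
    clip_len: int = 25,   # assumed clip length, e.g. one second at 25 fps
) -> List:
    """Static Facial Reference (SFR): drive each clip with one static face instead of the dynamic sequence."""
    outputs = []
    frame_clips = segment_clips(frames, clip_len)
    audio_clips = segment_clips(audio_chunks, clip_len)
    for frame_clip, audio_clip in zip(frame_clips, audio_clips):
        reference_face = frame_clip[0]  # one static reference per clip (assumption)
        for audio in audio_clip:
            outputs.append(generator(reference_face, audio))
    return outputs
```

Under these assumptions, the generator itself is untouched; only the inference loop changes, which is consistent with the paper's claim that no modification to the network architecture or training strategy is required.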

Original language: English
Journal: IEEE Transactions on Consumer Electronics
DOI
Publication status: Accepted/In press - 2025

