Audio-Driven Talking Face Generation with Segmented Static Facial References for Customized Health Device Interactions

Zige Wang, Yashuai Wang, Tianyu Liu, Peng Zhang, Lei Xie, Yangming Guo

Research output: Contribution to journal › Article › peer-review

Abstract

In a variety of human-machine interaction (HMI) applications, audio-driven talking face generation techniques are often challenged by temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs; however, temporal disturbances introduced during the inference phase continue to limit generative performance. Inspired by the intrinsic connection between a segmented static facial image and a stable appearance representation, this study proposes two strategies, Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR), to improve performance at the inference stage. Specifically, MTS segments the input video into several clips, effectively reducing the complexity of the inference process, while SFR utilizes static facial references to mitigate the temporal noise generated by dynamic sequences, thereby enhancing the quality of the generated outputs. Extensive experiments on the LRS2 and VoxCeleb2 datasets demonstrate that the proposed strategies significantly improve inference performance on the LSE-C and LSE-D metrics without altering the network architecture or training strategy. To validate effectiveness in realistic application scenarios, the proposed solution has also been deployed on healthcare devices.
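To make the two strategies concrete, the sketch below shows one way MTS and SFR could be applied at inference time. It is an illustrative assumption, not the authors' implementation: segment_video, infer_with_sfr, the clip length of 25 frames, and the generator interface are all hypothetical stand-ins for the paper's components.

    import numpy as np

    def segment_video(frames: np.ndarray, clip_len: int = 25) -> list:
        """MTS: split the input frame sequence into fixed-length clips,
        so each clip is inferred independently (hypothetical clip length)."""
        return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

    def infer_with_sfr(frames: np.ndarray, audio_chunks: list, generator) -> np.ndarray:
        """SFR: drive every clip with a single static facial reference
        instead of dynamic per-frame references, suppressing temporal noise."""
        outputs = []
        for clip, audio in zip(segment_video(frames), audio_chunks):
            static_ref = clip[0]  # one stable appearance frame per clip
            # Repeat the static reference for every frame in the clip.
            refs = np.repeat(static_ref[None], len(clip), axis=0)
            # `generator` is a placeholder for the audio-driven face generator.
            outputs.append(generator(refs, audio))
        return np.concatenate(outputs, axis=0)

Under these assumptions, segmentation bounds the temporal context each inference call must handle, and the repeated static reference removes frame-to-frame appearance jitter from the conditioning signal.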

Original language: English
Journal: IEEE Transactions on Consumer Electronics
DOIs
State: Accepted/In press - 2025

Keywords

  • AI health care
  • inference performance
  • lip synthesis
  • talking face generation
  • video generation
