TY - JOUR
T1 - Audio-Driven Talking Face Generation with Segmented Static Facial References for Customized Health Device Interactions
AU - Wang, Zige
AU - Wang, Yashuai
AU - Liu, Tianyu
AU - Zhang, Peng
AU - Xie, Lei
AU - Guo, Yangming
N1 - Publisher Copyright:
© 1975-2011 IEEE.
PY - 2025
Y1 - 2025
N2 - In a variety of human-machine interaction (HMI) applications, high-level techniques based on audio-driven talking face generation are often challenged by temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs. However, temporal disturbances introduced during the inference phase continue to limit generative performance. Inspired by the intrinsic connection between a segmented static facial image and a stable appearance representation, this study proposes two strategies, Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR), to improve performance during the inference stage. MTS segments the input video into several clips, effectively reducing the complexity of the inference process, while SFR utilizes static facial references to mitigate the temporal noise introduced by dynamic sequences, thereby enhancing the quality of the generated outputs. Extensive experiments on the LRS2 and VoxCeleb2 datasets demonstrate that the proposed strategies significantly enhance inference performance in terms of the LSE-C and LSE-D metrics, without altering the network architecture or training strategy. To validate effectiveness in realistic application scenarios, the proposed solution has also been deployed on healthcare devices.
AB - In a variety of human-machine interaction (HMI) applications, high-level techniques based on audio-driven talking face generation are often challenged by temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs. However, temporal disturbances introduced during the inference phase continue to limit generative performance. Inspired by the intrinsic connection between a segmented static facial image and a stable appearance representation, this study proposes two strategies, Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR), to improve performance during the inference stage. MTS segments the input video into several clips, effectively reducing the complexity of the inference process, while SFR utilizes static facial references to mitigate the temporal noise introduced by dynamic sequences, thereby enhancing the quality of the generated outputs. Extensive experiments on the LRS2 and VoxCeleb2 datasets demonstrate that the proposed strategies significantly enhance inference performance in terms of the LSE-C and LSE-D metrics, without altering the network architecture or training strategy. To validate effectiveness in realistic application scenarios, the proposed solution has also been deployed on healthcare devices.
KW - AI health care
KW - inference performance
KW - lip synthesis
KW - Talking face generation
KW - video generation
UR - http://www.scopus.com/inward/record.url?scp=105004039008&partnerID=8YFLogxK
U2 - 10.1109/TCE.2025.3565518
DO - 10.1109/TCE.2025.3565518
M3 - Article
AN - SCOPUS:105004039008
SN - 0098-3063
JO - IEEE Transactions on Consumer Electronics
JF - IEEE Transactions on Consumer Electronics
ER -