Audio-Driven Talking Face Generation with Segmented Static Facial References for Customized Health Device Interactions

Zige Wang, Yashuai Wang, Tianyu Liu, Peng Zhang, Lei Xie, Yangming Guo

Research output: Contribution to journal › Article › peer-review

Abstract

In a variety of human-machine interaction (HMI) applications, high-level techniques based on audio-driven talking face generation are often challenged by temporal misalignment and low-quality outputs. Recent solutions have sought to improve synchronization by maximizing the similarity between audio-visual pairs. However, temporal disturbances introduced during the inference phase continue to limit generative performance. Inspired by the intrinsic connection between segmented static facial images and stable appearance representations, this study proposes two strategies, Manual Temporal Segmentation (MTS) and Static Facial Reference (SFR), to improve performance during the inference stage. MTS segments the input video into several clips, effectively reducing the complexity of the inference process, while SFR uses static facial references to mitigate the temporal noise introduced by dynamic sequences, thereby improving the quality of the generated outputs. Extensive experiments on the LRS2 and VoxCeleb2 datasets demonstrate that the proposed strategies significantly enhance inference performance in terms of the LSE-C and LSE-D metrics, without altering the network architecture or training strategy. To validate effectiveness in realistic application scenarios, the proposed solution has also been deployed on healthcare devices.
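As a rough illustration of the two inference-time strategies described above, the sketch below splits an input frame sequence into fixed-length clips (MTS) and drives each clip with a single static reference face rather than the full dynamic sequence (SFR). All names (`segment_clips`, `infer_with_static_reference`, the `generator` callable, the choice of the first frame as reference, and the clip length of 25) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of inference-time MTS + SFR; identifiers are illustrative only.
from typing import Callable, List, Sequence


def segment_clips(items: Sequence, clip_len: int) -> List[Sequence]:
    """Manual Temporal Segmentation (MTS): split a sequence into fixed-length clips."""
    return [items[i:i + clip_len] for i in range(0, len(items), clip_len)]


def infer_with_static_reference(
    frames: Sequence,
    audio_chunks: Sequence,
    generator: Callable,  # assumed talking-face generator: (reference_face, audio_chunk) -> frame
    clip_len: int = 25,   # assumed clip length, e.g. one second at 25 fps
) -> List:
    """Static Facial Reference (SFR): drive each clip with one static face instead of the dynamic sequence."""
    outputs = []
    frame_clips = segment_clips(frames, clip_len)
    audio_clips = segment_clips(audio_chunks, clip_len)
    for frame_clip, audio_clip in zip(frame_clips, audio_clips):
        reference_face = frame_clip[0]  # one static reference per clip (assumption)
        for audio in audio_clip:
            outputs.append(generator(reference_face, audio))
    return outputs
```

Under these assumptions, the generator itself is untouched; only the inference loop changes, which is consistent with the paper's claim that no modification to the network architecture or training strategy is required.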

Original language: English
Journal: IEEE Transactions on Consumer Electronics
DOI
Publication status: Accepted/In press - 2025

