Facial Action Unit Representation Based on Self-Supervised Learning With Ensembled Priori Constraints

Haifeng Chen; Peng Zhang; Chujia Guo; Ke Lu; Dongmei Jiang

doi:10.1109/TIP.2024.3446250

School of Computer Science

Research output: Contribution to journal › Article › peer-review

Abstract

Facial action units (AUs) focus on a comprehensive set of atomic facial muscle movements for human expression understanding. Based on supervised learning, discriminative AU representation can be achieved from local patches where the AUs are located. Unfortunately, accurate AU localization and characterization are challenged by the tremendous manual annotations, which limits the performance of AU recognition in realistic scenarios. In this study, we propose an end-to-end self-supervised AU representation learning model (SsupAU) to learn AU representations from unlabeled facial videos. Specifically, the input face is decomposed into six components using auto-encoders: five photo-geometric meaningful components, together with 2D flow field AUs. By constructing the canonical neutral face, posed neutral face, and posed expressional face gradually, these components can be disentangled without supervision, therefore the AU representations can be learned. To construct the canonical neutral face without manually labeled ground truth of emotion state or AU intensity, two priori knowledge based assumptions are proposed: 1) identity consistency, which explores the identical albedos and depths of different frames in a face video, and helps to learn the camera color mode as an extra cue for canonical neutral face recovery. 2) average face, which enables the model to discover a 'neutral facial expression' of the canonical neutral face and decouple the AUs in representation learning. To the best of our knowledge, this is the first attempt to design self-supervised AU representation learning method based on the definition of AUs. Substantial experiments on benchmark datasets have demonstrated the superior performance of the proposed work in comparison to other state-of-the-art approaches, as well as an outstanding capability of decomposing input face into meaningful factors for its reconstruction.

Original language	English
Pages (from-to)	5045-5059
Number of pages	15
Journal	IEEE Transactions on Image Processing
Volume	33
DOI	https://doi.org/10.1109/TIP.2024.3446250
Publication status	Published - 2024

Cite this

@article{83ee4d2427174d2e91737cac88c60ab0,

title = "Facial Action Unit Representation Based on Self-Supervised Learning With Ensembled Priori Constraints",

abstract = "Facial action units (AUs) focus on a comprehensive set of atomic facial muscle movements for human expression understanding. Based on supervised learning, discriminative AU representation can be achieved from local patches where the AUs are located. Unfortunately, accurate AU localization and characterization are challenged by the tremendous manual annotations, which limits the performance of AU recognition in realistic scenarios. In this study, we propose an end-to-end self-supervised AU representation learning model (SsupAU) to learn AU representations from unlabeled facial videos. Specifically, the input face is decomposed into six components using auto-encoders: five photo-geometric meaningful components, together with 2D flow field AUs. By constructing the canonical neutral face, posed neutral face, and posed expressional face gradually, these components can be disentangled without supervision, therefore the AU representations can be learned. To construct the canonical neutral face without manually labeled ground truth of emotion state or AU intensity, two priori knowledge based assumptions are proposed: 1) identity consistency, which explores the identical albedos and depths of different frames in a face video, and helps to learn the camera color mode as an extra cue for canonical neutral face recovery. 2) average face, which enables the model to discover a 'neutral facial expression' of the canonical neutral face and decouple the AUs in representation learning. To the best of our knowledge, this is the first attempt to design self-supervised AU representation learning method based on the definition of AUs. Substantial experiments on benchmark datasets have demonstrated the superior performance of the proposed work in comparison to other state-of-the-art approaches, as well as an outstanding capability of decomposing input face into meaningful factors for its reconstruction.",

keywords = "AU representation, average face, canonical neutral face, identity consistency, self-supervised learning",

author = "Haifeng Chen and Peng Zhang and Chujia Guo and Ke Lu and Dongmei Jiang",

note = "Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",

year = "2024",

doi = "10.1109/TIP.2024.3446250",

language = "English",

volume = "33",

pages = "5045--5059",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Facial Action Unit Representation Based on Self-Supervised Learning With Ensembled Priori Constraints

AU - Chen, Haifeng

AU - Zhang, Peng

AU - Guo, Chujia

AU - Lu, Ke

AU - Jiang, Dongmei

PY - 2024

Y1 - 2024

N2 - Facial action units (AUs) focus on a comprehensive set of atomic facial muscle movements for human expression understanding. Based on supervised learning, discriminative AU representation can be achieved from local patches where the AUs are located. Unfortunately, accurate AU localization and characterization are challenged by the tremendous manual annotations, which limits the performance of AU recognition in realistic scenarios. In this study, we propose an end-to-end self-supervised AU representation learning model (SsupAU) to learn AU representations from unlabeled facial videos. Specifically, the input face is decomposed into six components using auto-encoders: five photo-geometric meaningful components, together with 2D flow field AUs. By constructing the canonical neutral face, posed neutral face, and posed expressional face gradually, these components can be disentangled without supervision, therefore the AU representations can be learned. To construct the canonical neutral face without manually labeled ground truth of emotion state or AU intensity, two priori knowledge based assumptions are proposed: 1) identity consistency, which explores the identical albedos and depths of different frames in a face video, and helps to learn the camera color mode as an extra cue for canonical neutral face recovery. 2) average face, which enables the model to discover a 'neutral facial expression' of the canonical neutral face and decouple the AUs in representation learning. To the best of our knowledge, this is the first attempt to design self-supervised AU representation learning method based on the definition of AUs. Substantial experiments on benchmark datasets have demonstrated the superior performance of the proposed work in comparison to other state-of-the-art approaches, as well as an outstanding capability of decomposing input face into meaningful factors for its reconstruction.

AB - Facial action units (AUs) focus on a comprehensive set of atomic facial muscle movements for human expression understanding. Based on supervised learning, discriminative AU representation can be achieved from local patches where the AUs are located. Unfortunately, accurate AU localization and characterization are challenged by the tremendous manual annotations, which limits the performance of AU recognition in realistic scenarios. In this study, we propose an end-to-end self-supervised AU representation learning model (SsupAU) to learn AU representations from unlabeled facial videos. Specifically, the input face is decomposed into six components using auto-encoders: five photo-geometric meaningful components, together with 2D flow field AUs. By constructing the canonical neutral face, posed neutral face, and posed expressional face gradually, these components can be disentangled without supervision, therefore the AU representations can be learned. To construct the canonical neutral face without manually labeled ground truth of emotion state or AU intensity, two priori knowledge based assumptions are proposed: 1) identity consistency, which explores the identical albedos and depths of different frames in a face video, and helps to learn the camera color mode as an extra cue for canonical neutral face recovery. 2) average face, which enables the model to discover a 'neutral facial expression' of the canonical neutral face and decouple the AUs in representation learning. To the best of our knowledge, this is the first attempt to design self-supervised AU representation learning method based on the definition of AUs. Substantial experiments on benchmark datasets have demonstrated the superior performance of the proposed work in comparison to other state-of-the-art approaches, as well as an outstanding capability of decomposing input face into meaningful factors for its reconstruction.

KW - AU representation

KW - average face

KW - canonical neutral face

KW - identity consistency

KW - self-supervised learning

UR - http://www.scopus.com/inward/record.url?scp=85202785742&partnerID=8YFLogxK

U2 - 10.1109/TIP.2024.3446250

DO - 10.1109/TIP.2024.3446250

M3 - Article

C2 - 39186413

AN - SCOPUS:85202785742

SN - 1057-7149

VL - 33

SP - 5045

EP - 5059

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

ER -
