TY - JOUR
T1 - Facial Action Unit Representation Based on Self-Supervised Learning With Ensembled Priori Constraints
AU - Chen, Haifeng
AU - Zhang, Peng
AU - Guo, Chujia
AU - Lu, Ke
AU - Jiang, Dongmei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Facial action units (AUs) describe a comprehensive set of atomic facial muscle movements for human expression understanding. With supervised learning, discriminative AU representations can be obtained from the local patches where the AUs are located. Unfortunately, accurate AU localization and characterization require tremendous manual annotation, which limits the performance of AU recognition in realistic scenarios. In this study, we propose an end-to-end self-supervised AU representation learning model (SsupAU) that learns AU representations from unlabeled facial videos. Specifically, the input face is decomposed into six components using auto-encoders: five photo-geometrically meaningful components, together with AUs represented as a 2D flow field. By gradually constructing the canonical neutral face, the posed neutral face, and the posed expressional face, these components can be disentangled without supervision, and the AU representations can therefore be learned. To construct the canonical neutral face without manually labeled ground truth for emotional state or AU intensity, two assumptions based on prior knowledge are proposed: 1) identity consistency, which exploits the identical albedos and depths of different frames in a face video and helps to learn the camera color mode as an extra cue for canonical neutral face recovery; and 2) average face, which enables the model to discover a 'neutral facial expression' of the canonical neutral face and decouple the AUs in representation learning. To the best of our knowledge, this is the first attempt to design a self-supervised AU representation learning method based on the definition of AUs. Extensive experiments on benchmark datasets demonstrate the superior performance of the proposed method compared with other state-of-the-art approaches, as well as an outstanding capability of decomposing the input face into meaningful factors for reconstruction.
AB - Facial action units (AUs) describe a comprehensive set of atomic facial muscle movements for human expression understanding. With supervised learning, discriminative AU representations can be obtained from the local patches where the AUs are located. Unfortunately, accurate AU localization and characterization require tremendous manual annotation, which limits the performance of AU recognition in realistic scenarios. In this study, we propose an end-to-end self-supervised AU representation learning model (SsupAU) that learns AU representations from unlabeled facial videos. Specifically, the input face is decomposed into six components using auto-encoders: five photo-geometrically meaningful components, together with AUs represented as a 2D flow field. By gradually constructing the canonical neutral face, the posed neutral face, and the posed expressional face, these components can be disentangled without supervision, and the AU representations can therefore be learned. To construct the canonical neutral face without manually labeled ground truth for emotional state or AU intensity, two assumptions based on prior knowledge are proposed: 1) identity consistency, which exploits the identical albedos and depths of different frames in a face video and helps to learn the camera color mode as an extra cue for canonical neutral face recovery; and 2) average face, which enables the model to discover a 'neutral facial expression' of the canonical neutral face and decouple the AUs in representation learning. To the best of our knowledge, this is the first attempt to design a self-supervised AU representation learning method based on the definition of AUs. Extensive experiments on benchmark datasets demonstrate the superior performance of the proposed method compared with other state-of-the-art approaches, as well as an outstanding capability of decomposing the input face into meaningful factors for reconstruction.
KW - AU representation
KW - average face
KW - canonical neutral face
KW - identity consistency
KW - self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85202785742&partnerID=8YFLogxK
U2 - 10.1109/TIP.2024.3446250
DO - 10.1109/TIP.2024.3446250
M3 - Article
C2 - 39186413
AN - SCOPUS:85202785742
SN - 1057-7149
VL - 33
SP - 5045
EP - 5059
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -