TY - JOUR
T1 - Audio-Visual Kinship Verification
T2 - A New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach
AU - Wu, Xiaoting
AU - Zhang, Xueyi
AU - Feng, Xiaoyi
AU - Bordallo Lopez, Miguel
AU - Liu, Li
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2024/3/1
Y1 - 2024/3/1
N2 - Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to potential practical applications. Over the past decade, many efforts have been devoted to improving the verification performance from human faces only while lacking other biometric information, for example, speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer it as the audio-visual kinship verification study. We first establish a comprehensive audio-visual kinship dataset that consists of familial talking facial videos under various scenarios, called TALKIN-Family. Based on the dataset, we present the extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML). It consists of the adversarial network and the attention module on the basis of unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem. Furthermore, the proposed fusion method outperforms baseline methods. In addition, we also evaluate the human verification ability on a subset of TALKIN-Family. It indicates that humans have higher accuracy when they have access to both faces and voices. The machine-learning methods could effectively and efficiently outperform the human ability. Finally, we include the future work and research opportunities with the TALKIN-Family dataset.
AB - Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to potential practical applications. Over the past decade, many efforts have been devoted to improving the verification performance from human faces only while lacking other biometric information, for example, speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer it as the audio-visual kinship verification study. We first establish a comprehensive audio-visual kinship dataset that consists of familial talking facial videos under various scenarios, called TALKIN-Family. Based on the dataset, we present the extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML). It consists of the adversarial network and the attention module on the basis of unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem. Furthermore, the proposed fusion method outperforms baseline methods. In addition, we also evaluate the human verification ability on a subset of TALKIN-Family. It indicates that humans have higher accuracy when they have access to both faces and voices. The machine-learning methods could effectively and efficiently outperform the human ability. Finally, we include the future work and research opportunities with the TALKIN-Family dataset.
KW - Adversarial learning
KW - audio-visual
KW - benchmark dataset
KW - kinship verification
KW - multimodal fusion
UR - http://www.scopus.com/inward/record.url?scp=85144056155&partnerID=8YFLogxK
U2 - 10.1109/TCYB.2022.3220040
DO - 10.1109/TCYB.2022.3220040
M3 - 文章
C2 - 36417714
AN - SCOPUS:85144056155
SN - 2168-2267
VL - 54
SP - 1523
EP - 1536
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
IS - 3
ER -